Schema for All GENCODE VM16 - All GENCODE annotations from VM16 (Ensembl 91)
  Database: mm10    Primary Table: wgEncodeGencodeCompVM16    Row Count: 120,780   Data last updated: 2017-12-17
Format description: A gene prediction with some additional info.
On download server: MariaDB table dump directory
field         example               SQL type                               info    description
bin           608                   smallint(5) unsigned                   range   Indexing field to speed chromosome range queries.
name          ENSMUST00000193812.1  varchar(255)                           values  Name of gene (usually transcript_id from GTF)
chrom         chr1                  varchar(255)                           values  Reference sequence chromosome or scaffold
strand        +                     char(1)                                values  + or - for strand
txStart       3073252               int(10) unsigned                       range   Transcription start position (or end position for minus strand item)
txEnd         3074322               int(10) unsigned                       range   Transcription end position (or start position for minus strand item)
cdsStart      3073252               int(10) unsigned                       range   Coding region start (or end position for minus strand item)
cdsEnd        3073252               int(10) unsigned                       range   Coding region end (or start position for minus strand item)
exonCount     1                     int(10) unsigned                       range   Number of exons
exonStarts    3073252,              longblob                                       Exon start positions (or end positions for minus strand item)
exonEnds      3074322,              longblob                                       Exon end positions (or start positions for minus strand item)
score         0                     int(11)                                range   score
name2         4933401J01Rik         varchar(255)                           values  Alternate name (e.g. gene_id from GTF)
cdsStartStat  none                  enum('none', 'unk', 'incmpl', 'cmpl')  values  Status of CDS start annotation (none, unknown, incomplete, or complete)
cdsEndStat    none                  enum('none', 'unk', 'incmpl', 'cmpl')  values  Status of CDS end annotation (none, unknown, incomplete, or complete)
exonFrames    -1,                   longblob                                       Reading frame of the start of the CDS region of the exon, in the direction of transcription (0,1,2), or -1 if there is no CDS region.
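
The bin field can be recomputed from txStart and txEnd. The following is a minimal sketch that assumes the standard UCSC binning scheme (five levels, finest bins of 128 kb, positions below 512 Mb). The constants and the function name bin_from_range are assumptions for illustration and are not stated on this page; the input coordinates come from the sample rows of this table.

```python
# Sketch of the standard UCSC binning scheme (assumed; valid for positions below 512 Mb).
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17  # finest bins cover 2**17 bases (128 kb)
BIN_NEXT_SHIFT = 3    # each coarser level covers 8 times more sequence


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the 0-based, half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        # Move up one level: bins are 8 times wider
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} is outside the standard binning scheme")


# txStart/txEnd pairs taken from the sample rows of this table
print(bin_from_range(3073252, 3074322))  # 608
print(bin_from_range(3214481, 3671498))  # 76
```

A range query would typically restrict on bin first and then on txStart and txEnd, so that the index on bin narrows the candidate rows.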

Connected Tables and Joining Fields
        mm10.wgEncodeGencodeAttrsVM16.transcriptId (via wgEncodeGencodeCompVM16.name)
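
As an illustration of this join, the sketch below builds cut-down stand-ins for both tables in an in-memory SQLite database and joins them on name = transcriptId. Only the join key is taken from this page. The geneName column of the attributes table and the reduced column sets are assumptions, and the attribute rows are placeholders that reuse the name2 values from the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Cut-down stand-ins: only a few columns of each table are modeled
    CREATE TABLE wgEncodeGencodeCompVM16 (
        name TEXT, chrom TEXT, strand TEXT, txStart INTEGER, txEnd INTEGER
    );
    CREATE TABLE wgEncodeGencodeAttrsVM16 (
        transcriptId TEXT, geneName TEXT  -- geneName is an assumed column
    );
    """
)
conn.executemany(
    "INSERT INTO wgEncodeGencodeCompVM16 VALUES (?, ?, ?, ?, ?)",
    [
        ("ENSMUST00000193812.1", "chr1", "+", 3073252, 3074322),
        ("ENSMUST00000070533.4", "chr1", "-", 3214481, 3671498),
    ],
)
conn.executemany(
    "INSERT INTO wgEncodeGencodeAttrsVM16 VALUES (?, ?)",
    [
        # Placeholder attribute rows reusing name2 values from the sample rows
        ("ENSMUST00000193812.1", "4933401J01Rik"),
        ("ENSMUST00000070533.4", "Xkr4"),
    ],
)

rows = conn.execute(
    """
    SELECT c.name, c.chrom, c.txStart, c.txEnd, a.geneName
    FROM wgEncodeGencodeCompVM16 AS c
    JOIN wgEncodeGencodeAttrsVM16 AS a ON a.transcriptId = c.name
    ORDER BY c.txStart
    """
).fetchall()

for row in rows:
    print(row)
```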

Sample Rows
 
bin  name                  chrom  strand  txStart  txEnd    cdsStart  cdsEnd   exonCount  exonStarts                exonEnds                  score  name2          cdsStartStat  cdsEndStat  exonFrames
608  ENSMUST00000193812.1  chr1   +       3073252  3074322  3073252   3073252  1          3073252,                  3074322,                  0      4933401J01Rik  none          none        -1,
608  ENSMUST00000082908.1  chr1   +       3102015  3102125  3102015   3102015  1          3102015,                  3102125,                  0      Gm26206        none          none        -1,
609  ENSMUST00000162897.1  chr1   -       3205900  3216344  3205900   3205900  2          3205900,3213608,          3207317,3216344,          0      Xkr4           none          none        -1,-1,
609  ENSMUST00000159265.1  chr1   -       3206522  3215632  3206522   3206522  2          3206522,3213438,          3207317,3215632,          0      Xkr4           none          none        -1,-1,
76   ENSMUST00000070533.4  chr1   -       3214481  3671498  3216021   3671348  3          3214481,3421701,3670551,  3216968,3421901,3671498,  0      Xkr4           cmpl          cmpl        1,2,0,
610  ENSMUST00000195335.1  chr1   -       3365730  3368549  3365730   3365730  1          3365730,                  3368549,                  0      Gm37180        none          none        -1,
610  ENSMUST00000192336.1  chr1   -       3375555  3377788  3375555   3375555  1          3375555,                  3377788,                  0      Gm37363        none          none        -1,
611  ENSMUST00000194099.1  chr1   -       3464976  3467285  3464976   3464976  1          3464976,                  3467285,                  0      Gm37686        none          none        -1,
611  ENSMUST00000161581.1  chr1   +       3466586  3513553  3466586   3466586  2          3466586,3513404,          3466687,3513553,          0      Gm1992         none          none        -1,-1,
611  ENSMUST00000192973.1  chr1   -       3512450  3514507  3512450   3512450  1          3512450,                  3514507,                  0      Gm37329        none          none        -1,

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
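
The 0-based start convention matters when exporting rows to 1-based formats such as GTF. The sketch below parses one tab-separated row of this table (the Xkr4 sample row) and converts its exons to 1-based, fully closed intervals. The helper names are hypothetical, and treating cdsStart < cdsEnd as "has a CDS" is an assumption based on the sample rows, where non-coding items have cdsStart equal to cdsEnd.

```python
COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]
INT_FIELDS = ("bin", "txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount", "score")
LIST_FIELDS = ("exonStarts", "exonEnds", "exonFrames")


def parse_int_list(text: str) -> list[int]:
    """Parse a comma-separated list with a trailing comma, e.g. '1,2,0,'."""
    return [int(item) for item in text.split(",") if item]


def parse_gene_pred(line: str) -> dict:
    """Parse one tab-separated row of the table into a dict with typed values."""
    record = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    for key in LIST_FIELDS:
        record[key] = parse_int_list(record[key])
    return record


def exons_one_based(record: dict) -> list[tuple[int, int]]:
    """Convert 0-based half-open exons to 1-based closed intervals (start + 1, end unchanged)."""
    return [(start + 1, end) for start, end in zip(record["exonStarts"], record["exonEnds"])]


# Sample row for ENSMUST00000070533.4 (Xkr4) from this page
line = "\t".join([
    "76", "ENSMUST00000070533.4", "chr1", "-", "3214481", "3671498", "3216021", "3671348",
    "3", "3214481,3421701,3670551,", "3216968,3421901,3671498,", "0", "Xkr4",
    "cmpl", "cmpl", "1,2,0,",
])
record = parse_gene_pred(line)
has_cds = record["cdsStart"] < record["cdsEnd"]  # assumption: equal values mean no CDS
exon_lengths = [end - start for start, end in zip(record["exonStarts"], record["exonEnds"])]

print(exons_one_based(record))
print(exon_lengths, has_cds)
```

With 0-based half-open coordinates, an exon length is simply end minus start, with no off-by-one correction.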

All GENCODE VM16 (wgEncodeGencodeVM16) Track Description
 

Description

The GENCODE Genes track (version M16, Dec 2017) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome generated by the GENCODE project. The GENCODE gene set presents a full merge between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline. Priority is given to the manually curated HAVANA annotation, with predicted Ensembl annotations used when there are no corresponding manual annotations. The M16 annotation was carried out on genome assembly GRCm38 (mm10).

The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release.

Display Conventions and Configuration

This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide.

Views available on this track are:
Genes
The gene annotations in this view are divided into three subtracks:
  • GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section.
  • GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set.
  • GENCODE Pseudogenes include all pseudogene annotations except polymorphic pseudogenes.
PolyA
  • GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome.

A "Maximum number of transcripts to display" setting is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. Starting with the GENCODE human V42 and mouse VM31 releases, transcripts are assigned a rank within the gene. The ranks may be used to limit the number of transcripts displayed in a principled manner. Transcript ranking is not available in the lift37 releases. See Methods for details of rank assignment.

Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria:

  • Transcript class: filter by the basic biological function of a transcript annotation
    • All - don't filter by transcript class
    • coding - display protein coding transcripts, including polymorphic pseudogenes
    • nonCoding - display non-protein coding transcripts
    • pseudo - display pseudogene transcript annotations
    • problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain)
  • Transcript Annotation Method: filter by the method used to create the annotation
    • All - don't filter by annotation method
    • manual - display manually created annotations, including those that are also created automatically
    • automatic - display automatically created annotations, including those that are also created manually
    • manual_only - display manually created annotations that were not annotated by the automatic method
    • automatic_only - display automatically created annotations that were not annotated by the manual method
  • Transcript Biotype: filter transcripts by Biotype
  • Support Level: filter transcripts by transcription support level

Coloring for the gene annotations is based on the annotation type:

  • coding
  • non-coding
  • pseudogene
  • problem
  • all polyA annotations

Methods

The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006).

GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus.

  • Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus:
    • All full-length coding transcripts (except problem transcripts or transcripts that are nonsense-mediated decay) were included in the basic set.
    • If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts).
  • Criteria for selection of non-coding transcripts at a given locus:
    • All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set.
    • If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts).
  • If no transcripts were included by either of the above criteria, the longest problem transcript is included.

Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria:

  • well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA
  • poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping
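
The Basic Set selection rules and the well characterized biotype list above can be sketched as a small function applied to the transcripts of one gene locus. This is an illustrative reading of the criteria, not the GENCODE implementation. The Tx record and its field names are hypothetical, and two points the text leaves open are resolved by assumption: the fallback coding pick is made among non-problem transcripts that are not full length, and "largest" non-coding or problem transcript is taken to mean greatest length.

```python
from dataclasses import dataclass

# Biotypes listed as "well characterized" in this section
WELL_CHARACTERIZED = {"antisense", "Mt_rRNA", "Mt_tRNA", "miRNA", "rRNA", "snRNA", "snoRNA"}


@dataclass(frozen=True)
class Tx:
    """Hypothetical per-transcript summary used only for this sketch."""
    tx_id: str
    coding: bool        # coding transcript (including polymorphic pseudogenes)
    full_length: bool
    problem: bool = False
    nmd: bool = False   # nonsense-mediated decay
    biotype: str = ""
    cds_len: int = 0
    length: int = 0


def select_basic(txs: list[Tx]) -> list[Tx]:
    """Select the basic-set transcripts for one gene locus."""
    selected: list[Tx] = []

    # Coding transcripts: all full-length, non-NMD, non-problem ones, else the largest partial CDS
    coding = [t for t in txs if t.coding and not t.problem]
    full_coding = [t for t in coding if t.full_length and not t.nmd]
    partial_coding = [t for t in coding if not t.full_length]
    if full_coding:
        selected.extend(full_coding)
    elif partial_coding:
        selected.append(max(partial_coding, key=lambda t: t.cds_len))

    # Non-coding transcripts: full-length with a well characterized biotype, else the largest
    noncoding = [t for t in txs if not t.coding and not t.problem]
    full_nc = [t for t in noncoding if t.full_length and t.biotype in WELL_CHARACTERIZED]
    if full_nc:
        selected.extend(full_nc)
    elif noncoding:
        selected.append(max(noncoding, key=lambda t: t.length))

    # Last resort: the longest problem transcript
    if not selected:
        problems = [t for t in txs if t.problem]
        if problems:
            selected.append(max(problems, key=lambda t: t.length))
    return selected


# Hypothetical locus: one full-length coding, one NMD, one partial, one lincRNA transcript
locus = [
    Tx("t1", coding=True, full_length=True, cds_len=900, length=1500),
    Tx("t2", coding=True, full_length=True, nmd=True, cds_len=300, length=1200),
    Tx("t3", coding=True, full_length=False, cds_len=600, length=800),
    Tx("t4", coding=False, full_length=True, biotype="lincRNA", length=700),
]
print([t.tx_id for t in select_basic(locus)])  # ['t1', 't4']
```

Note that the coding and non-coding selections run independently, as the text states, so a locus can contribute transcripts from both groups.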

Transcript ranking: Within each gene, transcripts have been ranked according to the following criteria. The ranking approach is preliminary and will change in future releases.

  • Protein_coding genes
    1. MANE or Ensembl canonical
      -1st: MANE Select / Ensembl canonical
      -2nd: MANE Plus Clinical
    2. Coding biotypes
      -1st: protein_coding and protein_coding_LoF
      -2nd: NMDs and NSDs
      -3rd: retained intron and protein_coding_CDS_not_defined
    3. Completeness
      -1st: full length
      -2nd: CDS start/end not found
    4. CARS score (only for coding transcripts)
    5. Transcript genomic span and length (only for non-coding transcripts)
  • Non-coding genes
    1. Transcript biotype
      -1st: transcript biotype identical to gene biotype
    2. Ensembl canonical
    3. GENCODE basic
    4. Transcript genomic span
    5. Transcript length

Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl.

The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support for a GENCODE transcript annotation actually being expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl and BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments.

Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ.

The following categories are assigned to each of the evaluated annotations:

  • tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA
  • tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
  • tsl3 - the only support is from a single EST
  • tsl4 - the best supporting EST is flagged as suspect
  • tsl5 - no single transcript supports the model structure
  • tslNA - the transcript was not analyzed for one of the following reasons:
    • pseudogene annotation, including transcribed pseudogenes
    • immunoglobulin gene transcript
    • T-cell receptor transcript
    • single-exon transcript (will be included in a future version)
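
One way to read the category list above is as an ordered decision rule. The sketch below is an interpretation, not the GENCODE pipeline. It assumes the evidence has already been reduced to alignments whose introns all match the annotation exactly, that each alignment is described only by its kind and a suspect flag, and that "multiple ESTs" means two or more non-suspect ESTs. The Evidence class and assign_tsl function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical supporting alignment; assumed to match every intron of the annotation."""
    kind: str      # "mRNA" or "EST"
    suspect: bool  # flagged as suspect by one of the maintained lists


def assign_tsl(evidence: list[Evidence], analyzed: bool = True) -> str:
    """Assign a transcription support level from pre-filtered supporting evidence."""
    if not analyzed:
        # Pseudogene, immunoglobulin, T-cell receptor, or single-exon transcript
        return "tslNA"
    mrnas = [e for e in evidence if e.kind == "mRNA"]
    ests = [e for e in evidence if e.kind == "EST"]
    good_ests = [e for e in ests if not e.suspect]

    if any(not e.suspect for e in mrnas):
        return "tsl1"  # at least one non-suspect mRNA
    if mrnas or len(good_ests) >= 2:
        return "tsl2"  # only suspect mRNAs, or support from multiple ESTs
    if len(good_ests) == 1:
        return "tsl3"  # a single EST
    if ests:
        return "tsl4"  # only suspect ESTs
    return "tsl5"      # no supporting transcript


print(assign_tsl([Evidence("mRNA", False), Evidence("EST", True)]))  # tsl1
print(assign_tsl([Evidence("EST", False)]))                          # tsl3
print(assign_tsl([], analyzed=False))                                # tslNA
```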

APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable.

  • PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the core modules in the APPRIS.
  • PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
  • PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
  • PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
  • PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.

For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way:

  • ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at least three tested species.
  • ALTERNATIVE:2 - Candidate transcript(s) models that appear to be conserved in fewer than three tested species.

Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website.

Downloads

GENCODE GFF3 and GTF files are available from the GENCODE release M16 site.

Release Notes

GENCODE version M16 corresponds to Ensembl 91.

See also: The GENCODE Project

Credits

The GENCODE project is an international collaboration funded by NIH/NHGRI grant U41HG007234. More information is available at www.gencodegenes.org. Participating GENCODE institutions and personnel can be found here.

References

Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I et al. GENCODE 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D916-D923. PMID: 33270111; PMC: PMC7778937; DOI: 10.1093/nar/gkaa1087

A full list of GENCODE publications is available at The GENCODE Project web site.

Data Release Policy

GENCODE data are available for use without restrictions.