Prediction Archive AceView Genes Track Settings
 
AceView Gene Models With Alt-Splicing

Track collection: Gene Prediction Archive

Assembly: Human Feb. 2009 (GRCh37/hg19)
Data last updated at UCSC: 2011-03-22

Description

This track shows AceView gene models constructed from cDNA and genomic evidence by Danielle and Jean Thierry-Mieg using the Acembly program.

AceView is the only database that defines genes genome-wide using only, but exhaustively, the public experimental cDNA sequences from the same species. The analysis relies on the quality of the genome sequence and exploits sophisticated cDNA-to-genome co-alignment algorithms to provide a comprehensive and non-redundant representation of the GenBank, dbEST, GSS, Trace and RefSeq cDNA sequences. In a way, the AceView transcripts represent a fully annotated non-redundant ‘nr’ view of the public RNAs, minus cloning artefacts, contaminations and bad quality sequences. AceView transcripts represent a tenfold compaction relative to the raw data, with minimal loss of sequence information.

87% of the public RNA sequences are coalesced into AceView alternative transcripts and genes, thereby identifying close to twice as many main genes as there are "known genes" in both human and mouse. 18% to 25% of the spliced genes appear non-coding, in mouse and human respectively. Alternative transcripts are prominent in both species. The typical human gene produces on average eight distinct alternatively spliced forms from three promoters and with three non-overlapping terminal exons. It has on average three cassette exons and four internal donor or acceptor sites. The AceView site further proposes a thorough biological annotation of the reconstructed genes, including association to diseases and tissue specificity of the alternative transcripts.

AceView combines respect for the experimental data with extensive quality control. Evaluated in the ENCODE regions, AceView transcripts are close to indistinguishable from the manually curated Gencode reference genes (see Thierry-Mieg, 2006, or compare the two tracks in the Genome Browser), but over the entire genome the number of transcripts exceeds Havana/Vega by a factor of three and RefSeq by a factor of six.

Display Conventions and Configuration

This track follows the display conventions for gene tracks. Gene models that fall into the "main" class are displayed in purple; "putative" genes are displayed in pink.

The main genes include at least one transcript that is spliced or putatively protein-coding. Spliced genes contain at least one well-defined standard intron, i.e., an intron with a GT-AG or GC-AG boundary that is supported by at least one clone matching the genome exactly, with no ambiguous bases, over 8 bases on each side of the intron.

The putative genes have no standard intron and do not encode good proteins, yet are supported by more than six cDNA clones.
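The class rules above can be illustrated with a short Python sketch. This is not the Acembly implementation: the function and field names are hypothetical, the exact-match clone support for an intron is reduced to a boolean flag, and the clone threshold is left as a parameter because this page states it both as "more than six" clones and as "six or more" accessions.

```python
from dataclasses import dataclass

# Standard intron boundaries named on this page: GT-AG and GC-AG.
STANDARD_INTRON_BOUNDARIES = {("GT", "AG"), ("GC", "AG")}


def is_standard_intron(intron_seq: str) -> bool:
    """Return True if the intron starts with GT or GC and ends with AG."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        return False
    return (seq[:2], seq[-2:]) in STANDARD_INTRON_BOUNDARIES


@dataclass
class GeneEvidence:
    # Hypothetical summary of the evidence for one gene.
    has_supported_standard_intron: bool  # at least one clone-supported standard intron
    has_good_protein: bool               # at least one putatively protein-coding transcript
    supporting_clones: int               # number of cDNA clones supporting the gene


def classify_gene(ev: GeneEvidence, min_putative_clones: int = 6) -> str:
    """Assign 'main', 'putative' or 'cloud' (the last is not displayed in the track)."""
    if ev.has_supported_standard_intron or ev.has_good_protein:
        return "main"
    if ev.supporting_clones >= min_putative_clones:
        return "putative"
    return "cloud"
```

For example, classify_gene(GeneEvidence(False, False, 7)) returns "putative", while any gene with a supported standard intron is classified as "main" regardless of clone count.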

The track description page offers the following filter and configuration options:

  • Gene Class filter: Select the main or putative option to filter the display.
  • Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison to gene predictions. Click the Codon coloring help link on the track description page for more information about this feature.

Methods

The millions of cDNA sequences available from the public databases (GenBank, dbEST, GSS, Traces, etc.) are aligned cooperatively on the genome sequence, taking care to keep the paired 5' and 3' reads from single clones associated in the same transcript. Useful information about tissue, stage, publications, isolation procedure and so on is gathered.

AceView alignments on the genome use knowledge of sequencing errors gained from analyzing sequencing traces and cooperative refinements. They usually cover the entire length of the EST or mRNA (on average 98.8% aligned with 0.2% mismatches for mRNAs, and 95.5% aligned with 1.4% mismatches for ESTs).

Multiple alignments are evaluated and each sequence is stringently kept only in its best position genome-wide. Fewer than 1% of the mRNAs and fewer than 2% of the ESTs are ultimately aligned in more than one gene, usually among the ~1% of genes that are closely repeated.
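As a rough illustration of keeping each sequence only in its best position, the Python sketch below selects the top-scoring placement per accession. The scoring used here (highest aligned fraction, then lowest mismatch fraction) is an assumption made for the example, since this page does not state the actual criteria; the record fields and the function name are hypothetical. Exact ties are retained, as an assumed way of modelling the small fraction of sequences that remain aligned in more than one gene.

```python
def keep_best_placements(alignments):
    """Keep, for each accession, only its best-scoring alignment(s).

    Each alignment is a dict with hypothetical keys 'accession',
    'aligned_fraction' and 'mismatch_fraction'. The score is an
    assumed stand-in for the real evaluation.
    """
    best = {}
    for aln in alignments:
        acc = aln["accession"]
        # Higher aligned fraction wins; fewer mismatches break ties.
        score = (aln["aligned_fraction"], -aln["mismatch_fraction"])
        if acc not in best or score > best[acc][0]:
            best[acc] = (score, [aln])
        elif score == best[acc][0]:
            # Equally good placements (e.g. closely repeated genes) are all kept.
            best[acc][1].append(aln)
    return {acc: alns for acc, (_, alns) in best.items()}
```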

The cDNA sequences are then processed and cleaned: the vectors and polyA are clipped, the reads submitted on the wrong strand are flipped, and the small insertion or deletion polymorphisms are identified.

Possible cDNA clone rearrangements or anomalous alignments are flagged and filtered, in a manner akin to manual curation, so as not to lose unique valuable information while avoiding pollution of the database with poorly supported anomalous data.

Unfortunately, cDNA libraries are still far from saturation, so after 20% of the suspicious entries have been removed, a single good-quality cDNA sequence, aligned with standard introns on the genome, is considered sufficient evidence for a given mRNA structure. That is because cDNA sequences are difficult to obtain, but they remain the cleanest and most reliable information to best define the molecular genes. Unspliced non-coding genes are however reported (in the putative class) only if they are supported by six or more accessions. Others belong to what is termed ‘the cloud’ (not displayed on the UCSC Genome Browser).

The cDNA sequences are clustered into a minimal number of alternative transcript variants, preferring partial transcripts to artificially extended ones. Sequences are concatenated by simple contact, but the combinatorics are avoided by allowing each cDNA accession to contribute to a single alternative variant, preferably one where it merges silently without bringing any new sequence information. As a result, for instance, all shorter reads compatible with a full-length mRNA will be absorbed in that transcript and will not be available to allow for extensions on other incompatible transcripts.

About 70% of the variants, clearly identified on the Acembly site, have their entire coding region supported by a single cDNA; the others may be illicit concatenations that could be split when more data become available.

For each transcript, the consensus sequence of the cDNAs most compatible with the genome sequence is generated. Single-base insertions, deletions, transitions and transversions are shown graphically in the mRNA view, where frequent SNPs become evident.

The main sequence of the transcript used in the annotation is that of the footprint of the transcript on the genome, which is of better quality than the mRNAs: this procedure corrects up to 2% sequencing errors.

Putative protein-coding regions are predicted from the mRNA sequence and annotated using BlastP, PFAM, Psort2, and comparison to AceView proteins from other species. Best proteins are scored (see the FAQ on the Acembly site) and transcripts are proposed to be protein-coding or non-coding.

Expression, cDNA support, tissue specificity, sequences of alternative transcripts, introns and exons, alternative promoters, alternative exons and alternative polyadenylation sites are evaluated and annotated on the Acembly web site.

The reconstructed alternative transcripts are then grouped into genes if they share at least one exact intron boundary or if they have substantial sequence overlap.
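The grouping rule can be sketched in Python as a union-find over transcripts. This is an illustration only, not the Acembly code: the data layout and function names are hypothetical, "substantial sequence overlap" is approximated by an assumed exonic-overlap threshold (min_overlap_bp), and restricting comparisons to the same chromosome and strand is likewise an assumption.

```python
from itertools import combinations


def intron_boundaries(exons):
    """Return the set of intron start and end coordinates implied by the exons."""
    exons = sorted(exons)
    bounds = set()
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        bounds.add(("start", left_end))
        bounds.add(("end", right_start))
    return bounds


def exonic_overlap(exons_a, exons_b):
    """Total number of bases shared between two exon lists (half-open intervals)."""
    total = 0
    for a_start, a_end in exons_a:
        for b_start, b_end in exons_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


def group_into_genes(transcripts, min_overlap_bp=100):
    """Group transcripts that share an intron boundary or overlap substantially.

    `transcripts` maps a name to {'chrom', 'strand', 'exons'}; the threshold
    `min_overlap_bp` is an assumed value, not one given on this page.
    """
    parent = {name: name for name in transcripts}

    def find(x):
        # Path-halving lookup of the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(transcripts, 2):
        ta, tb = transcripts[a], transcripts[b]
        if (ta["chrom"], ta["strand"]) != (tb["chrom"], tb["strand"]):
            continue
        shared = intron_boundaries(ta["exons"]) & intron_boundaries(tb["exons"])
        if shared or exonic_overlap(ta["exons"], tb["exons"]) >= min_overlap_bp:
            parent[find(a)] = find(b)

    genes = {}
    for name in transcripts:
        genes.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in genes.values())
```

Because grouping is transitive in this sketch, two transcripts with no direct link still end up in the same gene when a third transcript connects them.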

Coding and non-coding genes are defined, and genes in antisense are flagged.

AceView genes are matched molecularly to Entrez genes and named according to the official nomenclature or the Entrez Gene nomenclature. For novel genes not in Entrez, AceView creates new gene names that are maintained from release to release until the genes receive an official or Entrez gene name.

Each gene is annotated in depth, with the intention of AceView serving as a one-stop knowledgebase for systems biology. Selected functional annotations are gathered from various sources, including expression data, protein interactions and GO annotations. In particular, possible disease associations are extracted directly from PubMed, in addition to OMIM and GAD, and the users can help refine those annotations.

Finally, lists of the most closely related genes by function, pathway, protein complex, GO annotation, disease, cellular localization or all criteria taken together are proposed, to stimulate research and development.

Click the "AceView Gene Summary" link on an individual transcript's details page to access the gene on the NCBI AceView website.

Credits

Thanks to Danielle and Jean Thierry-Mieg at NCBI for providing this track for human, worm and mouse.

References

Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006;7 Suppl 1:S12.1-14. PMID: 16925834; PMC: PMC1810549

AceView web site: https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly