Schema for AUGUSTUS - AUGUSTUS ab initio gene predictions v3.1
  Database: mm10    Primary Table: augustusGene    Row Count: 30,421   Data last updated: 2021-04-09
Format description: A gene prediction with some additional info.
On download server: MariaDB table dump directory
field | example | SQL type | info | description
bin | 76 | smallint(5) unsigned | range | Indexing field to speed chromosome range queries.
name | g1.t1 | varchar(255) | values | Name of gene (usually transcript_id from GTF)
chrom | chr1 | varchar(255) | values | Reference sequence chromosome or scaffold
strand | - | char(1) | values | + or - for strand
txStart | 3209012 | int(10) unsigned | range | Transcription start position (or end position for minus strand item)
txEnd | 3671345 | int(10) unsigned | range | Transcription end position (or start position for minus strand item)
cdsStart | 3216021 | int(10) unsigned | range | Coding region start (or end position for minus strand item)
cdsEnd | 3671318 | int(10) unsigned | range | Coding region end (or start position for minus strand item)
exonCount | 10 | int(10) unsigned | range | Number of exons
exonStarts | 3209012,3215916,3224177,331... | longblob |   | Exon start positions (or end positions for minus strand item)
exonEnds | 3209518,3216968,3224238,331... | longblob |   | Exon end positions (or start positions for minus strand item)
score | 0 | int(11) | range | score
name2 | g1 | varchar(255) | values | Alternate name (e.g. gene_id from GTF)
cdsStartStat | cmpl | enum('none', 'unk', 'incmpl', 'cmpl') | values | Status of CDS start annotation (none, unknown, incomplete, or complete)
cdsEndStat | cmpl | enum('none', 'unk', 'incmpl', 'cmpl') | values | Status of CDS end annotation (none, unknown, incomplete, or complete)
exonFrames | -1,1,0,2,1,1,1,1,2,0, | longblob |   | Reading frame of the start of the CDS region of the exon, in the direction of transcription (0,1,2), or -1 if there is no CDS region.
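
The schema above maps naturally onto a typed record. The following is a minimal sketch, assuming the table dump is a tab-separated text file with one row per line and columns in the schema order; the names GenePredRow, int_list and parse_row are hypothetical helpers, not part of any UCSC tool.

```python
from dataclasses import dataclass
from typing import List


def int_list(blob: str) -> List[int]:
    """Split a comma-separated list such as '1,0,' (note the trailing comma)."""
    return [int(x) for x in blob.split(",") if x]


@dataclass
class GenePredRow:
    bin: int
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]
    score: int
    name2: str
    cds_start_stat: str
    cds_end_stat: str
    exon_frames: List[int]


def parse_row(line: str) -> GenePredRow:
    """Parse one tab-separated row (assumed dump layout) into a record."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 16:
        raise ValueError(f"expected 16 fields, got {len(f)}")
    row = GenePredRow(
        int(f[0]), f[1], f[2], f[3],
        int(f[4]), int(f[5]), int(f[6]), int(f[7]), int(f[8]),
        int_list(f[9]), int_list(f[10]),
        int(f[11]), f[12], f[13], f[14], int_list(f[15]),
    )
    # The three per-exon lists must all agree with exonCount.
    lengths = {len(row.exon_starts), len(row.exon_ends), len(row.exon_frames)}
    if lengths != {row.exon_count}:
        raise ValueError(f"{row.name}: per-exon lists disagree with exonCount")
    return row


# Field values of transcript g3.t1, joined with tabs for illustration.
example = "\t".join([
    "619", "g3.t1", "chr1", "-", "4490927", "4493645", "4491715", "4493406",
    "2", "4490927,4493099,", "4492668,4493645,", "0", "g3", "cmpl", "cmpl",
    "1,0,",
])
row = parse_row(example)
print(row.name, row.strand, list(zip(row.exon_starts, row.exon_ends)))
```

Stripping the empty string left by the trailing comma is the only non-obvious step; the length check catches rows whose list fields were truncated for display.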

Sample Rows
 
bin | name | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts | exonEnds | score | name2 | cdsStartStat | cdsEndStat | exonFrames
76 | g1.t1 | chr1 | - | 3209012 | 3671345 | 3216021 | 3671318 | 10 | 3209012,3215916,3224177,3310844,3381235,3421701,3437310,3531954,3624054,3670626, | 3209518,3216968,3224238,3310941,3381335,3421878,3437394,3532152,3624128,3671345, | 0 | g1 | cmpl | cmpl | -1,1,0,2,1,1,1,1,2,0,
9 | g2.t1 | chr1 | - | 3999157 | 4354915 | 3999556 | 4352825 | 28 | 3999157,4007655,4019069,4024735,4041887,4092616,4120014,4142611,4147811,4148653,4163854,4170204,4206659,4226610,4228442,4230551, ... | 3999617,4007737,4019148,4024890,4042107,4092780,4120073,4142766,4147963,4148744,4163941,4170404,4206837,4226823,4228619,4230627, ... | 0 | g2 | cmpl | cmpl | 2,1,0,1,0,1,2,0,1,0,0,1,0,0,0,2,1,2,1,0,1,0,0,1,1,0,0,-1,
619 | g3.t1 | chr1 | - | 4490927 | 4493645 | 4491715 | 4493406 | 2 | 4490927,4493099, | 4492668,4493645, | 0 | g3 | cmpl | cmpl | 1,0,
619 | g3.t2 | chr1 | - | 4490927 | 4493445 | 4491715 | 4493406 | 2 | 4490927,4493099, | 4492668,4493445, | 0 | g3 | cmpl | cmpl | 1,0,
619 | g4.t1 | chr1 | + | 4493685 | 4538133 | 4497611 | 4536903 | 5 | 4493685,4497593,4505506,4524677,4536835, | 4493959,4497654,4505586,4525329,4538133, | 0 | g4 | cmpl | cmpl | -1,0,1,0,1,
77 | g5.t1 | chr1 | - | 4571162 | 4610475 | 4573121 | 4610386 | 6 | 4571162,4573113,4576074,4598112,4598908,4610278, | 4571468,4573150,4576262,4598241,4598958,4610475, | 0 | g5 | cmpl | cmpl | -1,1,2,2,0,0,
621 | g6.t1 | chr1 | + | 4771205 | 4772589 | 4771266 | 4772199 | 1 | 4771205, | 4772589, | 0 | g6 | cmpl | cmpl | 0,
621 | g7.t3 | chr1 | - | 4776224 | 4785725 | 4776463 | 4785677 | 5 | 4776224,4777524,4782567,4783950,4785572, | 4776801,4777648,4782733,4784105,4785725, | 0 | g7 | cmpl | cmpl | 1,0,2,0,0,
621 | g7.t1 | chr1 | - | 4776375 | 4785725 | 4776463 | 4785677 | 5 | 4776375,4777524,4782567,4783950,4785572, | 4776801,4777648,4782733,4784105,4785725, | 0 | g7 | cmpl | cmpl | 1,0,2,0,0,
621 | g7.t2 | chr1 | - | 4776375 | 4785775 | 4776463 | 4785677 | 5 | 4776375,4777524,4782567,4783950,4785572, | 4776801,4777648,4782733,4784105,4785775, | 0 | g7 | cmpl | cmpl | 1,0,2,0,0,
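
The bin values in the sample rows can be reproduced from txStart and txEnd. The sketch below assumes the standard (non-extended) UCSC binning scheme, in which the smallest bins span 128 kb and each higher level is eight times larger; the function name bin_from_range is chosen here for illustration.

```python
# Offsets of the first bin at each level, from the finest (128 kb) level
# to the coarsest (512 Mb) level: 512+64+8+1, 64+8+1, 8+1, 1, 0.
BIN_OFFSETS = (585, 73, 9, 1, 0)
FIRST_SHIFT = 17  # 2**17 = 128 kb, size of the finest bins
NEXT_SHIFT = 3    # each coarser level is 2**3 = 8 times larger


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the half-open range [start, end)."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError("range does not fit the standard scheme (max 512 Mb)")


# (name, txStart, txEnd, bin) taken from the sample rows.
samples = [
    ("g1.t1", 3209012, 3671345, 76),
    ("g2.t1", 3999157, 4354915, 9),
    ("g3.t1", 4490927, 4493645, 619),
    ("g5.t1", 4571162, 4610475, 77),
    ("g6.t1", 4771205, 4772589, 621),
]
for name, tx_start, tx_end, expected in samples:
    assert bin_from_range(tx_start, tx_end) == expected, name
```

A range query can then restrict the search to the handful of bins that may overlap the region before comparing txStart and txEnd, which is what makes the field useful as an index.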

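Note: all start coordinates in our database are 0-based, not 1-based. Intervals are stored half-open, so a feature covers positions start (inclusive) to end (exclusive) and its length is end - start.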
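
To illustrate the coordinate convention, here is a short sketch that converts a stored interval to a 1-based position string and computes the coding length of a transcript. It assumes the usual UCSC half-open convention (0-based start, exclusive end); the helper names are hypothetical, and the values come from sample rows g6.t1 and g3.t1.

```python
from typing import List


def to_one_based(chrom: str, start: int, end: int) -> str:
    """Format a 0-based, half-open interval as a 1-based, closed position."""
    return f"{chrom}:{start + 1}-{end}"


def cds_length(exon_starts: List[int], exon_ends: List[int],
               cds_start: int, cds_end: int) -> int:
    """Sum the bases of each exon that fall inside [cds_start, cds_end)."""
    total = 0
    for s, e in zip(exon_starts, exon_ends):
        lo, hi = max(s, cds_start), min(e, cds_end)
        if hi > lo:
            total += hi - lo
    return total


# g6.t1: a single exon on the plus strand.
print(to_one_based("chr1", 4771205, 4772589))  # chr1:4771206-4772589
g6_cds = cds_length([4771205], [4772589], 4771266, 4772199)

# g3.t1: two exons on the minus strand.
g3_cds = cds_length([4490927, 4493099], [4492668, 4493645], 4491715, 4493406)

# Both transcripts have complete CDS status, so a whole number of codons
# is expected.
assert g6_cds == 933 and g6_cds % 3 == 0
assert g3_cds == 1260 and g3_cds % 3 == 0
```

Because the intervals are half-open, lengths are plain differences with no +1 correction, and only the start needs adjusting for display.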

AUGUSTUS (augustusGene) Track Description
 

Description

This track shows ab initio predictions from the program AUGUSTUS (version 3.1). The predictions are based on the genome sequence alone.

For more information on the different gene tracks, see our Genes FAQ.

Methods

Statistical signal models were built for splice sites, branch-point patterns, translation start sites, and the poly-A signal. Furthermore, models were built for the sequence content of protein-coding and non-coding regions as well as for the length distributions of different exon and intron types. Detailed descriptions of most of these different models can be found in Mario Stanke's dissertation. This track shows the most likely gene structure according to a Semi-Markov Conditional Random Field model. Alternatively spliced transcripts were obtained with a sampling algorithm, using the options --alternatives-from-sampling=true --sample=100 --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=3 --temperature=2.

The different models used by AUGUSTUS were trained on a number of different species-specific gene sets, which included 1000-2000 training gene structures. The --species option selects the species whose trained models are used. Different training species were passed to --species when generating these predictions for different groups of assemblies.

Assembly Group | Training Species
Fish | zebrafish
Birds | chicken
Human and all other vertebrates | human
Nematodes | caenorhabditis
Drosophila | fly
A. mellifera | honeybee1
A. gambiae | culex
S. cerevisiae | saccharomyces

The table above lists the training species used for each group of assemblies. When available, the most closely related training species was used.
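
As a rough illustration of how the options above fit together, the sketch below assembles an argument list from the training-species table and the sampling options quoted in the Methods text. It only builds the list and does not run anything. The executable name augustus, the positional FASTA argument, and the file name chr1.fa are assumptions made for this example, not details taken from this page.

```python
from typing import List

# Assembly group -> value passed to --species (copied from the table above).
TRAINING_SPECIES = {
    "Fish": "zebrafish",
    "Birds": "chicken",
    "Human and all other vertebrates": "human",
    "Nematodes": "caenorhabditis",
    "Drosophila": "fly",
    "A. mellifera": "honeybee1",
    "A. gambiae": "culex",
    "S. cerevisiae": "saccharomyces",
}

# Sampling options quoted in the Methods section.
SAMPLING_OPTIONS = [
    "--alternatives-from-sampling=true",
    "--sample=100",
    "--minexonintronprob=0.2",
    "--minmeanexonintronprob=0.5",
    "--maxtracks=3",
    "--temperature=2",
]


def augustus_command(assembly_group: str, fasta_path: str) -> List[str]:
    """Build an argument list; the invocation layout is an assumption."""
    species = TRAINING_SPECIES[assembly_group]
    return ["augustus", f"--species={species}", *SAMPLING_OPTIONS, fasta_path]


# mm10 is a mouse assembly, so by the table it falls under
# "Human and all other vertebrates". chr1.fa is a hypothetical input file.
cmd = augustus_command("Human and all other vertebrates", "chr1.fa")
print(" ".join(cmd))
```

Keeping the table as a dictionary makes the species choice explicit per assembly group, and an unknown group fails with a KeyError instead of silently falling back to a default.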

Credits

Thanks to the Stanke lab for providing the AUGUSTUS program. The training for the chicken version was done by Stefanie König and the training for the human and zebrafish versions was done by Mario Stanke.

References

Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008 Mar 1;24(5):637-44. PMID: 18218656

Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. PMID: 14534192