Schema for RetroGenes V9 - Retroposed Genes V9, Including Pseudogenes
  Database: hg38    Primary Table: ucscRetroAli9    Row Count: 15,491   Data last updated: 2015-01-16
Format description: Summary info about a patSpace alignment
field        example        SQL type               info    description
bin          585            smallint(5) unsigned   range   Indexing field to speed chromosome range queries.
matches      2032           int(10) unsigned       range   Number of bases that match that aren't repeats
misMatches   25             int(10) unsigned       range   Number of bases that don't match
repMatches   0              int(10) unsigned       range   Number of bases that match but are part of repeats
nCount       0              int(10) unsigned       range   Number of 'N' bases
qNumInsert   0              int(10) unsigned       range   Number of inserts in query
qBaseInsert  0              int(10) unsigned       range   Number of bases inserted in query
tNumInsert   0              int(10) unsigned       range   Number of inserts in target
tBaseInsert  0              int(10) unsigned       range   Number of bases inserted in target
strand       -              char(2)                values  + or - for strand. First character query, second target (optional)
qName        AK021903.1-1   varchar(255)           values  Query sequence name
qSize        2057           int(10) unsigned       range   Query sequence size
qStart       0              int(10) unsigned       range   Alignment start position in query
qEnd         2057           int(10) unsigned       range   Alignment end position in query
tName        chr1           varchar(255)           values  Target sequence name
tSize        248956422      int(10) unsigned       range   Target sequence size
tStart       87548          int(10) unsigned       range   Alignment start position in target
tEnd         89605          int(10) unsigned       range   Alignment end position in target
blockCount   1              int(10) unsigned       range   Number of blocks in alignment
blockSizes   2057,          longblob                       Size of each block
qStarts      0,             longblob                       Start of each block in query.
tStarts      87548,         longblob                       Start of each block in target.
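
The bin field can be recomputed from tStart and tEnd. The sketch below assumes the standard UCSC binning scheme (five levels, finest bins 128 kb wide, each coarser level 8 times wider); the function and constant names are illustrative and this is not the Browser's own code.

```python
# Offset of the first bin at each level, from finest (128 kb) to coarsest (512 Mb).
BIN_OFFSETS = (585, 73, 9, 1, 0)
FIRST_SHIFT = 17  # 2**17 = 128 kb, the width of the finest bins
NEXT_SHIFT = 3    # each coarser level is 2**3 = 8 times wider


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the 0-based, half-open range."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError(f"range {start}-{end} is outside the standard binning scheme")


print(bin_from_range(87548, 89605))    # 585, the example row in the schema
print(bin_from_range(131067, 134836))  # 73, a range that crosses a 128 kb boundary
```

A range query can then restrict the search to the handful of bins that could overlap the region before comparing tStart and tEnd.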

Connected Tables and Joining Fields
      hg38.ucscRetroCds9.id (via ucscRetroAli9.qName)
      hg38.ucscRetroInfo9.name (via ucscRetroAli9.qName)
      hg38.ucscRetroOrtho9.name (via ucscRetroAli9.qName)
      hg38.ucscRetroSeq9.acc (via ucscRetroAli9.qName)
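
As an illustration of the joining fields, the sketch below builds toy stand-ins for two of these tables in an in-memory SQLite database and joins them on ucscRetroAli9.qName = ucscRetroInfo9.name. Only the joining fields and a few alignment columns are modeled; which names appear in the toy ucscRetroInfo9 table is an assumption made for the example, and the real tables live in the hg38 MySQL database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ucscRetroAli9 (qName TEXT, tName TEXT, tStart INTEGER, tEnd INTEGER);
    CREATE TABLE ucscRetroInfo9 (name TEXT);
""")

# Two alignment rows, reduced to four of the 22 schema columns.
con.executemany(
    "INSERT INTO ucscRetroAli9 VALUES (?, ?, ?, ?)",
    [
        ("AK021903.1-1", "chr1", 87548, 89605),
        ("NM_015125.3-1", "chr1", 131067, 134836),
    ],
)
# Toy stand-in holding only the joining field (assumed content).
con.executemany("INSERT INTO ucscRetroInfo9 VALUES (?)", [("AK021903.1-1",)])

rows = con.execute("""
    SELECT a.qName, a.tName, a.tStart, a.tEnd
    FROM ucscRetroAli9 AS a
    JOIN ucscRetroInfo9 AS i ON i.name = a.qName
""").fetchall()
print(rows)  # [('AK021903.1-1', 'chr1', 87548, 89605)]
```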

Sample Rows
 
bin  matches  misMatches  repMatches  nCount  qNumInsert  qBaseInsert  tNumInsert  tBaseInsert  strand  qName          qSize  qStart  qEnd  tName  tSize      tStart  tEnd    blockCount  blockSizes  qStarts  tStarts
585  2032     25          0           0       0           0            0           0            -       AK021903.1-1   2057   0       2057  chr1   248956422  87548   89605   1           2057,  0,  87548,
73   2588     471         0           0       0           0            0           0            +       NM_015125.3-1  5473   0       5471  chr1   248956422  131067  134836  47          32,28,87,12,246,220,35,95,33,67,112,59,176,7,93,79,92,61,27,112,149,233,21,47,56,7,9,7,7,8,12,11,28,12,6,36,66,53,45,52,22,11,9, ...  0,33,70,160,188,435,655,690,788,847,914,1034,1094,1278,1286,1380,1461,1554,1617,1644,1757,4198,4432,4454,4502,4558,4565,4574,458 ...  131067,131099,131127,131214,131226,131472,131693,131733,131828,131861,131929,132041,132100,132276,132283,132376,132455,132547,13 ...
586  266      32          0           0       0           0            0           0            -       AK128836.1-3   4028   837     3106  chr1   248956422  135466  135765  3           33,136,129,  922,2926,3062,  135466,135499,135636,
586  340      81          0           0       0           0            0           0            -       BC038571.1-3   1154   411     1006  chr1   248956422  158251  158676  11          119,13,12,19,10,17,12,16,17,142,44,  148,281,294,311,334,366,384,396,451,478,699,  158251,158370,158386,158398,158417,158427,158444,158457,158473,158490,158632,
586  480      43          0           0       0           0            0           0            -       AK130965.1-1   538    2       538   chr1   248956422  258511  259036  6           23,47,48,290,86,29,  0,29,77,128,418,507,  258511,258534,258581,258629,258921,259007,
587  626      24          0           0       0           0            0           0            -       AK021903.1-3   2057   0       1614  chr1   248956422  265765  266415  2           198,452,  443,1605,  265765,265963,
587  435      96          0           0       0           0            0           0            -       AK130965.1-2   538    2       538   chr1   248956422  347908  348441  5           76,188,152,86,29,  0,77,266,418,507,  347908,347984,348172,348326,348412,
587  628      22          0           0       0           0            0           0            -       AK021903.1-4   2057   0       1614  chr1   248956422  355147  355797  2           198,452,  443,1605,  355147,355345,
587  193      80          0           0       0           0            0           0            +       AK126884.1-8   3448   174     451   chr1   248956422  367880  368170  8           52,21,18,10,29,42,33,68,  174,226,247,268,278,308,350,383,  367880,367933,367956,367974,367990,368019,368068,368102,
588  378      55          0           0       0           0            0           0            +       AK308843.1-1   1360   0       448   chr1   248956422  439768  440208  5           40,54,128,124,87,  0,40,103,231,361,  439768,439812,439866,439997,440121,
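
The fields of each row constrain one another, which is useful when checking a parsed row. The sketch below is a minimal example, assuming a whitespace-separated row with the 22 schema fields in order and assuming the usual PSL convention that qStarts on a minus-strand query are given in reverse-strand coordinates; the helper names are illustrative.

```python
FIELDS = [
    "bin", "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_row(line: str) -> dict:
    """Split one row into typed fields; list fields lose their trailing comma."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = {}
    for name, value in zip(FIELDS, values):
        if name in LIST_FIELDS:
            row[name] = [int(x) for x in value.rstrip(",").split(",")]
        elif name in TEXT_FIELDS:
            row[name] = value
        else:
            row[name] = int(value)
    return row


def is_consistent(row: dict) -> bool:
    """Check block counts, aligned-base totals, and query/target extents."""
    sizes = row["blockSizes"]
    aligned = row["matches"] + row["misMatches"] + row["repMatches"] + row["nCount"]
    last_q = row["qStarts"][-1] + sizes[-1]
    if row["strand"][0] == "-":
        # Minus-strand query: convert reverse-strand block coordinates back.
        q_span = (row["qSize"] - last_q, row["qSize"] - row["qStarts"][0])
    else:
        q_span = (row["qStarts"][0], last_q)
    return (
        len(sizes) == row["blockCount"]
        and sum(sizes) == aligned
        and row["tStarts"][0] == row["tStart"]
        and row["tStarts"][-1] + sizes[-1] == row["tEnd"]
        and q_span == (row["qStart"], row["qEnd"])
    )


sample = ("586 266 32 0 0 0 0 0 0 - AK128836.1-3 4028 837 3106 chr1 248956422 "
          "135466 135765 3 33,136,129, 922,2926,3062, 135466,135499,135636,")
print(is_consistent(parse_row(sample)))  # True
```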

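Note: all start coordinates in our database are 0-based, not 1-based. End coordinates are exclusive, so the length of a feature is its end minus its start (for example, tEnd - tStart = 89605 - 87548 = 2057 in the first sample row).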

RetroGenes V9 (ucscRetroAli9) Track Description
 

Description

Retrotransposition is a process in which a group of enzymes reverse transcribes spliced mRNAs and inserts the resulting processed copies back into the genome, producing single-exon copies of genes and sometimes chimeric genes. Most retrogenes are non-functional pseudogenes, but some are functional genes that have acquired a promoter from a neighboring gene, some are transcribed pseudogenes, and some are anti-sense transcripts that may impede mRNA translation.

Methods

All mRNAs of a species from GenBank were aligned to the genome using lastz (Miller lab, Pennsylvania State University). mRNAs that aligned twice in the genome (once with introns and once without introns) were initially screened. Next, a series of features were scored to determine candidates for retrotransposition events. These features included position and length of the polyA tail, percent coverage of the retrogene alignment to the parent, degree of synteny with mouse, coverage of repetitive elements, number of exons that can still be aligned to the retrogene, number of putative introns removed at the retrogene locus and degree of divergence from the parent gene. Retrogenes were classified using a threshold score function that is a linear combination of this set of features. Retrogenes in the final set were selected using a score threshold based on a ROC plot against the Vega annotated pseudogenes.
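
The scoring step described above can be pictured as a weighted sum followed by a cutoff. The sketch below is illustrative only: the feature names, the weights in EXAMPLE_WEIGHTS, and the helper functions are hypothetical, since the actual RetroFinder weights and threshold are not given here. The roc_points helper shows how candidate thresholds could be compared against a labeled set such as annotated pseudogenes; it assumes the labels contain at least one positive and one negative.

```python
# Hypothetical weights, for illustration only (not the RetroFinder values).
EXAMPLE_WEIGHTS = {
    "polyA": 1.0,
    "parent_coverage": 2.0,
    "synteny_break": 0.5,
    "repeat_coverage": -1.0,
    "aligned_exons": 1.0,
    "introns_removed": 1.5,
    "divergence": -0.5,
}


def retro_score(features, weights=EXAMPLE_WEIGHTS):
    """Linear combination of feature values; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def passes(features, threshold, weights=EXAMPLE_WEIGHTS):
    """Classify a candidate as a retrogene when its score reaches the threshold."""
    return retro_score(features, weights) >= threshold


def roc_points(scores, labels):
    """Return (threshold, false-positive rate, true-positive rate) per distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        called = [s >= threshold for s in scores]
        tp = sum(c and l for c, l in zip(called, labels))
        fp = sum(c and not l for c, l in zip(called, labels))
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Toy scores and labels (True = annotated pseudogene), not real data.
for point in roc_points([3.0, 2.0, 1.0], [True, False, True]):
    print(point)
```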

Retrogene Statistics table:

  • Expression of Retrogene: One of the following values. Retrogenes that are not expressed are classed as pseudogene or mrna:
    • pseudogene indicates that the parent gene has been annotated by one of NCBI's RefSeq, UCSC Genes or Mammalian Gene Collection (MGC).
    • mrna indicates that the parent gene is a spliced mRNA that has no annotation in NCBI's RefSeq, UCSC Genes or Mammalian Gene Collection (MGC). Therefore, the retrogene is a product of a potentially non-annotated parent gene and is a putative pseudogene of that putative parent gene.
    • expressed weak indicates that there is an mRNA overlapping the retrogene, indicating possible transcription. noOrf indicates that an ORF was not identified by BESTORF.
    • expressed indicates that there is a medium level of mRNAs/ESTs mapping to the retrogene locus, indicating possible transcription.
    • expressed strong indicates that there is an mRNA overlapping the retrogene, and at least five spliced ESTs indicating probable transcription. noOrf indicates that an ORF was not identified by BESTORF.
    • expressed shuffle indicates that the retrogene was inserted into a pre-existing annotated gene.
  • Score: Weighted sum of features (mentioned above) of the potential retrogene.
  • Percent Gene Alignment Coverage (Bases Matching Parent): Shows the percentage of the parent gene aligning to this region.
  • Intron Count: The number of gaps in the alignment between the parent mRNA and the genome where the gap is >80 bp and the ratio of the mRNA alignment gap to the genome alignment gap is less than 30% after removing repeats.
  • Gap Count: Number of gaps in the alignment between the parent mRNA and the genome after removing repeats. Gaps are not counted if the gap on the mRNA side of the alignment is a similar size to the gap in the genome alignment.
  • BESTORF Score: BESTORF (written by Victor Solovyev) predicts potential open reading frames (ORFs) in mRNAs/ESTs with very high accuracy using a Markov chain model of coding regions and a probabilistic model of translation start codon potential. The score threshold for finding an ORF is 50 (Jim Kent, personal communication).
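
The Intron Count rule in the list above can be sketched from the block lists of a parent mRNA alignment. This is a minimal reading of the definition, not the RetroFinder code: it assumes the genome-side gap is the one compared with 80 bp, computes the ratio as mRNA gap divided by genome gap, and skips the repeat-removal step. The function name and the coordinates in the example are hypothetical.

```python
def count_introns(block_sizes, q_starts, t_starts, min_gap=80, max_ratio=0.30):
    """Count gaps between consecutive blocks that look like introns.

    A gap counts when the genome-side gap exceeds min_gap bases and the
    mRNA-side gap is less than max_ratio of the genome-side gap.
    """
    introns = 0
    for i in range(len(block_sizes) - 1):
        q_gap = q_starts[i + 1] - (q_starts[i] + block_sizes[i])
        t_gap = t_starts[i + 1] - (t_starts[i] + block_sizes[i])
        if t_gap > min_gap and q_gap / t_gap < max_ratio:
            introns += 1
    return introns


# Toy alignment with three blocks: a 1000 bp genome gap (counted as an intron)
# followed by a 50 bp genome gap (too short to count).
print(count_introns([100, 150, 60], [0, 100, 255], [1000, 2100, 2300]))  # 1
```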

Break in Orthology table:

Retrogenes inserted into the genome since the mouse/human divergence show a break in the human genome syntenic net alignments to the mouse genome. A break in orthology score is calculated and weighted before contributing to the final retrogene score. The score ranges from 0 to 130 and represents the portion of the genome that is missing in each species relative to the reference genome (human hg38) at the retrogene locus, as defined by syntenic alignment nets. A score of 0 means there is orthologous DNA and no break in orthology with the other species; this could indicate an ancient retrogene. Duplicated pseudogenes may also score low because they are often generated by large segmental duplication events, so the pseudogene is small relative to the inserted duplicated sequence. Scores greater than 100 represent cases where the retrogene alignment has no flanking alignment, resulting from an ancient insertion or other complex rearrangement.

Breaks in orthology with human and dog tend to be due to genomic insertions in the rodent lineage so sequence gaps are not treated as orthology breaks. Relative orthology of human/mouse and dog/mouse nets are used to avoid false positives due to deletions in the human genome. Since older retrogenes will not show a break in orthology, this feature is weighted lower than other features when scoring putative retrogenes.

Credits

The RetroFinder program and browser track were developed by Robert Baertsch at UCSC.

References

Baertsch R, Diekhans M, Kent WJ, Haussler D, Brosius J. Retrocopy contributions to the evolution of the human genome. BMC Genomics. 2008 Oct 8;9:466. PMID: 18842134; PMC: PMC2584115

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784

Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu XJ, Harte R, Balasubramanian S, Tanzer A, Diekhans M et al. The GENCODE pseudogene resource. Genome Biol. 2012 Sep 26;13(9):R51. PMID: 22951037; PMC: PMC3491395

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961

Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A, Choo SW, Lu Y, Denoeud F, Antonarakis SE, Snyder M et al. Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome Res. 2007 Jun;17(6):839-51. PMID: 17568002; PMC: PMC1891343