Schema for TOGA vs. hg38 - TOGA annotations using human/hg38 as reference
Database: cavPor3    Primary Table: HLTOGAannotvHg38v1    Data last updated: 2022-06-20
Big Bed File Download: /gbdb/cavPor3/TOGAvHg38v1/
Item Count: 68,039
The data is stored in the binary BigBed format.

Format description: TOGA predicted gene model
field                example                        description
chrom                chrM                           Reference sequence chromosome or scaffold
chromStart           2033                           Start position in chromosome
chromEnd             2108                           End position in chromosome
name                 ENST00000399974.MTRNR2L4.8861  Name or ID of item, ideally both human readable and unique
score                1000                           Score (0-1000)
strand               +                              + or - for strand
thickStart           2033                           Start of where display should be thick (start codon)
thickEnd             2108                           End of where display should be thick (stop codon)
itemRgb              0,0,200                        RGB value (use R,G,B string in input file)
blockCount           1                              Number of blocks
blockSizes           75,                            Comma separated list of block sizes
chromStarts          0,                             Start positions relative to chromStart
ref_trans_id         ENST00000399974.MTRNR2L4       Reference transcript ID
ref_region           chr16:3370978-3372668          Transcript region in the reference
query_region         chrM:2033-2108                 Region in the query
chain_score          0.9805133938789368             Chain orthology probability score
chain_synteny        1                              Chain synteny log10 value
chain_flank          0.1625                         Chain flank feature
chain_gl_cds_fract   0.020336605890603085           Chain global CDS fraction value
chain_loc_cds_fract  1.0                            Chain local CDS fraction value
chain_exon_cov       1.0                            Chain exon coverage value
chain_intron_cov     0                              Chain intron coverage value
status               Intact                         Gene loss classification
perc_intact_ign_M    0.8620689655172413             % intact ignoring missing
perc_intact_int_M    0.8620689655172413             % intact considering missing as intact
intact_codon_prop    0.8620689655172413             % intact codons
ouf_prop             0.0                            % out of chain
mid_intact           1                              Is middle 80% intact
mid_pres             1                              Is middle 80% fully present
[field name lost]    [alignment markup lost]        HTML-formatted protein alignment
svg_line             [SVG markup for ENST00000399974.MTRNR2L4.8861]  SVG inactivating mutations visualization
ref_link             ENST00000399974                Reference transcript link
[field name lost]    [example lost]                 HTML-formatted inactivating mutations table
[field name lost]    [exon report, shown below]     HTML-formatted exon alignment

Example value of the HTML-formatted exon alignment field:
Exon number: 1
Exon region: chrM:2033-2108
Nucleotide percent identity: 66.67 | BLOSUM: 49.28
Intersects assembly gaps: NO
Exon alignment class: A
Detected within expected region (exp:1884-2148): YES
Sequence alignment between reference and query exon: [alignment lost in extraction]

Sample Rows
chrM  2033  2108  ENST00000399974.MTRNR2L4.8861  1000  +  2033  2108  0,0,200  1  75,  0,  ENST00000399974.MTRNR2L4  chr16:3370978-3372668  chrM:2033-2108  0.9805133938789368  1  0.1625  0.020336605890603085  1.0  1.0  0  Intact  0.8620689655172413  0.8620689655172413  0.8620689655172413  0.0  1  1  ref: MATQGFSCLLLSVSEIDLSMKRQYKQIR* ...  ...  ENST00000399974  Exon number: 1 Exon region: chrM:2033-2108 Nucleotide percent identity: 66.67 | BLOSUM: ...
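
The first twelve fields of each row follow the standard BED12 layout, so block (exon) coordinates can be derived from chromStart, blockSizes and chromStarts, and the ref_region and query_region strings can be split into chromosome, start and end. The following is a minimal Python sketch of that arithmetic, applied to the sample row above. The helper names (Region, parse_region, parse_int_list, block_intervals) are hypothetical and are not part of TOGA or the UCSC tools; the sketch assumes the row has already been split into separate fields.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    """A parsed 'chrom:start-end' string (hypothetical helper type)."""
    chrom: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Split a region string such as 'chrM:2033-2108' into its parts."""
    chrom, _, span = text.rpartition(":")
    start, _, end = span.partition("-")
    return Region(chrom, int(start), int(end))


def parse_int_list(text: str) -> List[int]:
    """Parse a BED-style comma-separated list; a trailing comma is allowed."""
    return [int(token) for token in text.split(",") if token]


def block_intervals(chrom_start: int, block_sizes: str,
                    chrom_starts: str) -> List[Tuple[int, int]]:
    """Return (start, end) per block; chromStarts are offsets from chromStart."""
    sizes = parse_int_list(block_sizes)
    offsets = parse_int_list(chrom_starts)
    if len(sizes) != len(offsets):
        raise ValueError("blockSizes and chromStarts differ in length")
    return [(chrom_start + off, chrom_start + off + size)
            for off, size in zip(offsets, sizes)]


# First twelve fields of the sample row, already split into columns.
row = ["chrM", "2033", "2108", "ENST00000399974.MTRNR2L4.8861", "1000", "+",
       "2033", "2108", "0,0,200", "1", "75,", "0,"]

exons = block_intervals(int(row[1]), row[10], row[11])
print(exons)                                   # [(2033, 2108)]
print(parse_region("chr16:3370978-3372668"))
```

For the sample row, the single 75-base block starting at offset 0 spans 2033 to 2108, which matches chromStart and chromEnd.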
TOGA vs. hg38 (HLTOGAannotvHg38v1) Track Description


TOGA (Tool to infer Orthologs from Genome Alignments) is a homology-based method that integrates gene annotation, ortholog inference, and the classification of genes as intact or lost.


As input, TOGA uses a gene annotation of a reference species (human/hg38 for mammals, chicken/galGal6 for birds) and a whole genome alignment between the reference and query genome.

TOGA implements a novel paradigm that relies on alignments of intronic and intergenic regions and uses machine learning to accurately distinguish orthologs from paralogs or processed pseudogenes.

To annotate genes, CESAR 2.0 is used to determine the positions and boundaries of coding exons of a reference transcript in the orthologous genomic locus in the query species.

Display Conventions and Configuration

Each annotated transcript is color-coded according to one of the following classifications:

  •   "intact": middle 80% of the CDS (coding sequence) is present and exhibits no gene-inactivating mutation. These transcripts likely encode functional proteins.
  •   "partially intact": ≥50% of the CDS is present in the query and the middle 80% of the CDS exhibits no inactivating mutation. These transcripts may also encode functional proteins, but the evidence is weaker as parts of the CDS are missing, often due to assembly gaps.
  •   "missing": <50% of the CDS is present in the query and the middle 80% of the CDS exhibits no inactivating mutation.
  •   "uncertain loss": there is 1 inactivating mutation in the middle 80% of the CDS, but evidence is not strong enough to classify the transcript as lost. These transcripts may or may not encode a functional protein.
  •   "lost": typically several inactivating mutations are present, thus there is strong evidence that the transcript is unlikely to encode a functional protein.
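
The decision order implied by the list above can be sketched as a small Python function. This is a simplified illustration, not TOGA's actual implementation: the function name and its three inputs are hypothetical, and treating two or more inactivating mutations as "lost" is an assumption standing in for the "typically several" wording, since the real criteria weigh additional evidence.

```python
def classify_transcript(cds_fraction_present: float,
                        middle_fully_present: bool,
                        inactivating_in_middle: int) -> str:
    """Simplified sketch of the five display classes (not TOGA's own code).

    cds_fraction_present   -- fraction of the CDS found in the query (0.0-1.0)
    middle_fully_present   -- True if the middle 80% of the CDS is present
    inactivating_in_middle -- inactivating mutations in the middle 80%
    """
    # Assumption: two or more mutations count as strong evidence of loss.
    if inactivating_in_middle >= 2:
        return "lost"
    if inactivating_in_middle == 1:
        return "uncertain loss"
    # From here on, the middle 80% carries no inactivating mutation.
    if middle_fully_present:
        return "intact"
    if cds_fraction_present >= 0.5:
        return "partially intact"
    return "missing"


print(classify_transcript(1.0, True, 0))    # intact
print(classify_transcript(0.6, False, 0))   # partially intact
print(classify_transcript(0.3, False, 0))   # missing
```

Checking mutations first keeps the three mutation-free classes distinguished only by how much of the CDS is present.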

Clicking on a transcript provides additional information about the orthology classification, inactivating mutations, the protein sequence and protein/exon alignments.


These data were prepared by the Michael Hiller Lab.


The TOGA software is available from

Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales A, Ahmed AW, Kontopoulos DG, Hilgers L, Zoonomia Consortium, Hiller M. TOGA integrates gene annotation with orthology inference at scale. bioRxiv preprint September 2022