Schema for CAT/Liftoff Genes - CAT + Liftoff Gene Annotations
Database: hub_567047_hs1
Primary Table: hub_567047_catLiftOffGenesV1
Data last updated: 2022-03-16
Big Bed File Download: /gbdb/hs1/catLiftOffGenesV1/catLiftOffGenesV1.bb
Item Count: 234,903
The data is stored in the binary BigBed format.
Format description: bigCat gene models
field | example | description |
chrom | chr1 | Reference sequence chromosome or scaffold |
chromStart | 165511939 | Start position in chromosome |
chromEnd | 165685138 | End position in chromosome |
name | AL596087.2-202 | Name |
score | 0 | Score (0-1000) |
strand | + | + or - for strand |
thickStart | 165685138 | Start of where display should be thick (start codon) |
thickEnd | 165685138 | End of where display should be thick (stop codon) |
reserved | 85,212,76 | RGB value (use R,G,B string in input file) |
blockCount | 3 | Number of blocks |
blockSizes | 144,138,191 | Comma separated list of block sizes |
chromStarts | 0,169436,173008 | Start positions relative to chromStart |
name2 | AL596087.2 | Gene name |
cdsStartStat | none | Status of CDS start annotation |
cdsEndStat | none | Status of CDS end annotation |
exonFrames | -1,-1,-1 | Exon frame {0,1,2}, or -1 if no frame for exon |
txId | CHM13_T0014705 | Transcript ID |
type | lncRNA | Transcript type |
geneName | CHM13_G0003761 | Gene ID |
geneType | lncRNA | Gene type |
sourceGene | ENSG00000229588.2 | Source gene ID |
sourceTranscript | ENST00000653824.1 | Source transcript ID |
alignmentId | ENST00000653824.1-0 | Alignment ID |
alternativeSourceTranscripts | N/A | Alternative source transcripts |
Paralogy | nan | Paralogous alignment IDs |
UnfilteredParalogy | nan | Unfiltered paralogous alignment IDs |
collapsedGeneIds | N/A | Collapsed Gene IDs |
collapsedGeneNames | N/A | Collapsed Gene Names |
frameshift | nan | Frameshifted relative to source? |
exonAnnotationSupport | 1,1,1 | Exon support in reference annotation |
intronAnnotationSupport | 1,1 | Intron support in reference annotation |
transcriptClass | ortholog | Transcript class |
transcriptModes | transMap | Transcript mode(s) |
validStart | True | Valid start codon |
validStop | True | Valid stop codon |
properOrf | True | Proper multiple of 3 ORF |
extra_paralog | False | Extra paralog of gene? |
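The block fields in the schema above (blockSizes, chromStarts) give exon positions relative to chromStart. As a minimal sketch, assuming the usual BED convention of zero-based, half-open coordinates, the absolute exon intervals for one row could be derived as follows. The function name exon_intervals is hypothetical and not part of any browser tool.

```python
def exon_intervals(chrom_start, block_sizes, block_starts):
    """Return absolute, half-open (start, end) exon intervals for one row.

    block_sizes and block_starts are the comma-separated strings from the
    blockSizes and chromStarts fields.
    """
    sizes = [int(s) for s in block_sizes.strip(",").split(",")]
    offsets = [int(s) for s in block_starts.strip(",").split(",")]
    if len(sizes) != len(offsets):
        raise ValueError("blockSizes and chromStarts must have the same length")
    # Each block starts at chromStart plus its relative offset.
    return [(chrom_start + off, chrom_start + off + size)
            for off, size in zip(offsets, sizes)]


# Values copied from the example column of the schema table.
exons = exon_intervals(165511939, "144,138,191", "0,169436,173008")
for start, end in exons:
    print(start, end)
```

For the example row, the end of the last block coincides with chromEnd (165685138), which is a quick sanity check on any parsed row.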
Sample Rows
chrom | chromStart | chromEnd | name | score | strand | thickStart | thickEnd | reserved | blockCount | blockSizes | chromStarts | name2 | cdsStartStat | cdsEndStat | exonFrames | txId | type | geneName | geneType | sourceGene | sourceTranscript | alignmentId | alternativeSourceTranscripts | Paralogy | UnfilteredParalogy | collapsedGeneIds | collapsedGeneNames | frameshift | exonAnnotationSupport | intronAnnotationSupport | transcriptClass | transcriptModes | validStart | validStop | properOrf | extra_paralog |
chr1 | 165511939 | 165685138 | AL596087.2-202 | 0 | + | 165685138 | 165685138 | 85,212,76 | 3 | 144,138,191 | 0,169436,173008 | AL596087.2 | none | none | -1,-1,-1 | CHM13_T0014705 | lncRNA | CHM13_G0003761 | lncRNA | ENSG00000229588.2 | ENST00000653824.1 | ENST00000653824.1-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1 | 1,1 | ortholog | transMap | True | True | True | False |
chr1 | 165621672 | 165623641 | AL596087.1-201 | 0 | - | 165623641 | 165623641 | 255,51,255 | 1 | 1969 | 0 | AL596087.1 | none | none | -1 | CHM13_T0014706 | processed_pseudogene | CHM13_G0003762 | processed_pseudogene | ENSG00000215835.2 | ENST00000400979.2 | ENST00000400979.2-0 | N/A | nan | nan | N/A | N/A | nan | 1 | nan | ortholog | transMap | True | True | True | False |
chr1 | 165680931 | 165681810 | AL596087.2-201 | 0 | + | 165681810 | 165681810 | 85,212,76 | 3 | 80,138,205 | 0,444,674 | AL596087.2 | none | none | -1,-1,-1 | CHM13_T0014707 | lncRNA | CHM13_G0003761 | lncRNA | ENSG00000229588.2 | ENST00000425271.1 | ENST00000425271.1-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1 | 1,1 | ortholog | transMap | True | True | True | False |
chr1 | 165733771 | 165798688 | AL583804.1-201 | 0 | - | 165798688 | 165798688 | 85,212,76 | 3 | 634,105,71 | 0,64588,64846 | AL583804.1 | none | none | -1,-1,-1 | CHM13_T0014708 | lncRNA | CHM13_G0003763 | lncRNA | ENSG00000225325.1 | ENST00000448643.1 | ENST00000448643.1-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1 | 1,1 | ortholog | transMap | True | True | True | False |
chr1 | 165820713 | 165827534 | FMO7P-201 | 0 | + | 165827534 | 165827534 | 255,51,255 | 4 | 138,148,156,326 | 0,1391,3463,6495 | FMO7P | none | none | -1,-1,-1,-1 | CHM13_T0014709 | unprocessed_pseudogene | CHM13_G0003764 | unprocessed_pseudogene | ENSG00000230231.1 | ENST00000436045.1 | ENST00000436045.1-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1,1 | 1,1,1 | ortholog | transMap | True | True | True | False |
chr1 | 165820847 | 165836063 | LINC01675-202 | 0 | - | 165836063 | 165836063 | 85,212,76 | 3 | 1326,89,236 | 0,2002,14980 | LINC01675 | none | none | -1,-1,-1 | CHM13_T0014710 | lncRNA | CHM13_G0003765 | lncRNA | ENSG00000234142.2 | ENST00000662326.1 | ENST00000662326.1-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1 | 1,1 | ortholog | transMap | True | True | True | False |
chr1 | 165821842 | 165836274 | LINC01675-201 | 0 | - | 165836274 | 165836274 | 85,212,76 | 4 | 331,89,166,447 | 0,1007,8415,13985 | LINC01675 | none | none | -1,-1,-1,-1 | CHM13_T0014711 | lncRNA | CHM13_G0003765 | lncRNA | ENSG00000234142.2 | ENST00000426519.2 | ENST00000426519.2-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1,1 | 1,1,1 | ortholog | transMap | True | True | True | False |
chr1 | 165912094 | 165926621 | FMO8P-201 | 0 | + | 165926621 | 165926621 | 255,51,255 | 8 | 135,182,163,143,200,351,79,347 | 0,3775,4399,7456,8421,11119,12943,14180 | FMO8P | none | none | -1,-1,-1,-1,-1,-1,-1,-1 | CHM13_T0014712 | unprocessed_pseudogene | CHM13_G0003766 | unprocessed_pseudogene | ENSG00000238087.3 | ENST00000434461.1 | ENST00000434461.1-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1,1,1,1,1,1 | 1,1,1,1,1,1,1 | ortholog | transMap | True | True | True | False |
chr1 | 165949824 | 165971136 | FMO9P-201 | 0 | + | 165971136 | 165971136 | 85,212,76 | 7 | 69,98,144,189,163,143,558 | 0,843,8533,16872,17943,19717,20754 | FMO9P | none | none | -1,-1,-1,-1,-1,-1,-1 | CHM13_T0014713 | processed_transcript | CHM13_G0003767 | transcribed_unprocessed_pseudogene | ENSG00000215834.10 | ENST00000477875.6 | ENST00000477875.6-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1,1,1,1,1 | 1,1,1,1,1,1 | ortholog | transMap | True | True | True | False |
chr1 | 165958367 | 165977277 | FMO9P-202 | 0 | + | 165977277 | 165977277 | 255,51,255 | 8 | 134,189,163,143,200,353,73,280 | 0,8329,9400,11174,12818,15991,16489,18630 | FMO9P | none | none | -1,-1,-1,-1,-1,-1,-1,-1 | CHM13_T0014714 | transcribed_unprocessed_pseudogene | CHM13_G0003767 | transcribed_unprocessed_pseudogene | ENSG00000215834.10 | ENST00000488458.1 | ENST00000488458.1-0 | N/A | nan | nan | N/A | N/A | nan | 1,1,1,1,1,1,1,1 | 1,1,1,1,1,1,1 | ortholog | transMap | True | True | True | False |
CAT/Liftoff Genes (hub_567047_catLiftOffGenesV1) Track Description
Description
This track shows the gene models for the T2T CHM13 assembly generated using the CAT (Comparative Annotation Toolkit) software. Genes that CAT could not map, as well as novel paralogs, were filled in from the Liftoff mappings.
The reference annotations are from GENCODE V35.
Display Conventions and Configuration
This track follows the display conventions for gene prediction tracks.
The exons for putative non-coding genes and
untranslated regions are represented by relatively thin blocks, while those
for coding open reading frames are thicker. Gene names are displayed in 'pack'
or 'full' mode. More information about each gene can be found by clicking on
the specific gene/transcript model.
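The thin/thick convention described above can be sketched in code. This is an illustration only, assuming half-open exon intervals and the standard BED meaning of thickStart and thickEnd; the function name split_thick_thin and the coding example values are hypothetical.

```python
def split_thick_thin(exons, thick_start, thick_end):
    """Split half-open exon intervals into (thin, thick) sub-intervals."""
    thin, thick = [], []
    for start, end in exons:
        # Portion of this exon that falls inside the thick region, if any.
        t_start, t_end = max(start, thick_start), min(end, thick_end)
        if t_start < t_end:
            thick.append((t_start, t_end))
            if start < t_start:
                thin.append((start, t_start))
            if t_end < end:
                thin.append((t_end, end))
        else:
            # No overlap with the thick region: the whole exon is thin.
            thin.append((start, end))
    return thin, thick


# First sample row: thickStart == thickEnd, so nothing is drawn thick.
noncoding_exons = [(165511939, 165512083),
                   (165681375, 165681513),
                   (165684947, 165685138)]
nc_thin, nc_thick = split_thick_thin(noncoding_exons, 165685138, 165685138)

# Hypothetical coding transcript with made-up coordinates.
cd_thin, cd_thick = split_thick_thin([(100, 200), (300, 400)], 150, 350)
```

In the sample rows on this page, thickStart and thickEnd are equal for every transcript, which is consistent with all of them being non-coding or pseudogene models.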
The following color key is used:
- Blue: protein coding
- Green: non-coding
- Pink: pseudogenes
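As a rough sketch, the color key can be tied to the reserved (RGB) column. The pairing below is an assumption inferred only from the sample rows on this page, where lncRNA and processed_transcript rows carry 85,212,76 and pseudogene rows carry 255,51,255. The RGB value used for blue (protein coding) does not appear in the sample rows, so it is deliberately left out rather than guessed.

```python
# RGB values observed in the sample rows, paired with the color key.
# This mapping is inferred from the page, not taken from the track's source.
OBSERVED_COLORS = {
    (85, 212, 76): "Green: non-coding",
    (255, 51, 255): "Pink: pseudogenes",
}


def color_category(reserved):
    """Map a 'R,G,B' string from the reserved field to a color-key entry."""
    rgb = tuple(int(v) for v in reserved.split(","))
    return OBSERVED_COLORS.get(rgb, "unknown (not seen in the sample rows)")


lnc_label = color_category("85,212,76")
pseudo_label = color_category("255,51,255")
other_label = color_category("0,0,0")
```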
Methods
This track combines gene annotations generated by two methods. First, the
Comparative Annotation Toolkit (CAT) was used to annotate the assembly.
Liftoff was then used as a second annotation method to map genes missed by CAT and
additional gene paralogs.
Comparative Annotation Toolkit
Genome annotation for the T2T CHM13
assembly was performed using Comparative Annotation Toolkit (CAT). CAT
leverages whole-genome alignments generated by Cactus to transfer annotations
from one source genome to one or more target genomes. For this annotation set,
CAT lifted over the reference GENCODE v35 annotations onto the T2T genome. CAT
also incorporated Iso-Seq data, first assembled into transcripts with
StringTie2, to make the final consensus annotation set.
Liftoff
Liftoff uses Minimap2 to align reference gene DNA
sequences to the target genome and selects the alignment(s) concordant with
the intron/exon structure with the highest sequence identity. A minimum
sequence identity of 95% was required to annotate gene paralogs. After
running Liftoff, we identified genes that did not overlap any CAT annotations
using bedtools intersect. These were combined with the CAT annotation to
create the final annotation.
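The overlap filter in the last step can be illustrated with a small pure-Python stand-in. This is a sketch, not the pipeline that was actually run: the exact bedtools options are not stated here, strand is ignored as a simplifying assumption, and the function name and gene tuples are hypothetical. The behavior is similar in spirit to bedtools intersect with the -v option, which reports entries with no overlap.

```python
from collections import defaultdict


def genes_without_overlap(liftoff_genes, cat_genes):
    """Keep Liftoff genes that overlap no CAT gene on the same chromosome.

    Genes are (chrom, start, end, name) tuples with half-open coordinates.
    """
    cat_by_chrom = defaultdict(list)
    for chrom, start, end, _name in cat_genes:
        cat_by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end, name in liftoff_genes:
        # Two half-open intervals overlap when each starts before the other ends.
        overlaps = any(start < c_end and c_start < end
                       for c_start, c_end in cat_by_chrom.get(chrom, ()))
        if not overlaps:
            kept.append((chrom, start, end, name))
    return kept


# Made-up coordinates for illustration only.
cat = [("chr1", 100, 500, "catGeneA"), ("chr2", 1000, 2000, "catGeneB")]
liftoff = [
    ("chr1", 400, 800, "overlapsA"),    # overlaps catGeneA, dropped
    ("chr1", 500, 900, "adjacentToA"),  # starts where catGeneA ends, kept
    ("chr3", 10, 50, "onlyInLiftoff"),  # no CAT genes on chr3, kept
]
extra = genes_without_overlap(liftoff, cat)
combined = cat + extra
```

The scan is quadratic in the worst case, which is fine for a sketch; a sorted or indexed interval structure would be the natural choice at genome scale.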
Credits
This track was provided by Marina Haukness <mhauknes@ucsc.edu>
of UC Santa Cruz and Alaina Shumate
<ashumat2@jhmi.edu> of Johns Hopkins University.
References
Fiddes IT, Armstrong J, Diekhans M, Nachtweide S, Kronenberg ZN, Underwood JG, Gordon D, Earl D,
Keane T, Eichler EE et al.
Comparative Annotation Toolkit (CAT)-simultaneous clade and personal genome annotation.
Genome Res. 2018 Jul;28(7):1029-1038.
PMID: 29884752; PMC: PMC6028123
Stanke M, Diekhans M, Baertsch R, Haussler D.
Using native and syntenically mapped cDNA alignments to improve de novo gene finding.
Bioinformatics. 2008 Mar 1;24(5):637-44.
PMID: 18218656
Stanke M, Steinkamp R, Waack S, Morgenstern B.
AUGUSTUS: a web server for gene finding in eukaryotes.
Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W309-12.
PMID: 15215400; PMC: PMC441517
Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q, Xie D, Feng S, Stiller J
et al.
Progressive Cactus is a multiple-genome aligner for the thousand-genome era.
Nature. 2020 Nov;587(7833):246-251.
PMID: 33177663; PMC: PMC7673649
Shumate A, Salzberg SL.
Liftoff: accurate mapping of gene annotations.
Bioinformatics. 2020 Dec 15.
PMID: 33320174; PMC: PMC8289374