category - This is either coding, noncoding, antisense, or
nearCoding. A coding transcript is one where the evidence is relatively
good that it produces a protein. The nearCoding transcripts overlap coding
transcripts by at least 20 bases on the same strand, but themselves do not seem to produce
protein products. In many cases this is because they are splicing variants
with introns after the stop codon, which therefore undergo nonsense-mediated decay.
Antisense transcripts overlap coding transcripts by at least 20 bases on the opposite
strand. The other transcripts, which are neither coding nor overlapping coding,
are categorized as noncoding.
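
The category rules above can be sketched as a small decision function. This is only an illustration: the function name, its parameters, and the order in which the same-strand and opposite-strand overlaps are tested are assumptions, not the pipeline's actual code.

```python
MIN_OVERLAP = 20  # minimum overlap in bases, as stated in the text


def categorize(is_coding, same_strand_overlap, opposite_strand_overlap):
    """Return the category of a transcript.

    is_coding: True if the evidence for a protein product is good.
    same_strand_overlap: bases shared with coding transcripts on the same strand.
    opposite_strand_overlap: bases shared with coding transcripts on the
        opposite strand.
    Checking the same strand before the opposite strand is an assumption.
    """
    if is_coding:
        return "coding"
    if same_strand_overlap >= MIN_OVERLAP:
        return "nearCoding"
    if opposite_strand_overlap >= MIN_OVERLAP:
        return "antisense"
    return "noncoding"
```

For example, a transcript with no protein product that shares 35 bases with a coding transcript on the opposite strand would be labeled antisense under these assumptions.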
exon count - The number of exons in the gene. Single exon genes are
generally somewhat less reliable than multi-exon genes, though there are
many well-known genuine single exon genes such as the Histones and the Sox family.
ORF size - The size of the open reading frame in the mRNA. Divide by
three to get the size of the protein.
txCdsPredict score - The score from the txCdsPredict program. This
program weighs a variety of evidence including the presence of a Kozak consensus
sequence at the start codon, the length of the ORF, the presence of upstream
ORFs, homology in other species, and nonsense-mediated decay. In general
a score over 1000 is almost certain to be a protein, and scores over 800 have
about a 90% chance of being a protein.
has start codon - Indicates if the initial codon is an ATG.
has end codon - Indicates if the last codon is TAA, TAG, or TGA.
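
The ORF size, start codon, and end codon fields can be illustrated with a few helper functions. This is a minimal sketch: the function names are hypothetical, and it assumes the coding sequence is given as a DNA string on the transcript strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_start_codon(cds):
    """True if the first codon of the coding sequence is ATG."""
    return cds[:3].upper() == "ATG"


def has_end_codon(cds):
    """True if the last codon of the coding sequence is TAA, TAG, or TGA."""
    return len(cds) >= 3 and cds[-3:].upper() in STOP_CODONS


def protein_size(orf_size):
    """Divide the ORF size in bases by three to get the size in codons."""
    return orf_size // 3
```

Whether the reported ORF size includes the stop codon is not stated in the text, so the result of `protein_size` may differ by one from the true number of amino acids.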
selenocysteine - Indicates if this is one of the special proteins where
TGA encodes the amino acid selenocysteine rather than encoding a stop codon.
nonsense-mediated-decay - Indicates whether the final intron is more than
55 bases after the stop codon. If true, then generally the mRNA will be degraded
before it can produce a detectable amount of protein. Therefore when this condition
is true we remove the predicted coding region from the transcript.
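
The 55-base rule can be sketched as follows. The function name and inputs are hypothetical, and the sketch assumes that the distance is measured along the spliced mRNA from the end of the stop codon to the last exon-exon junction; the text does not specify either detail.

```python
NMD_MIN_DISTANCE = 55  # threshold in bases, as stated in the text


def is_nmd_target(exon_sizes, cds_end):
    """Return True if the final intron lies more than 55 bases after the stop.

    exon_sizes: exon lengths in transcript order (5' to 3').
    cds_end: position in the spliced mRNA just past the stop codon
        (0-based, exclusive). This coordinate convention is an assumption.
    """
    if len(exon_sizes) < 2:
        return False  # no introns, so the rule cannot apply
    # Position of the final intron in mRNA coordinates: the junction
    # between the second-to-last and the last exon.
    last_junction = sum(exon_sizes[:-1])
    return last_junction - cds_end > NMD_MIN_DISTANCE
```

With exons of 100, 200, and 50 bases, the final junction sits at position 300, so a coding region ending at position 150 meets the condition while one ending at position 280 does not.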
CDS single in 3' UTR - This is a strong indicator that the coding region
(CDS) is a coincidental open reading frame rather than a true indication
that the transcript codes for protein. This indicates that the coding sequence
resides in a single exon, and that this exon is located entirely in the 3' UTR
of another transcript that codes for a different protein not overlapping the
ORF in the same frame. We remove the CDS from non-RefSeq transcripts that meet
this condition, which often results from a retained intron or from missing the
initial parts of a transcript.
CDS single in intron - This is another strong indicator that the ORF is
not real. Here the coding region (CDS) lies entirely in the intron of another
transcript which has strong evidence of coding for a protein. We remove the CDS
from non-RefSeq transcripts that meet this condition, which generally results
from a retained intron.
frame shift in genome - This only occurs for RefSeq transcripts. Here
a frame shift is detected in the coding region when aligning the transcript against
the genome. Since RefSeq does examine these cases carefully, it is strong evidence
that the genome sequence is in error, or that the anonymous DNA donor carried
a frame-shift mutation in this gene. In general there will be multiple independent
cDNA clones supporting the RefSeq over the genome. In the gene display on the
browser, one or two bases will be removed from the gene to keep frame intact.
stop codon in genome - This also only occurs for RefSeq transcripts, and
as with the frame shifts, there are generally multiple lines of evidence suggesting
sequencing error or mutation in the reference genome. In the gene display on the
browser, three bases will be removed from the gene to avoid the stop.
retained intron - The transcript contains what is an intron in
an overlapping transcript on the same strand. In many cases this indicates
that the transcript was not completely processed. Unless specific steps are
taken to isolate cytoplasmic rather than nuclear RNA, a certain fraction of the
RNA isolated for sequencing will be incompletely processed. Transcripts with
retained introns should be viewed suspiciously, especially if they are long.
However there are cases where fully mature mRNA transcripts are made with
and without a particular intron, so transcripts with retained introns are not
eliminated from this gene set.
end bleed into intron - Very often when an intron is retained, it is so
long that the next exon is not reached and sequenced. In this case the retained
intron can't be detected directly. However, high values of "end bleed" are
strongly suggestive of a retained intron. End bleed measures how far the end of
a transcript extends into an intron defined by another overlapping transcript. Note
however that alternative promoters and alternative polyadenylation sites can
create end bleeds in fully mature transcripts.
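
One way to compute an end bleed value is sketched below. The function name, the half-open genomic coordinates, and the choice to check both ends of the transcript and report the larger value are assumptions made for illustration.

```python
def end_bleed(tx_start, tx_end, introns):
    """Return how far a transcript end extends into another transcript's intron.

    tx_start, tx_end: genomic span of the transcript, half-open [start, end).
    introns: list of (start, end) half-open intron spans taken from another
        overlapping transcript on the same strand.
    Returns 0 if neither end of the transcript falls inside an intron.
    """
    bleed = 0
    for intron_start, intron_end in introns:
        # Right end stops inside the intron after entering from the left exon.
        if tx_start < intron_start < tx_end <= intron_end:
            bleed = max(bleed, tx_end - intron_start)
        # Left end begins inside the intron and runs on into the right exon.
        if intron_start <= tx_start < intron_end < tx_end:
            bleed = max(bleed, intron_end - tx_start)
    return bleed
```

A transcript spanning 50 to 130 measured against an intron spanning 100 to 200 has an end bleed of 30 bases in this sketch.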
RNA accession - The RefSeq, GenBank/EMBL/DDBJ, Rfam, or tRNA accession
on which this transcript is most closely based. Note that the splice
sites when possible are taken from a consensus of RNA alignments rather than
just from a single RNA. For non-RefSeq genes the bases are taken from the genome
rather than the RNA. However the transcript does define which introns and exons
are used to build the transcript.
RNA size - The size of the RNA on which this transcript is most
closely based, including the poly-A tail (if any).
Alignment % ID - Percentage identity within the alignment of the RNA to the genome.
% Coverage - The percentage of the RNA covered by the alignment to
genome. This excludes the poly-A tail, if the RNA has one.
# of Alignments - The number of times this RNA aligns to the genome
at very high stringency. More care must be taken in interpreting genes based
on transcripts with multiple alignments. We do substantial filtering to avoid
pseudogenes, but extremely recent, extremely complete pseudogenes may still
pass these filters and cause multiple alignments.
# of AT/AC introns - The number of introns in this transcript with
AT/AC rather than the usual GT/AG ends. There are roughly 300 genes with
legitimate AT/AC introns.
# of strange splices - The number of introns that have ends which are
neither GT/AG, GC/AG, nor AT/AC. Many of these are the result of sequencing
errors, or polymorphisms between the DNA donors and the RNA donors.
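
The two intron counts above can be illustrated by classifying each intron by its first and last two bases. This is a minimal sketch with hypothetical function names; it assumes each intron is supplied as a DNA string on the transcript strand.

```python
from collections import Counter

# Intron end pairs named in the text: (first two bases, last two bases).
KNOWN_ENDS = {
    ("GT", "AG"): "GT/AG",
    ("GC", "AG"): "GC/AG",
    ("AT", "AC"): "AT/AC",
}


def classify_intron(intron_seq):
    """Label an intron by its ends, or 'strange' if the ends match none."""
    seq = intron_seq.upper()
    return KNOWN_ENDS.get((seq[:2], seq[-2:]), "strange")


def count_splices(intron_seqs):
    """Return (number of AT/AC introns, number of strange splices)."""
    counts = Counter(classify_intron(seq) for seq in intron_seqs)
    return counts["AT/AC"], counts["strange"]
```

Calling `count_splices` on the introns of one transcript gives the pair of values corresponding to the "# of AT/AC introns" and "# of strange splices" fields.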