The ENCODE project has revealed the functional elements of segments
of the human genome in unprecedented detail. However, the ability to
clearly distinguish transcripts destined for translation into protein
from those that serve purely regulatory roles remains elusive. The
standard means for doing this is to measure the proteins, if any, that
are produced by transcripts via mass spectrometry-based proteogenomic
mapping. In this process, chromatographically fractionated peptides are
fed into a tandem mass spectrometer (MS/MS). The series of fragment
masses produced in MS/MS creates a signature that can then be used to
identify the peptide from a protein or DNA sequence database. For
proteogenomic mapping, this identifying spectrum is mapped directly
back to its most likely encoding locus on a genome sequence (Giddings, et al. 2003). This
allows the direct verification of protein-encoding transcripts.
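The fragment-mass signature described above can be illustrated with a minimal sketch. It assumes standard monoisotopic residue masses and computes only singly charged b and y ions; the function name `fragment_ions` is hypothetical, and this is not the scoring used by GFS or HMM_Score.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O (Da)
PROTON = 1.007276   # mass of a proton (Da)


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for a peptide.

    Hypothetical helper for illustration only.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]

    # b ions: N-terminal prefixes (excluding the full peptide).
    b_ions = []
    running = 0.0
    for mass in masses[:-1]:
        running += mass
        b_ions.append(running + PROTON)

    # y ions: C-terminal suffixes (excluding the full peptide), plus water.
    y_ions = []
    running = 0.0
    for mass in reversed(masses[1:]):
        running += mass
        y_ions.append(running + WATER + PROTON)

    return b_ions, y_ions


b_ions, y_ions = fragment_ions("GAK")
print([round(m, 4) for m in b_ions])
print([round(m, 4) for m in y_ions])
```

Matching a measured spectrum then amounts to comparing observed peaks against such theoretical ladders for each candidate peptide.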
The proteogenomic track displays mass spectrometry data that have
been matched to the genomic sequence for selected cell lines, using a
workflow and software specifically designed for this purpose.
The proteogenomic tracks can be used to identify which parts of the
genome are translated into proteins, to verify which transcripts
discovered by ENCODE are protein-encoding, and to reveal new
genes and/or splice variants of genes. Of particular interest may be
their ability to reveal the translation of small open reading frames (ORFs), antisense
transcripts, or sites annotated as introns that encode proteins.
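Locating a peptide on either strand of a genomic sequence, which is what makes antisense and intronic hits visible, can be sketched as a six-frame translation search. This is a simplified illustration that assumes the standard genetic code and exact string matching; the function names are hypothetical and it does not reproduce the GFS workflow. Stop codons are written as a period, following the naming convention of this track.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "." marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_hits(dna, peptide):
    """Return (strand, start, end) hits in 0-based, half-open forward coordinates."""
    dna = dna.upper()
    n = len(dna)
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand offsets to forward coordinates.
                    start, end = n - end, n - start
                hits.append((strand, start, end))
                pos = protein.find(peptide, pos + 1)
    return hits


print(six_frame_hits("ATGGCCAAGTAA", "MAK"))
print(six_frame_hits("TTACTTGGCCAT", "MAK"))
```

The number of hits returned for a peptide across the whole genome corresponds to the idea behind a repeat count: a peptide that maps to many loci is a less specific indicator of translation at any one of them.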
Display Conventions and Configuration
The display for this track shows peptide mappings as contiguous,
rectangular items. These items are rendered in grayscale according to the
score, with darker items representing higher-confidence peptide mappings.
The name of each item is the amino acid
sequence of the peptide. If a period (.) appears at the end of a
name, it signifies a stop codon.
Beyond the displayed genomic coordinates, several additional
fields are available for each track item.
- The Raw Score reflects the strength of the peptide mapping, whereas the
Score field reflects the confidence of the mapping. The Score field is
computed as -100 × log10(E-Value) for the peptide mapping; scores of 200
or greater have an estimated 5% false discovery rate (FDR), and scores of
230 or greater have an estimated 1% FDR. The Raw Score offers an
additional level of confidence: raw scores of 300 or greater have an
estimated 5% FDR. Note that Raw Score is not normalized for the length of
the peptide mapping, while Score is.
Consequently, short mappings might have a
strong Raw Score but a weak Score.
- The Spectrum ID is a
semi-unique identifier of the spectrum associated with the peptide
mapping, and can be used to track the origins of the mapping.
- The Peptide Rank
indicates the rank of each peptide/spectrum mapping. A spectrum can
be chimeric, containing more than one peptide, so a single spectrum can be
mapped with confidence to two or more distinct peptides. Peptides
with ranks greater than 3 were removed from the track.
- The Peptide Repeat Count
indicates the number of places in the genome that match the peptide
sequence. This reflects the uniqueness of the peptide mapping in the
genome. Mappings to highly duplicated regions have a
high Peptide Repeat Count. Peptides whose sequence matched more than
10 places in the genome were removed from the track.
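The relationship between E-Value, Score, and the filters described in the list above can be sketched as follows. The formula and cutoffs are taken from the field descriptions; the function and parameter names are hypothetical and do not correspond to any actual column names in the data files.

```python
import math


def score_from_evalue(e_value):
    """Score field as described above: -100 * log10(E-Value)."""
    return -100.0 * math.log10(e_value)


def fdr_tier(score):
    """Map a Score to the estimated FDR tier quoted in the field description."""
    if score >= 230:
        return "estimated 1% FDR"
    if score >= 200:
        return "estimated 5% FDR"
    return "outside the 5% FDR cutoff"


def keep_mapping(peptide_rank, repeat_count, max_rank=3, max_repeats=10):
    """Apply the rank and repeat-count limits described above (hypothetical helper)."""
    return peptide_rank <= max_rank and repeat_count <= max_repeats


for e_value in (0.05, 0.01, 0.001):
    score = score_from_evalue(e_value)
    print(e_value, round(score, 1), fdr_tier(score))
```

Because the Score is a negative logarithm of the E-Value, an E-Value of 0.01 corresponds to a Score of 200, which is why the track's E-Value cutoff and its 5% FDR Score cutoff describe the same boundary.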
ENCODE cell lines K562 and GM12878 were used for large-scale
proteomic analysis. Cell lines were cultured according to standard
ENCODE cell culture protocols,
and in-gel digestion was completed according to the standard
protocol (Shevchenko, et al. 2006).
The proteolytic enzyme trypsin was used to digest the proteins
into short peptides suitable for MS/MS analysis. Trypsin is a
common protease that typically cleaves proteins after arginine or lysine. The
metadata parameter enzyme specifies the proteolytic enzyme used
for digestion. Tandem mass spectrometry (RPLC-MS/MS) analysis was then
performed on an Eksigent Ultra-LTQ Orbitrap system. However, because
digestion is not fully efficient, trypsin does not always cleave at arginine or lysine, so there
may be peptides that include an uncleaved Arg/Lys site. The number of such
missed cleavages allowed in the search is described by the metadata
We performed proteogenomic mapping (Jaffe, et al. 2004) against the whole
human genomic sequence (UCSC hg19), with two missed cleavages allowed, via the
genome fingerprint scanning (GFS) program (Giddings, et al. 2003) and newly
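The effect of allowing missed cleavages can be illustrated with a simplified in-silico digest. This sketch assumes the plain rule stated above (cleavage after every arginine or lysine) and ignores refinements such as reduced cleavage before proline; the function name `tryptic_peptides` is hypothetical and is not part of GFS.

```python
def tryptic_peptides(protein, missed_cleavages=2):
    """List peptides from cleaving after K or R, allowing missed cleavages.

    Hypothetical helper for illustration only.
    """
    if not protein:
        return []

    # Boundaries between fragments: sequence start, each position after K/R, sequence end.
    sites = [0]
    sites += [i + 1 for i, aa in enumerate(protein)
              if aa in "KR" and i + 1 < len(protein)]
    sites.append(len(protein))

    peptides = []
    last = len(sites) - 1
    for i in range(last):
        # Join up to (missed_cleavages + 1) consecutive fragments.
        for j in range(i + 1, min(i + 1 + missed_cleavages, last) + 1):
            peptides.append(protein[sites[i]:sites[j]])
    return peptides


print(tryptic_peptides("AKCRDE", missed_cleavages=0))
print(tryptic_peptides("AKCRDE", missed_cleavages=2))
```

Each additional allowed missed cleavage enlarges the set of candidate peptides that a spectrum can be matched against, which increases sensitivity at the cost of a larger search space.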
We used HMM_Score (Khatun, et al. 2008) to accurately match MS/MS
spectra to their corresponding genome sequences. An E-value is
calculated for each match; it estimates the number of results at the given score
level that would be expected by random chance. We then empirically
derived the false discovery rate for a given E-Value using a decoy
database search. Only matches falling within the 5%
FDR threshold (E-Value <0.01) are included in the track. The results at
10% FDR (E-Value <0.05) are available on the Downloads page as Raw Signal.
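A common way to derive an empirical FDR from a decoy search is to compare how many decoy matches and how many target matches pass a given E-Value threshold. The sketch below assumes that simple ratio estimator; the exact estimator used for this track is not stated here, and the function name and example values are hypothetical.

```python
def estimate_fdr(target_evalues, decoy_evalues, threshold):
    """Estimate FDR at an E-Value threshold as (decoy hits) / (target hits).

    Assumes the simple target-decoy ratio; other estimators exist.
    """
    target_hits = sum(1 for e in target_evalues if e < threshold)
    decoy_hits = sum(1 for e in decoy_evalues if e < threshold)
    if target_hits == 0:
        return 0.0  # nothing accepted, so nothing can be a false discovery
    return decoy_hits / target_hits


# Made-up E-Values for illustration only.
targets = [0.001, 0.002, 0.005, 0.02, 0.2]
decoys = [0.008, 0.3, 0.5]

for threshold in (0.004, 0.01, 0.05):
    print(threshold, estimate_fdr(targets, decoys, threshold))
```

Scanning a range of thresholds with such a function and choosing the loosest one whose estimated FDR stays at or below the desired level is one way to arrive at an E-Value cutoff for a target FDR.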
This is Release 2 (July 2012). It contains a total of seven Proteogenomics experiments,
plus one additional experiment available by download only. Unlike other ENCODE data,
these data are archived not at GEO but at Proteome Commons.
The first 32 digits of the Tranche Hash for each data set are stored as the labExpId.
Proteogenomic mapping: Dr. Jainab Khatun, Brian Risk, Mustaque
Ahamed, Christopher Maier, Dr. John Wrobel and Dennis Crenshaw (Giddings Lab).
Proteomic analysis: Drs. Yanbao Yu and Ling Xie (Chen Lab).
Giddings MC, Shah AA, Gesteland R, Moore B.
Genome-based peptide fingerprint scanning.
Proc Natl Acad Sci U S A. 2003 Jan 7;100(1):20-5.
Jaffe JD, Berg HC, Church GM.
Proteogenomic mapping as a complementary method to perform genome annotation.
Proteomics. 2004 Jan;4(1):59-77.
Khatun J, Hamlett E, Giddings MC.
Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification.
Bioinformatics. 2008 Mar 1;24(5):674-81.
Shevchenko A, Tomas H, Havlis J, Olsen JV, Mann M.
In-gel digestion for mass spectrometric characterization of proteins and proteomes.
Nat Protoc. 2006;1(6):2856-60.
Data Release Policy
Data users may freely use ENCODE data, but may not, without prior
consent, submit publications that use an unpublished ENCODE dataset until
nine months following the release of the dataset. This date is listed in
the Restricted Until column on the track configuration page and
the download page. The full data release policy for ENCODE is available