Schema for RepeatMasker - Repeating Elements by RepeatMasker
  Database: hg38    Primary Table: rmsk    Row Count: 5,683,690   Data last updated: 2022-10-19
Format description: RepeatMasker .out record
On download server: MariaDB table dump directory
field      example        SQL type              description
bin        585            smallint(5) unsigned  Indexing field to speed chromosome range queries.
swScore    463            int(10) unsigned      Smith Waterman alignment score
milliDiv   13             int(10) unsigned      Base mismatches in parts per thousand
milliDel   6              int(10) unsigned      Bases deleted in parts per thousand
milliIns   17             int(10) unsigned      Bases inserted in parts per thousand
genoName   chr1           varchar(255)          Genomic sequence name
genoStart  10000          int(10) unsigned      Start in genomic sequence
genoEnd    10468          int(10) unsigned      End in genomic sequence
genoLeft   -248945954     int(11)               -#bases after match in genomic sequence
strand     +              char(1)               Relative orientation + or -
repName    (TAACCC)n      varchar(255)          Name of repeat
repClass   Simple_repeat  varchar(255)          Class of repeat
repFamily  Simple_repeat  varchar(255)          Family of repeat
repStart   1              int(11)               Start (if strand is +) or -#bases after match (if strand is -) in repeat sequence
repEnd     471            int(11)               End in repeat sequence
repLeft    0              int(11)               -#bases after match (if strand is +) or start (if strand is -) in repeat sequence
id         1              char(1)               First digit of id field in RepeatMasker .out file. Best ignored.
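
The bin field can be reproduced from genoStart and genoEnd. The sketch below assumes the standard (non-extended) UCSC binning scheme, in which the smallest bins cover 128 kb and each higher level is eight times larger; the function names are illustrative and not part of any UCSC tool.

```python
# Standard UCSC binning scheme (assumed): 5 levels, smallest bin = 2**17 bases.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # 128 kb bins at the finest level
BIN_NEXT_SHIFT = 3    # each coarser level is 8x larger


def bin_from_range(start, end):
    """Return the smallest bin that fully contains the 0-based, half-open range."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("range does not fit in the standard binning scheme")


def overlapping_bins(start, end):
    """Return every bin, at every level, that could hold items overlapping the range."""
    bins = []
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins


print(bin_from_range(10000, 10468))    # the example row's range
print(overlapping_bins(10000, 10468))  # bins to test in a range query
```

A range query would typically restrict bin to the values returned by overlapping_bins and then filter on genoStart and genoEnd, so that the index on bin narrows the scan.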

Sample Rows
 
bin  swScore  milliDiv  milliDel  milliIns  genoName  genoStart  genoEnd  genoLeft    strand  repName    repClass       repFamily      repStart  repEnd  repLeft  id
585  463      13        6         17        chr1      10000      10468    -248945954  +       (TAACCC)n  Simple_repeat  Simple_repeat  1         471     0        1
585  3612     114       215       13        chr1      10468      11447    -248944975  -       TAR1       Satellite      telo           -399      1712    483      2
585  484      251       132       0         chr1      11504      11675    -248944747  -       L1MC5a     LINE           L1             -2382     395     199      3
585  239      294       19        10        chr1      11677      11780    -248944642  -       MER5B      DNA            hAT-Charlie    -74       104     1        4
585  318      230       37        0         chr1      15264      15355    -248941067  -       MIR3       SINE           MIR            -119      143     49       5
585  18       232       0         19        chr1      15797      15849    -248940573  +       (TGCTCC)n  Simple_repeat  Simple_repeat  1         52      0        6
585  18       137       0         0         chr1      16712      16744    -248939678  +       (TGG)n     Simple_repeat  Simple_repeat  1         32      0        7
585  239      338       129       0         chr1      18906      19048    -248937374  +       L2a        LINE           L2             2942      3104    -322     8
585  994      312       60        25        chr1      19971      20405    -248936017  +       L3         LINE           CR1            2680      3129    -970     9
585  270      331       7         27        chr1      20530      20679    -248935743  +       Plat_L3    LINE           CR1            2802      2947    -639     1
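
Rows like these can be loaded into typed records, which also makes the strand-dependent meaning of repStart and repLeft easier to handle. The following is a minimal sketch that assumes tab-separated input with the 17 columns in the order listed in the schema; RmskRow, parse_row, and consensus_span are illustrative names, not part of any UCSC tool.

```python
from typing import NamedTuple


class RmskRow(NamedTuple):
    bin: int
    swScore: int
    milliDiv: int
    milliDel: int
    milliIns: int
    genoName: str
    genoStart: int
    genoEnd: int
    genoLeft: int
    strand: str
    repName: str
    repClass: str
    repFamily: str
    repStart: int
    repEnd: int
    repLeft: int
    id: str


# One converter per column, in schema order.
_TYPES = (int,) * 5 + (str,) + (int,) * 3 + (str,) * 4 + (int,) * 3 + (str,)


def parse_row(line):
    """Parse one tab-separated rmsk line (assumed layout) into an RmskRow."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(_TYPES):
        raise ValueError(f"expected {len(_TYPES)} columns, got {len(fields)}")
    return RmskRow(*(conv(value) for conv, value in zip(_TYPES, fields)))


def consensus_span(row):
    """Return (start, end, bases_remaining) within the repeat sequence.

    On the + strand repStart is the start and repLeft is minus the number of
    bases after the match; on the - strand the two fields swap roles.
    """
    if row.strand == "+":
        return row.repStart, row.repEnd, -row.repLeft
    return row.repLeft, row.repEnd, -row.repStart


row = parse_row(
    "585\t239\t294\t19\t10\tchr1\t11677\t11780\t-248944642\t-\t"
    "MER5B\tDNA\thAT-Charlie\t-74\t104\t1\t4"
)
print(row.repName, consensus_span(row))
```

Adding repEnd to the number of bases remaining gives the length of the repeat sequence that the match was aligned against, under the field definitions above.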

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
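
As a small illustration of the coordinate convention, the sketch below converts a 0-based start and its end into the 1-based position string used in the browser, and recovers the sequence length from genoEnd and genoLeft. It assumes the end coordinate is exclusive (so that length is end minus start); the function names are illustrative.

```python
def to_one_based(chrom, start, end):
    """Format a 0-based, half-open range as a 1-based, inclusive position."""
    return f"{chrom}:{start + 1}-{end}"


def feature_length(start, end):
    """Length in bases of a 0-based, half-open range."""
    return end - start


def sequence_size(geno_end, geno_left):
    """genoLeft is minus the number of bases after the match, so subtract it."""
    return geno_end - geno_left


# Values from the first sample row.
print(to_one_based("chr1", 10000, 10468))
print(feature_length(10000, 10468))
print(sequence_size(10468, -248945954))
```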

RepeatMasker (rmsk) Track Description
 

Description

This track was created by using Arian Smit's RepeatMasker program, which screens DNA sequences for interspersed repeats and low complexity DNA sequences. The program outputs a detailed annotation of the repeats that are present in the query sequence (represented by this track), as well as a modified version of the query sequence in which all the annotated repeats have been masked (generally available on the Downloads page). RepeatMasker uses the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Repbase Update is described in Jurka (2000) in the References section below.

This track and the masking information in our hg38 genome download FASTA files were created in 2010 with the original RepBase library from 2010-03-02 and RepeatMasker 3.0.1. Since April 2019, RepBase has been under a commercial license, so we cannot distribute it or update the track using the RepBase library without a license. For this reason, and for compatibility with past results given how central the masking is to many other annotations, we decided not to update the repeat masking of hg38. However, you can view the small differences between RepeatMasker 3/RepBase from 2010 and RepeatMasker 4/DFAM from 2020 using the track "RepeatMasker Viz" in the same track group. It contains two subtracks, one with the old data and one with the new. These tracks also have many more visualisation options than the original RepeatMasker track.

However, the last update time shown for this track at UCSC is later than 2010, because we had to add repeat masking annotations to the rarely used _alt and _fix "patch" sequences of the hg38 genome. The repeat masking annotations of the main chromosomes were unaffected and have not changed since 2010. For more information on genome patches, see our blog post.

Display Conventions and Configuration

In full display mode, this track displays up to ten different classes of repeats:

  • Short interspersed nuclear elements (SINE), which include Alu elements
  • Long interspersed nuclear elements (LINE)
  • Long terminal repeat elements (LTR), which include retroposons
  • DNA repeat elements (DNA)
  • Simple repeats (micro-satellites)
  • Low complexity repeats
  • Satellite repeats
  • RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA)
  • Other repeats, which include class RC (Rolling Circle)
  • Unknown

The level of color shading in the graphical display reflects the amount of base mismatch, base deletion, and base insertion associated with a repeat element. The higher the combined number of these, the lighter the shading.
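
The combined value that drives the shading can be computed directly from the milliDiv, milliDel, and milliIns fields. The sketch below adds them up and maps the sum to a gray level; the linear mapping, the number of levels, and the 1000 per-thousand ceiling are assumptions made for illustration, since the exact scale used by the browser is not described here.

```python
def combined_permille(milli_div, milli_del, milli_ins):
    """Sum of mismatches, deletions, and insertions in parts per thousand."""
    return milli_div + milli_del + milli_ins


def gray_level(combined, max_permille=1000, levels=10):
    """Hypothetical linear mapping: 0 is the darkest shade, levels - 1 the lightest."""
    clamped = max(0, min(combined, max_permille))
    return min(levels - 1, clamped * levels // max_permille)


# First two sample rows: (TAACCC)n and TAR1.
for name, div, dele, ins in [("(TAACCC)n", 13, 6, 17), ("TAR1", 114, 215, 13)]:
    total = combined_permille(div, dele, ins)
    print(name, total, gray_level(total))
```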

A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that the curator was unsure of the classification. At some point in the future, either the "?" will be removed or the classification will be changed.

Methods

Data are generated using the RepeatMasker -s flag. Additional flags may be used for certain organisms. Repeats are soft-masked. Alignments may extend through repeats, but are not permitted to initiate in them. See the FAQ for more information.
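
Soft-masking means that repeat bases are written in lowercase in the sequence rather than replaced with N. As a minimal sketch, the lowercase runs of a sequence string can be recovered as 0-based, half-open intervals; the function names and the toy sequence are illustrative only.

```python
import re

# Lowercase nucleotide runs mark soft-masked (repeat) sequence.
_MASKED_RUN = re.compile(r"[acgtn]+")


def soft_masked_intervals(seq):
    """Return (start, end) pairs, 0-based and half-open, for each lowercase run."""
    return [(m.start(), m.end()) for m in _MASKED_RUN.finditer(seq)]


def masked_fraction(seq):
    """Fraction of the sequence that is soft-masked."""
    if not seq:
        return 0.0
    masked = sum(end - start for start, end in soft_masked_intervals(seq))
    return masked / len(seq)


toy = "ACGTacgtNNacAC"  # toy example, not real genome sequence
print(soft_masked_intervals(toy))
print(masked_fraction(toy))
```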

Credits

Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and repeat libraries used to generate this track.

References

Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. http://www.repeatmasker.org. 1996-2010.

Repbase Update is described in:

Jurka J. Repbase Update: a database and an electronic journal of repetitive elements. Trends Genet. 2000 Sep;16(9):418-420. PMID: 10973072

For a discussion of repeats in mammalian genomes, see:

Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999 Dec;9(6):657-63. PMID: 10607616

Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996 Dec;6(6):743-8. PMID: 8994846