Schema for CHM13 alignments - CHM13 (GCA_009914755.4) v1_nfLO liftOver alignments
  Database: hg38    Primary Table: chm13LiftOver    Data last updated: 2022-03-30
Big Bed File: /gbdb/hg38/bbi/chm13LiftOver/hg38-chm13v2.ncbi-qnames.over.chain.bb
Item Count: 747
Format description: bigChain pairwise alignment
field        example      description
chrom        chr1         Reference sequence chromosome or scaffold
chromStart   72346156     Start position in chromosome
chromEnd     103697915    End position in chromosome
name         17           Name or ID of item, ideally both human readable and unique
score        1000         Score (0-1000)
strand       +            + or - for strand
tSize        248956422    size of target sequence
qName        CP068277.2   name of query sequence
qSize        248387328    size of query sequence
qStart       72179591     start of alignment on query sequence
qEnd         103546781    end of alignment on query sequence
chainScore   31316542     score from chain
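
As an illustration of the schema above, the following minimal Python sketch parses one bigChain row into a typed record. It assumes the row is available as tab-separated text (for example, as exported by a tool such as bigBedToBed); the class name BigChainRow and the function parse_bigchain_row are hypothetical names introduced here, not part of any UCSC tool.

```python
from dataclasses import dataclass


@dataclass
class BigChainRow:
    """One row of the bigChain table, following the field list above."""
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str
    t_size: int
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    chain_score: int


def parse_bigchain_row(line: str) -> BigChainRow:
    """Parse a tab-separated bigChain row (hypothetical helper)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return BigChainRow(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        int(f[6]), f[7], int(f[8]), int(f[9]), int(f[10]), int(f[11]),
    )


# The example values from the schema table, joined with tabs.
row = parse_bigchain_row(
    "chr1\t72346156\t103697915\t17\t1000\t+\t248956422\t"
    "CP068277.2\t248387328\t72179591\t103546781\t31316542"
)
target_span = row.chrom_end - row.chrom_start  # bases covered on GRCh38/hg38
query_span = row.q_end - row.q_start           # bases covered on CHM13
print(target_span, query_span)
```

Comparing the two spans gives a quick sense of how much a chain stretches or compresses between the assemblies.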

Sample Rows
 
chrom  chromStart  chromEnd   name  score  strand  tSize      qName       qSize      qStart     qEnd       chainScore
chr1   72346156    103697915  17    1000   +       248956422  CP068277.2  248387328  72179591   103546781  31316542
chr1   103697915   108241242  18    1000   +       248956422  CP068277.2  248387328  103735057  108274747  4527084
chr1   108244913   108244941  61    1000   -       248956422  CP068277.2  248387328  139904495  139904523  28
chr1   108310529   108453102  62    1000   -       248956422  CP068277.2  248387328  139970097  140112581  142409
chr1   108483673   109682998  19    1000   +       248956422  CP068277.2  248387328  108516786  109711485  1193399
chr1   109701443   120834554  20    1000   +       248956422  CP068277.2  248387328  109711485  120842894  11110459
chr1   120836006   121686192  21    1000   +       248956422  CP068277.2  248387328  120845404  121724719  850024
chr1   121686192   121746060  22    1000   +       248956422  CP068277.2  248387328  121735918  121799856  59856
chr1   121746060   121776357  23    1000   +       248956422  CP068277.2  248387328  126005703  126047009  30201
chr1   121777326   121804742  24    1000   +       248956422  CP068277.2  248387328  126301366  126333003  27047

CHM13 alignments (chm13LiftOver) Track Description
 

Description

These tracks show the one-to-one v1_nfLO alignments of the GRCh38/hg38 assembly to the T2T-CHM13 v2.0 assembly.

Display Conventions

The track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the CHM13 v2.0 assembly or an insertion in the GRCh38/hg38 assembly. Double lines represent more complex gaps that involve substantial sequence in both assemblies.
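
One simplified way to read this convention is in terms of the gap sizes between two consecutive aligned blocks of a chain. The sketch below is an interpretation only: the function name gap_style and the strict zero threshold are assumptions made for illustration, and the browser's actual drawing rule may apply different thresholds.

```python
def gap_style(target_gap: int, query_gap: int) -> str:
    """Classify the connector between two consecutive aligned blocks.

    target_gap: unaligned bases on GRCh38/hg38 between the blocks.
    query_gap:  unaligned bases on CHM13 v2.0 between the blocks.
    Simplified rule (an assumption, not the browser's code).
    """
    if target_gap < 0 or query_gap < 0:
        raise ValueError("gap sizes must be non-negative")
    if target_gap == 0:
        # Blocks are adjacent on the displayed assembly, so there is
        # no space in which to draw a connecting line.
        return "adjacent"
    if query_gap == 0:
        # Sequence present only in GRCh38/hg38: a deletion in CHM13
        # or an insertion in GRCh38/hg38.
        return "single"
    # Unaligned sequence on both assemblies.
    return "double"


print(gap_style(500, 0))     # gap on GRCh38/hg38 only
print(gap_style(500, 1200))  # gap on both assemblies
```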

Methods

GRCh38/hg38 pre-processing

To prevent ambiguous alignments, all false duplications, as determined by the Genome in a Bottle Consortium (GCA_000001405.15_GRCh38_GRC_exclusions_T2Tv2.bed), as well as the GRCh38 modeled centromeres, were masked from the GRCh38/hg38 primary assembly. In addition, unlocalized and unplaced (random) contigs were removed.

Alignment and Chain Creation

For the minimap2-based pipeline, the initial chain file was generated using nf-LO v1.5.1 with minimap2 v2.24 alignments. These chains were then split at all locations that contained unaligned segments greater than 1 kbp or gaps greater than 10 kbp. Split chain files were then converted to PAF format with extended CIGAR strings using chaintools (v0.1), and alignments between nonhomologous chromosomes were removed. The trim-paf operation of rustybam (v0.1.29) was then used to remove overlapping alignments, first in the query sequence and then in the target sequence, to create 1:1 alignments. PAF alignments were converted back to the chain format with paf2chain (commit f68eeca), and finally, chaintools was used to generate the inverted chain file.
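
The removal of alignments between nonhomologous chromosomes amounts to keeping only PAF records whose query and target sequence names match (columns 1 and 6 of the PAF format). A minimal Python sketch of that filter follows; the function names and the two example records are hypothetical, and the approach assumes both assemblies use the same chromosome names.

```python
def same_chromosome(paf_line: str) -> bool:
    """True if the PAF record aligns a sequence to its same-named homolog."""
    fields = paf_line.rstrip("\n").split("\t")
    # PAF column 1 is the query name, column 6 is the target name.
    return fields[0] == fields[5]


def filter_homologous(paf_lines):
    """Keep only records whose query and target names are identical."""
    return [line for line in paf_lines if same_chromosome(line)]


# Two made-up PAF records (12 mandatory columns each) for illustration.
records = [
    "chr1\t1000\t0\t500\t+\tchr1\t1200\t100\t600\t480\t500\t60",
    "chr1\t1000\t500\t900\t+\tchr5\t2000\t0\t400\t390\t400\t60",
]
kept = filter_homologous(records)
print(len(kept))
```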

Full commands with parameters used were:


    nextflow run main.nf --source GRCh38.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
    python chaintools/src/split.py -c input.chain -o input-split.chain
    python chaintools/src/to_paf.py -c input-split.chain -t target.fa -q query.fa -o input-split.paf
    awk '$1==$6' input-split.paf | rb break-paf --max-size 10000 | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
    paf2chain -i out.paf > out.chain
    python chaintools/src/invert.py -c out.chain -o out_inverted.chain

The above process does not add chain IDs or scores. The UCSC utilities chainMergeSort and chainScore are used to update the chains:


    chainMergeSort out.chain | chainScore stdin chm13v2.0.2bit hg38.2bit chm13v2.0-hg38.chain
    chainMergeSort out_inverted.chain | chainScore stdin hg38.2bit chm13v2.0.2bit hg38-chm13v2.0.chain
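
For orientation, the score and ID that these utilities fill in are the second and last fields of each chain header line in the UCSC chain format. The Python sketch below parses such a header; the names ChainHeader and parse_chain_header are hypothetical, and the example line is assembled from the first sample row above under the assumption that the row's name field is the chain ID.

```python
from typing import NamedTuple


class ChainHeader(NamedTuple):
    """Fields of a UCSC chain header line, in file order after 'chain'."""
    score: int
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def parse_chain_header(line: str) -> ChainHeader:
    """Parse a 'chain ...' header line (hypothetical helper)."""
    f = line.split()
    if len(f) != 13 or f[0] != "chain":
        raise ValueError("not a chain header line")
    return ChainHeader(
        int(f[1]), f[2], int(f[3]), f[4], int(f[5]), int(f[6]),
        f[7], int(f[8]), f[9], int(f[10]), int(f[11]), f[12],
    )


# Illustrative header built from the first sample row (an assumption).
header = parse_chain_header(
    "chain 31316542 chr1 248956422 + 72346156 103697915 "
    "CP068277.2 248387328 + 72179591 103546781 17"
)
print(header.score, header.chain_id)
```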

Rustybam trim-paf uses dynamic programming and the CIGAR string to find an optimal splitting point between overlapping alignments in the query sequence. It starts its trimming with the largest overlap and then recursively trims smaller overlaps.
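
To give a feel for the idea of choosing a split point from the CIGAR string, the following Python sketch scores each query base in the overlap of two alignments and picks the cut that maximizes the combined score. This is a deliberately simplified illustration and not rustybam's implementation: the scoring values, the function names, and the handling of deletions are all assumptions.

```python
import re


def query_scores(cigar: str, match=1, mismatch=-1, gap=-1):
    """Expand an extended CIGAR into one score per query base.

    '=' and 'X' consume query and target; 'I' consumes query only.
    'D' consumes no query bases and is ignored in this sketch.
    """
    scores = []
    for length, op in re.findall(r"(\d+)([=XID])", cigar):
        n = int(length)
        if op == "=":
            scores += [match] * n
        elif op == "X":
            scores += [mismatch] * n
        elif op == "I":
            scores += [gap] * n
    return scores


def best_split(left_scores, right_scores):
    """Find k so the left alignment keeps overlap[:k] and the right
    alignment keeps overlap[k:], maximizing the total score."""
    if len(left_scores) != len(right_scores):
        raise ValueError("overlap regions must have equal query length")
    current = sum(right_scores)  # k = 0: right alignment keeps everything
    best_k, best_total = 0, current
    for k, (l, r) in enumerate(zip(left_scores, right_scores), start=1):
        current += l - r  # hand query base k-1 from right to left
        if current > best_total:
            best_k, best_total = k, current
    return best_k, best_total


# Overlap of 5 query bases: the left alignment degrades toward its end,
# the right alignment is poor at its start.
left = query_scores("3=2X")
right = query_scores("2X3=")
split = best_split(left, right)
print(split)
```

In this toy case the cut falls where the left alignment stops being the better of the two, so each query base is assigned to exactly one alignment.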

Results were validated by using chaintools to confirm that there were no overlapping sequences with respect to both CHM13v2.0 and GRCh38 in the released chain file. In addition, trimmed alignments were visually inspected with SafFire to confirm their quality.
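
The kind of check described here can be sketched as a scan over sorted intervals. The Python below is an illustrative stand-in, not the chaintools code; the function name find_overlaps is hypothetical, and intervals are assumed to be half-open (start inclusive, end exclusive), so adjacent chains that share a boundary do not count as overlapping.

```python
def find_overlaps(intervals):
    """Return pairs of overlapping (name, start, end) intervals.

    Intervals are half-open; each interval is compared with the one
    reaching furthest to the right so far on the same sequence.
    """
    overlaps = []
    furthest = None
    for name, start, end in sorted(intervals):
        current = (name, start, end)
        if furthest and furthest[0] == name and start < furthest[2]:
            overlaps.append((furthest, current))
        if furthest is None or furthest[0] != name or end > furthest[2]:
            furthest = current
    return overlaps


# Target intervals of the first three sample rows above.
sample = [
    ("chr1", 72346156, 103697915),
    ("chr1", 103697915, 108241242),
    ("chr1", 108244913, 108244941),
]
clean = find_overlaps(sample)

# Adding a made-up interval that straddles the first boundary.
dirty = find_overlaps(sample + [("chr1", 103697000, 103698000)])
print(len(clean), len(dirty))
```

The same function can be run once on target coordinates and once on query coordinates to check both assemblies.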

Chains were swapped to make GRCh38/hg38 the target.

Credits

The v1_nfLO chains were generated by Nae-Chyun Chen <naechyun.chen@gmail.com> and Mitchell Vollger <mvollger@uw.edu>.

References

Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. bioRxiv, 2021.