The string decomposition problem and its applications to centromere analysis and assembly

Standard

The string decomposition problem and its applications to centromere analysis and assembly. / Дворкина, Татьяна Евгеньевна; Бзикадзе, Андрей Важевич; Певзнер, Павел Аркадьевич.

In: Bioinformatics, Vol. 36, No. 1, 01.07.2020, pp. i93-i101.

Research output: Contribution to journal › Article › Peer-reviewed

Author

Дворкина, Татьяна Евгеньевна ; Бзикадзе, Андрей Важевич ; Певзнер, Павел Аркадьевич. / The string decomposition problem and its applications to centromere analysis and assembly. In: Bioinformatics. 2020 ; Vol. 36, No. 1. pp. i93-i101.

BibTeX

@article{0fde65b04ccc4d5fb2bcf320e37d59c3,

title = "The string decomposition problem and its applications to centromere analysis and assembly",

abstract = "MOTIVATION: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate translation of reads into the monomer alphabet turns the notoriously difficult problem of assembling centromeres from reads (in the nucleotide alphabet) into a more tractable problem of assembling centromeres from translated reads. RESULTS: We describe a StringDecomposer (SD) algorithm for solving this problem, benchmark it on the set of long error-prone Oxford Nanopore reads generated by the Telomere-to-Telomere consortium and identify a novel (rare) monomer that extends the set of known X-chromosome specific monomers. Our identification of a novel monomer emphasizes the importance of identification of all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied SD to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. SD opens a possibility to generate a complete set of human monomers and HORs for using in the ongoing efforts to generate the complete assembly of the human genome. AVAILABILITY AND IMPLEMENTATION: StringDecomposer is publicly available on https://github.com/ablab/stringdecomposer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.",

keywords = "EVOLUTION, SATELLITE DNA, TANDEM REPEATS",

author = "Дворкина, {Татьяна Евгеньевна} and Бзикадзе, {Андрей Важевич} and Певзнер, {Павел Аркадьевич}",

year = "2020",

month = jul,

day = "1",

doi = "10.1093/bioinformatics/btaa454",

language = "English",

volume = "36",

pages = "i93--i101",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "1",

}

RIS

TY - JOUR

T1 - The string decomposition problem and its applications to centromere analysis and assembly

AU - Дворкина, Татьяна Евгеньевна

AU - Бзикадзе, Андрей Важевич

AU - Певзнер, Павел Аркадьевич

PY - 2020/7/1

Y1 - 2020/7/1

N2 - MOTIVATION: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate translation of reads into the monomer alphabet turns the notoriously difficult problem of assembling centromeres from reads (in the nucleotide alphabet) into a more tractable problem of assembling centromeres from translated reads. RESULTS: We describe a StringDecomposer (SD) algorithm for solving this problem, benchmark it on the set of long error-prone Oxford Nanopore reads generated by the Telomere-to-Telomere consortium and identify a novel (rare) monomer that extends the set of known X-chromosome specific monomers. Our identification of a novel monomer emphasizes the importance of identification of all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied SD to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. SD opens a possibility to generate a complete set of human monomers and HORs for using in the ongoing efforts to generate the complete assembly of the human genome. AVAILABILITY AND IMPLEMENTATION: StringDecomposer is publicly available on https://github.com/ablab/stringdecomposer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

AB - MOTIVATION: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate translation of reads into the monomer alphabet turns the notoriously difficult problem of assembling centromeres from reads (in the nucleotide alphabet) into a more tractable problem of assembling centromeres from translated reads. RESULTS: We describe a StringDecomposer (SD) algorithm for solving this problem, benchmark it on the set of long error-prone Oxford Nanopore reads generated by the Telomere-to-Telomere consortium and identify a novel (rare) monomer that extends the set of known X-chromosome specific monomers. Our identification of a novel monomer emphasizes the importance of identification of all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied SD to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. SD opens a possibility to generate a complete set of human monomers and HORs for using in the ongoing efforts to generate the complete assembly of the human genome. AVAILABILITY AND IMPLEMENTATION: StringDecomposer is publicly available on https://github.com/ablab/stringdecomposer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

KW - EVOLUTION

KW - SATELLITE DNA

KW - TANDEM REPEATS

UR - http://www.scopus.com/inward/record.url?scp=85087865288&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/b99c9ff0-c959-30d8-b0ad-deec86841502/

U2 - 10.1093/bioinformatics/btaa454

DO - 10.1093/bioinformatics/btaa454

M3 - Article

C2 - 32657390

VL - 36

SP - i93-i101

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 1

ER -
