Research output: Contribution to journal › Article › peer-review
Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage. / Barbitoff, Yury A.; Polev, Dmitrii E.; Glotov, Andrey S.; Serebryakova, Elena A.; Shcherbakova, Irina V.; Kiselev, Artem M.; Kostareva, Anna A.; Glotov, Oleg S.; Predeus, Alexander V.
In: Scientific Reports, Vol. 10, No. 1, 2057, 6 February 2020.
TY - JOUR
T1 - Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage
AU - Barbitoff, Yury A.
AU - Polev, Dmitrii E.
AU - Glotov, Andrey S.
AU - Serebryakova, Elena A.
AU - Shcherbakova, Irina V.
AU - Kiselev, Artem M.
AU - Kostareva, Anna A.
AU - Glotov, Oleg S.
AU - Predeus, Alexander V.
N1 - Funding Information: We thank Anna Shuvalova and Olga Romanova for help in library preparation. This research was done using equipment of Biobank of the Research Park of SPBU. The research was supported by Russian Science Foundation (grants no. 14-50-00069 and 18-75-00006), CAF Charity Foundation, and D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, project 558-2019-0012 (АААА-А19119021290033-1) of FSBSI. We also thank Resource Center “Computational Center” of Saint Petersburg State University (project no. 110-7198-609) for providing computing resources and data storage. Publisher Copyright: © 2020, The Author(s). Copyright: Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2020/2/6
Y1 - 2020/2/6
N2 - Advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole exome (WES) and whole genome (WGS) sequencing, are often debated. WES dominated large-scale resequencing projects because of lower cost and easier data storage and processing. Rapid development of third-generation sequencing methods and novel exome sequencing kits creates the need for a robust statistical framework allowing informative and easy performance comparison of the emerging methods. In our study we developed a set of statistical tools to systematically assess coverage of coding regions provided by several modern WES platforms, as well as PCR-free WGS. We identified a substantial problem in most previously published comparisons, which did not account for mappability limitations of short reads. Using regression analysis and simple machine learning, as well as several novel metrics of coverage evenness, we analyzed the contribution from the major determinants of CDS coverage. Contrary to a common view, most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe design rather than sequence composition. We also identified a ~500 kb region of the human exome that could not be effectively characterized using short-read technology and should receive special attention during variant analysis. Using our novel metrics of sequencing coverage, we identified the main determinants of WES and WGS performance. Overall, our study points out avenues for improvement of enrichment-based methods and development of novel approaches that would maximize variant discovery at optimal cost.
KW - Base Sequence/genetics
KW - Data Interpretation, Statistical
KW - Exome/genetics
KW - Genome, Human/genetics
KW - High-Throughput Nucleotide Sequencing/statistics & numerical data
KW - Humans
KW - Machine Learning
KW - Models, Genetic
KW - Open Reading Frames/genetics
KW - Regression Analysis
KW - Whole Exome Sequencing/statistics & numerical data
KW - Whole Genome Sequencing/statistics & numerical data
KW - PERFORMANCE
KW - CAPTURE
UR - http://www.scopus.com/inward/record.url?scp=85079051766&partnerID=8YFLogxK
U2 - 10.1038/s41598-020-59026-y
DO - 10.1038/s41598-020-59026-y
M3 - Article
C2 - 32029882
AN - SCOPUS:85079051766
VL - 10
JO - Scientific Reports
JF - Scientific Reports
SN - 2045-2322
IS - 1
M1 - 2057
ER -
ID: 70416819