Challenges in DNA sequence analysis on a production grid

Standard

Challenges in DNA sequence analysis on a production grid. / Van Schaik, Barbera D.C.; Santcroos, Mark; Korkhov, Vladimir; Jongejan, Aldo; Willemsen, Marcel; Van Kampen, Antoine H.C.; Olabarriaga, Sílvia D.

In: Proceedings of Science, Vol. 2012-March, 2012, p. 1-18.

Research output: Contribution to journal › Conference article › peer-review

Harvard

Van Schaik, BDC, Santcroos, M, Korkhov, V, Jongejan, A, Willemsen, M, Van Kampen, AHC & Olabarriaga, SD 2012, 'Challenges in DNA sequence analysis on a production grid', Proceedings of Science, vol. 2012-March, pp. 1-18. <http://pos.sissa.it/archive/conferences/162/039/EGICF12-EMITC2_039.pdf>

APA

Van Schaik, B. D. C., Santcroos, M., Korkhov, V., Jongejan, A., Willemsen, M., Van Kampen, A. H. C., & Olabarriaga, S. D. (2012). Challenges in DNA sequence analysis on a production grid. Proceedings of Science, 2012-March, 1-18. http://pos.sissa.it/archive/conferences/162/039/EGICF12-EMITC2_039.pdf

Vancouver

Van Schaik BDC, Santcroos M, Korkhov V, Jongejan A, Willemsen M, Van Kampen AHC et al. Challenges in DNA sequence analysis on a production grid. Proceedings of Science. 2012;2012-March:1-18.

Author

Van Schaik, Barbera D.C. ; Santcroos, Mark ; Korkhov, Vladimir ; Jongejan, Aldo ; Willemsen, Marcel ; Van Kampen, Antoine H.C. ; Olabarriaga, Sílvia D. / Challenges in DNA sequence analysis on a production grid. In: Proceedings of Science. 2012 ; Vol. 2012-March. pp. 1-18.

BibTeX

@article{1b5afe6c8d65402c9790cd01452392ea,

title = "Challenges in DNA sequence analysis on a production grid",

abstract = "Modern DNA sequencing machines produce data in the range of 1-100 GB per experiment and with ongoing technological developments this amount is rapidly increasing. The majority of experiments involve re-sequencing of human genomes and exomes to find genomic regions that are associated with disease. There are many sequence analysis tools freely available, e.g. for sequence alignment, quality control and variant detection, and frequently new tools are developed to address new biological questions. Since 2008 we use workflow technology to allow easy incorporation of such software in our data analysis pipelines, as well as to leverage grid infrastructures for the analysis of large datasets in parallel. The size of the datasets has grown from 1 GB to 70 GB in 3 years, therefore adjustments were needed to optimize these workflows. Procedures have been implemented for faster data transfer to and from grid resources, and for fault recovery at run time. A split-and-merge procedure for a frequently used sequence alignment tool, BWA, resulted in a three-fold reduction of the total time needed to complete an experiment and increased efficiency by a reduction in number of failures. The success rate was increased from 10% to 70%. In addition, steps to resubmit workflows for partly failed workflows were automated, which saved user intervention. Here we present our current procedure of analyzing data from DNA sequencing experiments, comment on the experiences and focus on the improvements needed to scale up the analysis of genomics data at our hospital.",

author = "{Van Schaik}, {Barbera D.C.} and Mark Santcroos and Vladimir Korkhov and Aldo Jongejan and Marcel Willemsen and {Van Kampen}, {Antoine H.C.} and Olabarriaga, {S{\'i}lvia D.}",

year = "2012",

language = "English",

volume = "2012-March",

pages = "1--18",

journal = "Proceedings of Science",

issn = "1824-8039",

publisher = "Sissa Medialab Srl",

note = "2012 EGI Community Forum / EMI 2nd Technical Conference, EGICF-EMITC 2012 ; Conference date: 26-03-2012 Through 30-03-2012",

}

RIS

TY - JOUR

T1 - Challenges in DNA sequence analysis on a production grid

AU - Van Schaik, Barbera D.C.

AU - Santcroos, Mark

AU - Korkhov, Vladimir

AU - Jongejan, Aldo

AU - Willemsen, Marcel

AU - Van Kampen, Antoine H.C.

AU - Olabarriaga, Sílvia D.

PY - 2012

Y1 - 2012

N2 - Modern DNA sequencing machines produce data in the range of 1-100 GB per experiment and with ongoing technological developments this amount is rapidly increasing. The majority of experiments involve re-sequencing of human genomes and exomes to find genomic regions that are associated with disease. There are many sequence analysis tools freely available, e.g. for sequence alignment, quality control and variant detection, and frequently new tools are developed to address new biological questions. Since 2008 we use workflow technology to allow easy incorporation of such software in our data analysis pipelines, as well as to leverage grid infrastructures for the analysis of large datasets in parallel. The size of the datasets has grown from 1 GB to 70 GB in 3 years, therefore adjustments were needed to optimize these workflows. Procedures have been implemented for faster data transfer to and from grid resources, and for fault recovery at run time. A split-and-merge procedure for a frequently used sequence alignment tool, BWA, resulted in a three-fold reduction of the total time needed to complete an experiment and increased efficiency by a reduction in number of failures. The success rate was increased from 10% to 70%. In addition, steps to resubmit workflows for partly failed workflows were automated, which saved user intervention. Here we present our current procedure of analyzing data from DNA sequencing experiments, comment on the experiences and focus on the improvements needed to scale up the analysis of genomics data at our hospital.

AB - Modern DNA sequencing machines produce data in the range of 1-100 GB per experiment and with ongoing technological developments this amount is rapidly increasing. The majority of experiments involve re-sequencing of human genomes and exomes to find genomic regions that are associated with disease. There are many sequence analysis tools freely available, e.g. for sequence alignment, quality control and variant detection, and frequently new tools are developed to address new biological questions. Since 2008 we use workflow technology to allow easy incorporation of such software in our data analysis pipelines, as well as to leverage grid infrastructures for the analysis of large datasets in parallel. The size of the datasets has grown from 1 GB to 70 GB in 3 years, therefore adjustments were needed to optimize these workflows. Procedures have been implemented for faster data transfer to and from grid resources, and for fault recovery at run time. A split-and-merge procedure for a frequently used sequence alignment tool, BWA, resulted in a three-fold reduction of the total time needed to complete an experiment and increased efficiency by a reduction in number of failures. The success rate was increased from 10% to 70%. In addition, steps to resubmit workflows for partly failed workflows were automated, which saved user intervention. Here we present our current procedure of analyzing data from DNA sequencing experiments, comment on the experiences and focus on the improvements needed to scale up the analysis of genomics data at our hospital.

UR - http://www.scopus.com/inward/record.url?scp=84878834394&partnerID=8YFLogxK

M3 - Conference article

VL - 2012-March

SP - 1

EP - 18

JO - Proceedings of Science

JF - Proceedings of Science

SN - 1824-8039

T2 - 2012 EGI Community Forum / EMI 2nd Technical Conference, EGICF-EMITC 2012

Y2 - 26 March 2012 through 30 March 2012

ER -
