Research output: Contribution to journal › Conference article › peer-review
Challenges in DNA sequence analysis on a production grid. / Van Schaik, Barbera D.C.; Santcroos, Mark; Korkhov, Vladimir; Jongejan, Aldo; Willemsen, Marcel; Van Kampen, Antoine H.C.; Olabarriaga, Sílvia D.
In: Proceedings of Science, Vol. 2012-March, 2012, p. 1-18.
TY - JOUR
T1 - Challenges in DNA sequence analysis on a production grid
AU - Van Schaik, Barbera D.C.
AU - Santcroos, Mark
AU - Korkhov, Vladimir
AU - Jongejan, Aldo
AU - Willemsen, Marcel
AU - Van Kampen, Antoine H.C.
AU - Olabarriaga, Sílvia D.
PY - 2012
Y1 - 2012
AB - Modern DNA sequencing machines produce data in the range of 1-100 GB per experiment, and with ongoing technological developments this amount is rapidly increasing. The majority of experiments involve re-sequencing of human genomes and exomes to find genomic regions that are associated with disease. Many sequence analysis tools are freely available, e.g. for sequence alignment, quality control and variant detection, and new tools are frequently developed to address new biological questions. Since 2008 we have used workflow technology to allow easy incorporation of such software into our data analysis pipelines, as well as to leverage grid infrastructures for the analysis of large datasets in parallel. The size of the datasets has grown from 1 GB to 70 GB in 3 years, so adjustments were needed to optimize these workflows. Procedures have been implemented for faster data transfer to and from grid resources, and for fault recovery at run time. A split-and-merge procedure for a frequently used sequence alignment tool, BWA, resulted in a three-fold reduction of the total time needed to complete an experiment and increased efficiency by reducing the number of failures. The success rate increased from 10% to 70%. In addition, the steps to resubmit partly failed workflows were automated, which reduced the need for user intervention. Here we present our current procedure for analyzing data from DNA sequencing experiments, comment on our experiences, and focus on the improvements needed to scale up the analysis of genomics data at our hospital.
UR - http://www.scopus.com/inward/record.url?scp=84878834394&partnerID=8YFLogxK
M3 - Conference article
VL - 2012-March
SP - 1
EP - 18
JO - Proceedings of Science
JF - Proceedings of Science
SN - 1824-8039
T2 - 2012 EGI Community Forum / EMI 2nd Technical Conference, EGICF-EMITC 2012
Y2 - 26 March 2012 through 30 March 2012
ER -
ID: 5385327