Research output: Contribution to journal › Conference article › peer-review
Modern DNA sequencing machines produce data in the range of 1-100 GB per experiment, and with ongoing technological developments this amount is rapidly increasing. The majority of experiments involve re-sequencing of human genomes and exomes to find genomic regions that are associated with disease. Many sequence analysis tools are freely available, e.g. for sequence alignment, quality control and variant detection, and new tools are frequently developed to address new biological questions. Since 2008 we have used workflow technology to allow easy incorporation of such software into our data analysis pipelines, as well as to leverage grid infrastructures for the analysis of large datasets in parallel. The size of the datasets has grown from 1 GB to 70 GB in three years, so adjustments were needed to optimize these workflows. Procedures have been implemented for faster data transfer to and from grid resources and for fault recovery at run time. A split-and-merge procedure for a frequently used sequence alignment tool, BWA, resulted in a three-fold reduction of the total time needed to complete an experiment and increased efficiency by reducing the number of failures: the success rate increased from 10% to 70%. In addition, resubmission of partly failed workflows was automated, which reduced the need for user intervention. Here we present our current procedure for analyzing data from DNA sequencing experiments, comment on our experiences, and focus on the improvements needed to scale up the analysis of genomics data at our hospital.
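The split-and-merge idea described above can be illustrated with a minimal sketch. The snippet below splits a FASTQ file into chunks, aligns each chunk with BWA in parallel and merges the resulting BAM files. It is not the pipeline from the paper: it runs on a single multi-core machine rather than on grid middleware, and the file names, chunk count and tool invocations (`bwa aln`/`bwa samse`, `samtools sort`, `samtools merge`) are illustrative assumptions.

```python
"""Minimal sketch of split-and-merge alignment with BWA.

Assumptions: a locally installed bwa and samtools (>= 1.3 sort syntax),
a pre-indexed reference, and a single multi-core machine instead of the
grid infrastructure described in the abstract.
"""

import subprocess
from concurrent.futures import ProcessPoolExecutor

REFERENCE = "reference.fa"   # assumed to be indexed with `bwa index`
READS = "sample.fastq"       # hypothetical input FASTQ file
N_CHUNKS = 8                 # number of pieces to align in parallel


def split_fastq(fastq, n_chunks):
    """Distribute FASTQ records (4 lines each) round-robin over n_chunks files."""
    names = ["chunk_%d.fastq" % i for i in range(n_chunks)]
    outs = [open(name, "w") for name in names]
    with open(fastq) as fh:
        record, i = [], 0
        for line in fh:
            record.append(line)
            if len(record) == 4:
                outs[i % n_chunks].writelines(record)
                record, i = [], i + 1
    for out in outs:
        out.close()
    return names


def align_chunk(chunk):
    """Align one chunk with BWA (aln/samse, as used around 2012) and
    produce a coordinate-sorted BAM; exact flags depend on the tool versions."""
    sai, bam = chunk + ".sai", chunk + ".sorted.bam"
    with open(sai, "wb") as fh:
        subprocess.check_call(["bwa", "aln", REFERENCE, chunk], stdout=fh)
    samse = subprocess.Popen(["bwa", "samse", REFERENCE, sai, chunk],
                             stdout=subprocess.PIPE)
    # Read the SAM stream from bwa samse and write a sorted BAM.
    subprocess.check_call(["samtools", "sort", "-o", bam, "-"],
                          stdin=samse.stdout)
    samse.wait()
    return bam


def main():
    chunks = split_fastq(READS, N_CHUNKS)
    # Align the chunks in parallel; on a grid, each chunk would be its own job.
    with ProcessPoolExecutor(max_workers=N_CHUNKS) as pool:
        bams = list(pool.map(align_chunk, chunks))
    # Merge the per-chunk BAM files into a single result.
    subprocess.check_call(["samtools", "merge", "-f", "merged.bam"] + bams)


if __name__ == "__main__":
    main()
```

On a grid, each `align_chunk` call would instead be submitted as a separate job, with the merge step run once all chunk jobs have completed; failed chunks can then be resubmitted individually, which is where the reported gain in success rate comes from.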
| Original language | English |
|---|---|
| Pages (from-to) | 1-18 |
| Journal | Proceedings of Science |
| Volume | 2012-March |
| State | Published - 2012 |
| Event | 2012 EGI Community Forum / EMI 2nd Technical Conference, EGICF-EMITC 2012, Munich, Germany, 26 Mar 2012 → 30 Mar 2012 |