Supporting data for "Optimized Distributed Systems Achieve Significant Performance Improvement on Sorted Merging of Massive VCF Files"

Dataset type: Workflow, Software, Genome-Mapping
Data released on April 03, 2018

Sun X; Gao J; Jin P; Eng C; Burchard EG; Beaty TH; Ruczinski I; Mathias RA; Barnes KC; Wang F; Qin ZS; CAAPA Consortium (2018): Supporting data for "Optimized Distributed Systems Achieve Significant Performance Improvement on Sorted Merging of Massive VCF Files" GigaScience Database.


Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by genomic locations. In particular, merging a large number of Variant Call Format (VCF) files are frequently encountered in large scale whole genome sequencing or whole exome sequencing projects. Traditional single machine based methods become increasingly inefficient when processing hundreds or even thousands of VCF files due to the excessive computation time and I/O bottleneck. The distributed systems and the more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized working flow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.
In this study, we custom design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase and Spark, to perform sorted merging of large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks which are conquered in an ordered, parallel and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI) based high performance computing (HPC) implementation and the popular VCFTools.
Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or render much better strong/weak scalabilities over traditional methods. We believe that our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems. Please note that these files are also available from the AWS S3 at

Additional details

Read the peer-reviewed publication(s):

Sun, X., Gao, J., Jin, P., Eng, C., Burchard, E. G., … Beaty, T. H. (2018). Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files. GigaScience, 7(6). doi:10.1093/gigascience/giy052

Additional information:

File NameSample IDData TypeFile FormatSizeRelease Date 
mixed archivearchive35.36 MB2018-03-02
TextTAR3.37 GB2018-03-02
ReadmeTEXT2.75 KB2018-03-02
TextTAR2.81 GB2018-03-02
Displaying 1-4 of 4 File(s).
Funding body Awardee Award ID Comments
National Institutes of Health Kathleen Barnes R01 HL104608
National Institutes of Health Peng Jin R01 NS051630
National Institutes of Health Peng Jin P01 NS097206
National Institutes of Health Peng Jin U54 NS091859
National Science Foundation Fusheng Wang ACI 1443054
National Science Foundation Fusheng Wang IIS 1350885
Date Action
May 2, 2018 Dataset publish
July 4, 2018 Manuscript Link added : 10.1093/gigascience/giy052