Supporting data for "Genomic diversity affects the accuracy of bacterial SNP calling pipelines"

Dataset type: Genomic, Software, Workflow
Data released on January 13, 2020

Bush SJ; Foster D; Eyre DW; Clark EL; De Maio N; Shaw LP; Stoesser N; Peto TEA; Crook DW; Walker AS (2020): Supporting data for "Genomic diversity affects the accuracy of bacterial SNP calling pipelines". GigaScience Database.


Accurately identifying SNPs from bacterial sequencing data is essential for using genomics to track transmission and to predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, and this choice, together with the bioinformatic pipeline, affects the accuracy and completeness of the resulting set of SNP calls.
This study evaluates the performance of 209 SNP calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia and Klebsiella.
Each pipeline was evaluated by aligning reads to the genome of either the same strain or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, both the sensitivity and the precision of a pipeline showed a strong inverse relationship with the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli, but less dominant for clonal species such as Mycobacterium tuberculosis.
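To make the Mash distance mentioned above more concrete, the following is a minimal sketch of the published Mash distance formula, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the two k-mer sets. It is an illustration under simplifying assumptions, not the procedure used to produce this dataset: it computes j from exact k-mer sets rather than MinHash sketches, ignores reverse complements, and the function names and the default k are hypothetical choices.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in seq (exact sets, no MinHash sketching)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(seq_a, seq_b, k=21):
    """Estimate the Mash distance between two sequences.

    Simplification: the Jaccard index is computed on exact k-mer sets and
    reverse complements are ignored, unlike the real Mash tool.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    j = len(a & b) / len(union)
    if j == 0:
        return 1.0  # no shared k-mers: distance is capped at 1
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

A larger distance means fewer shared k-mers, which is the sense in which the distance acts as a proxy for nucleotide divergence between a read set and a candidate reference.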
The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, Novoalign/GATK was among the highest-performing pipelines. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often employed the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup or Strelka.
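As an illustration of the two performance measures named in the description, the sketch below scores a set of called SNPs against a known truth set. It uses the standard definitions of precision, TP / (TP + FP), and sensitivity, TP / (TP + FN). The representation of a SNP as a (contig, position, alternative base) tuple and the function name are assumptions made for this example, not the format of the files in this dataset.

```python
def score_calls(called, truth):
    """Return (precision, sensitivity) for a SNP call set against a truth set.

    Each SNP is a hashable key; a hypothetical (contig, position, alt_base)
    tuple is assumed here.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and true
    fp = len(called - truth)   # called but not true
    fn = len(truth - called)   # true but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


# Toy example: two correct calls, one false call, one missed SNP.
truth = {("contig1", 10, "A"), ("contig1", 20, "T"), ("contig1", 30, "G")}
called = {("contig1", 10, "A"), ("contig1", 20, "T"), ("contig1", 40, "C")}
precision, sensitivity = score_calls(called, truth)
```

With simulated reads the truth set is known by construction, which is what allows both measures to be computed exactly for each pipeline and reference genome.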

Additional details

Read the peer-reviewed publication(s): (PubMed: 32025702)

Data Type | File Format | Size | Release Date
Mixed archive | TAR | 999.23 MB | 2019-12-26
Text | UNKNOWN | 66.05 KB | 2019-12-26
Text | UNKNOWN | 251.1 KB | 2019-12-26
Text | UNKNOWN | 131.78 KB | 2019-12-26
Text | UNKNOWN | 702.9 KB | 2019-12-26
Text | UNKNOWN | 266.63 KB | 2019-12-26
Tabular data | EXCEL | 101.75 KB | 2019-12-26
Genome sequence | TAR | 1.08 GB | 2019-12-26
Genome sequence | TAR | 1.07 GB | 2019-12-26
Genome sequence | TAR | 82.69 MB | 2019-12-26
Displaying 1-10 of 13 File(s).
Funding body | Awardee | Award ID | Comments
National Institute for Health Research | | HPRU-2012-10041 | Health Protection Research Unit
The Antimicrobial Resistance Funders’ Forum (AMRFF) | LP Shaw | NE/N019989/1 | Antimicrobial Resistance Cross Council Initiative
Biotechnology and Biological Sciences Research Council | | BB/P013740/1 |
Date | Action
January 13, 2020 | Dataset publish
January 24, 2020 | Manuscript Link added: 10.1093/gigascience/giaa007
February 7, 2020 | Funder updated: Biotechnology and Biological Sciences Research Council
October 7, 2022 | Manuscript Link updated: 10.1093/gigascience/giaa007