Supporting data for "Libra: scalable k-mer based tool for massive all-vs-all metagenome comparisons."

Dataset type: Metagenomic, Software
Data released on December 17, 2018

Choi I; Ponsero AJ; Bomhoff M; Youens-Clark K; Hartman JH; Hurwitz BL (2018): Supporting data for "Libra: scalable k-mer based tool for massive all-vs-all metagenome comparisons." GigaScience Database.


Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity, and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content.
We developed a tool called Libra that performs all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, cosine similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community.
A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets—such as data reduction, read count normalization, and presence/absence distance metrics—greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
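The idea of comparing samples by k-mer abundance with cosine similarity can be illustrated with a minimal Python sketch. This is not Libra's Hadoop implementation, and it does not reproduce any abundance weighting Libra may apply; the function names, the toy reads, and the choice of k = 3 are assumptions made only for illustration. It shows why cosine similarity is insensitive to sequencing depth: scaling every k-mer count by the same factor leaves the angle between the two abundance vectors unchanged.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every k-length substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse k-mer abundance vectors."""
    # Only k-mers present in both samples contribute to the dot product.
    dot = sum(count * b[kmer] for kmer, count in a.items() if kmer in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy reads (hypothetical data, far shorter than real sequencing reads).
sample_a = ["ACGTACGT", "TTGACA"]
sample_a_deep = sample_a * 10          # same community, 10x the depth
sample_b = ["GGGCCC", "GGCCCA"]        # shares no 3-mers with sample_a

k = 3  # illustrative value only
sim_depth = cosine_similarity(kmer_counts(sample_a, k),
                              kmer_counts(sample_a_deep, k))
sim_diff = cosine_similarity(kmer_counts(sample_a, k),
                             kmer_counts(sample_b, k))
```

Here `sim_depth` is 1 up to floating-point rounding, because the deeper sample has the same relative k-mer abundances, while `sim_diff` is 0 because the two toy samples share no k-mers. A presence/absence metric would discard the counts entirely, which is the loss of resolution the comparison above refers to.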

Additional details

Read the peer-reviewed publication(s):

(PubMed: 30597002)

Accessions (data generated as part of this study):

DOI: 10.7946/MQ0G
BioProject: PRJNA397434
SRA: SRP115095

Data Type        File Format   Size       Release Date
Text             TEXT          6.99 KB    2018-12-17
GitHub archive   archive       5.22 MB    2018-12-11
GitHub archive   archive       13.06 KB   2018-12-11
Readme           TEXT          2.34 KB    2018-12-11

Funding body: National Science Foundation
Awardee: BL Hurwitz
Award ID: 1640775
Comments: Directorate for Computer and Information Science and Engineering

Date                Action
December 17, 2018   Dataset publish
January 7, 2019     Manuscript Link added: 10.1093/gigascience/giy165
November 11, 2022   Manuscript Link updated: 10.1093/gigascience/giy165