Supporting data for "Binning unassembled short reads based on k-mer covariance using sparse coding"

Dataset type: Metagenomic, Software, Bioinformatics
Data released on March 09, 2020

Kyrgyzov O; Prost V; Gazut S; Farcy B; Brüls T (2020): Supporting data for "Binning unassembled short reads based on k-mer covariance using sparse coding" GigaScience Database.


Sequence binning techniques enable the recovery of a growing number of genomes from complex microbial metagenomes and typically require prior metagenome assembly, incurring the computational cost and drawbacks of the latter, e.g. biases against low-abundance genomes and inability to conveniently assemble multi-terabyte datasets.
We present here a scalable pre-assembly binning scheme (i.e. one operating on unassembled short reads) that recovers latent genomes by leveraging sparse dictionary learning and elastic-net regularization. We use it to recover hundreds of metagenome-assembled genomes, including very low-abundance genomes, from a joint analysis of microbiomes from the LifeLines-Deep population cohort (n=1135, > 10^10 reads).
We showed that sparse coding techniques can be leveraged to carry out read-level binning at large scale, and that despite lower genome reconstruction yields compared to assembly-based approaches, bin-first strategies can complement the more widely used assembly-first protocols by targeting distinct genome segregation profiles. Read enrichment levels across six orders of magnitude in relative abundance were observed, indicating that the method is able to recover genomes consistently segregating at low levels.
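
As a rough illustration of the elastic-net-regularized sparse coding mentioned above, the following minimal sketch encodes one abundance profile over a small fixed dictionary using coordinate descent. It is not the authors' pipeline: the dictionary is given rather than learned, the function names, penalty values and toy profiles are all hypothetical, and a real analysis would operate on k-mer abundance data at a vastly larger scale.

```python
# Illustrative sketch only: elastic-net sparse coding by coordinate descent.
# All names, penalties and toy values are assumptions, not taken from the paper.

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal step of the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_code(signal, atoms, l1=0.1, l2=0.1, n_iter=100):
    """Approximately minimise over w:
        0.5 * ||signal - sum_j w[j] * atoms[j]||^2 + l1 * ||w||_1 + 0.5 * l2 * ||w||^2
    """
    weights = [0.0] * len(atoms)
    residual = list(signal)  # signal minus the current reconstruction
    for _ in range(n_iter):
        for j, atom in enumerate(atoms):
            norm_sq = sum(a * a for a in atom)
            if norm_sq == 0.0:
                continue
            # Correlation of the atom with the residual, with atom j's own
            # current contribution added back in.
            rho = sum(a * r for a, r in zip(atom, residual)) + norm_sq * weights[j]
            new_weight = soft_threshold(rho, l1) / (norm_sq + l2)
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - delta * a for r, a in zip(residual, atom)]
                weights[j] = new_weight
    return weights


# Hypothetical toy data: two dictionary atoms, each an abundance profile
# across four samples, and one observed profile to encode.
atoms = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
]
signal = [2.0, 0.0, 4.0, 0.0]

weights = elastic_net_code(signal, atoms)
# Assign the profile to the atom with the largest absolute weight.
bin_index = max(range(len(weights)), key=lambda j: abs(weights[j]))
print(weights, bin_index)
```

In this toy case the observed profile is a multiple of the first atom, so the code concentrates on that atom and the second weight is driven to zero by the L1 term, while the L2 term shrinks the non-zero weight slightly.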

Additional details

Read the peer-reviewed publication(s):

(PubMed: 32219339)

Data Type       File Format  Size       Release Date
Text            TEXT         0.64 KB    2020-03-12
Annotation      TAR          8.69 MB    2020-03-05
Annotation      TAR          4.53 MB    2020-03-05
GitHub archive  archive      17.01 MB   2020-03-05
GitHub archive  TAR          101.12 KB  2020-03-05
Annotation      TAR          77.08 MB   2020-03-05
Readme          TEXT         3.58 KB    2020-03-09
Text            TEXT         0.69 KB    2020-03-05
Annotation      TAR          16.53 MB   2020-03-05
Annotation      TAR          80.94 GB   2020-03-12
Displaying 1-10 of 11 File(s).

Funding body              Awardee   Award ID   Comments
Investissements d'Avenir  T Brüls   FSN-CISN2  ADAMme

Date             Action
March 9, 2020    Dataset publish
March 17, 2020   Manuscript Link added: 10.1093/gigascience/giaa028
October 7, 2022  Manuscript Link updated: 10.1093/gigascience/giaa028