Software and supporting data for "Fast-SG: An alignment-free algorithm for hybrid assembly"

Dataset type: Software, Genomic
Data released on April 24, 2018

Di Genova A; Ruz GA; Sagot MF; Maass A (2018): Software and supporting data for "Fast-SG: An alignment-free algorithm for hybrid assembly". GigaScience Database. http://dx.doi.org/10.5524/100437

DOI: 10.5524/100437

Long read sequencing technologies are the ultimate solution for genome repeats, allowing near reference level reconstructions of large genomes. However, long read de novo assembly pipelines are computationally intensive and require a considerable amount of coverage, thereby hindering their broad application to the assembly of large genomes. Alternatively, hybrid assembly methods that combine short and long read sequencing technologies can reduce the time and cost required to produce de novo assemblies of large genomes. Here, we propose a new method, called Fast-SG, which uses a new ultra-fast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures. Fast-SG can construct the graph from either short or long reads. This allows the reuse of efficient algorithms designed for short read data and permits the definition of novel modular hybrid assembly pipelines. Using comprehensive standard datasets and benchmarks, we show how Fast-SG outperforms the state-of-the-art short read aligners when building the scaffolding graph, and can be used to extract linking information from either raw or error-corrected long reads. We also show how a hybrid assembly approach using Fast-SG with shallow long read coverage (5X) and moderate computational resources can produce long-range and accurate reconstructions of the genomes of Arabidopsis thaliana (Ler-0) and human (NA12878).
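The description above does not spell out how the scaffolding graph is built, so the following is only a generic sketch of the alignment-free idea, not Fast-SG's actual implementation. It assumes that each read of a pair can be placed on a contig by exact lookup of a k-mer that occurs once across all contigs, and that two contigs are linked whenever the two reads of a pair land on different contigs. The function names (`unique_kmer_index`, `place`, `scaffolding_edges`), the plain Python dictionary used as the index, and the toy sequences are all hypothetical; strand, orientation and insert-size estimation are deliberately ignored.

```python
from collections import defaultdict


def unique_kmer_index(contigs, k):
    """Map every k-mer that occurs exactly once across all contigs to (contig, offset)."""
    seen = {}
    repeated = set()
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in repeated:
                continue
            if kmer in seen:
                # Second occurrence: the k-mer is not unique, so drop it.
                del seen[kmer]
                repeated.add(kmer)
            else:
                seen[kmer] = (name, i)
    return seen


def place(read, index, k):
    """Return the contig hit by the first unique k-mer found in the read, or None."""
    for i in range(len(read) - k + 1):
        hit = index.get(read[i:i + k])
        if hit is not None:
            return hit[0]
    return None


def scaffolding_edges(contigs, read_pairs, k=5):
    """Count read pairs whose two ends land on different contigs (no alignment step)."""
    index = unique_kmer_index(contigs, k)
    edges = defaultdict(int)
    for left, right in read_pairs:
        a = place(left, index, k)
        b = place(right, index, k)
        if a is not None and b is not None and a != b:
            edges[tuple(sorted((a, b)))] += 1
    return dict(edges)


# Toy data (hypothetical): two contigs and three read pairs.
contigs = {"c1": "ACGTTGCAAGGCT", "c2": "TTAGGCCATACGA"}
read_pairs = [
    ("GTTGCAA", "GCCATAC"),  # ends on c1 and c2: supports a link
    ("CGTTGC", "AGGCCA"),    # ends on c1 and c2: supports a link
    ("ACGTT", "AAGGCT"),     # both ends on c1: no link
]
print(scaffolding_edges(contigs, read_pairs, k=5))
```

In this sketch the edge weight is simply the number of supporting pairs. A real scaffolder would also need read orientation and distance estimates before ordering contigs.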

Additional details

Read the peer-reviewed publication(s):

Di Genova, A., Ruz, G. A., Sagot, M.-F., & Maass, A. (2018). Fast-SG: an alignment-free algorithm for hybrid assembly. GigaScience, 7(5). doi:10.1093/gigascience/giy048

Additional information:

https://github.com/adigenova/fast-sg





Sample ID | Taxonomic ID | Common Name | Genbank Name | Scientific Name | Sample Attributes
Human Chromosome 14 | 9606 | Human | human | Homo sapiens | Description: Representative sample of Homo sapiens ...
    Alternative accession-SRA File: ERR163027
    Sequencing method: Illumina
    ...
NA12878 | 9606 | Human | human | Homo sapiens | Description: Representative sample of Homo sapiens sequences used for demonstration of sequence assembly tools
    Sequencing method: ONT
    Relevant electronic resources: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md, ftp://ftp.broadinstitute.org/pub/crd/Discovar/assemblies/51400.newchem/a.lines.fasta
A. thaliana (Ler-0) | 3702 | mouse-ear cress | thale cress | Arabidopsis thaliana | Description: Representative sample of Arabidopsis thaliana sequences used for demonstration of sequence assembly tools
    Sequencing method: Illumina, PacBio
    Relevant electronic resources: http://labshare.cshl.edu/shares/schatzlab/www-data/ectools/arabidopsis/Illumina_2x300_R1.fastq.gz, http://labshare.cshl.edu/shares/schatzlab/www-data/ectools/arabidopsis/Illumina_2x300_R2.fastq.gz, https://downloads.pacbcloud.com/public/SequelData/ArabidopsisDemoData/SequenceData/1_A01_customer/m54113_160913_184949.subreads.bam, http://gembox.cbcb.umd.edu/shared/canu/asm/canu/
S_aureus | 1280 | | | Staphylococcus aureus | Description: Representative sample of Staphylococcu...
    Alternative accession-SRA File: SRR022865
    Sequencing method: Illumina
    ...
S. cerevisiae W303 | 580240 | | | Saccharomyces cerevisiae W303 | Description: Representative sample of Saccharomyces cerevisiae W303 sequences used for demonstration of sequence assembly tools
    Sequencing method: PacBio
    Relevant electronic resources: http://labshare.cshl.edu/shares/schatzlab/www-data/ectools/w303/Pacbio.fasta.gz, http://labshare.cshl.edu/shares/schatzlab/www-data/ectools/w303/w303_illumina.fa.gz
E.coli K-12 | 83333 | | | Escherichia coli K-12 | Description: Representative sample of Escherichia coli K-12 sequences used for demonstration of sequence assembly tools
    Sequencing method: Illumina, PacBio, ONT
    Relevant electronic resources: http://labshare.cshl.edu/shares/schatzlab/www-data/nanocorr/2015.07.07/Ecoli_S1_L001_R1_001.fastq.gz, https://s3.amazonaws.com/files.pacb.com/datasets/secondary-analysis/e-coli-k12-8plex/Ecoli_8plex_demo.barcoded.subreads.bam, https://s3.climb.ac.uk/nanopore/E_coli_K12_1D_R9.2_SpotON_2.pass.fasta, http://labshare.cshl.edu/shares/schatzlab/www-data/ectools/ecoli/ecoli_illumina.fa.gz
R_sphaeroides | 1063 | | | Rhodobacter sphaeroides | Description: Representative sample of Rhodobacter s...
    Alternative accession-SRA File: SRR034528
    Sequencing method: Illumina
    ...
P_falciparum | 36329 | | | Plasmodium falciparum 3D7 | Description: Representative sample of Plasmodium fa...
    Alternative accession-SRA File: ERR034295, ERR16302...
    Sequencing method: Illumina
    ...




File Name | Sample ID | Data Type | File Format | Size | Release Date
A. thaliana (Ler-0) | Mixed archive | TAR | 615.47 MB | 2018-04-24
E.coli K-12 | Mixed archive | TAR | 10.41 MB | 2018-04-24
Software | TAR | 13.95 MB | 2018-03-30
Human Chromosome 14 | Mixed archive | TAR | 6.29 GB | 2018-04-24
MD5sum | TEXT | 0.51 KB | 2018-04-24
NA12878 | Mixed archive | TAR | 751.11 MB | 2018-04-24
P_falciparum | Mixed archive | TAR | 1.57 GB | 2018-04-24
readme.txt | TEXT | 2.7 KB | 2018-04-24
R_sphaeroides | Mixed archive | TAR | 307.36 MB | 2018-04-24
S_aureus | Mixed archive | TAR | 68.09 MB | 2018-04-24
Displaying 1-10 of 11 File(s).
Date Action
April 24, 2018 Dataset publish
April 24, 2018 Description updated
July 9, 2018 Manuscript Link added : 10.1093/gigascience/giy048