Supporting data for "TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads."
Dataset type: Genomic, Software, Bioinformatics
Data released on August 12, 2020
Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. Recently, single molecule sequencing techniques generating long read information have become available and enabled substantial improvement in contig length and genome completeness, especially for large genomes (>100Mb), although bioinformatic tools for these applications are still limited.
We developed TGS-GapCloser, a software tool that closes sequence gaps in genome assemblies using low-depth (~10×) long single-molecule reads. The algorithm extracts reads that bridge gap regions between two contigs within a scaffold, error-corrects only the candidate reads, and assigns the best sequence data to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 value of three human genome assemblies by 24-fold on average with only ~10× coverage of Oxford Nanopore or Pacific Biosciences reads, filling up to 94.8% of gaps with sequence data at 97.7% positive predictive value. These improved assemblies achieve 99.998% (Q46) single-base accuracy, with final inserted sequences having 99.97% (Q35) accuracy despite the high raw error rate of single-molecule reads. This enables high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as that of the ginkgo (~12Gb), TGS-GapCloser is able to cover 71.6% of gaps with sequence data.
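To make the contiguity metric concrete, the following is a minimal sketch of how a scaftig NG50 can be computed. It is an illustration only and not code from TGS-GapCloser. It assumes that scaftigs are the gap-free pieces obtained by splitting scaffolds at runs of N, and that NG50 is the piece length at which the cumulative length of pieces, taken from longest to shortest, first reaches half of an estimated genome size. The function names, the toy sequences, and the genome size of 30 are hypothetical.

```python
import re


def split_scaftigs(scaffold, min_gap=1):
    """Split a scaffold at runs of N (gaps) and return the gap-free pieces."""
    pattern = "[Nn]{%d,}" % min_gap
    return [piece for piece in re.split(pattern, scaffold) if piece]


def ng50(lengths, genome_size):
    """Length L such that pieces of length >= L cover at least half of genome_size.

    Returns 0 if the pieces never reach half of the genome size.
    """
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0


# Toy assembly (hypothetical): two scaffolds, each containing one gap.
scaffolds = ["ACGTACGTNNNNACGT", "ACGTACNNACGTACGTAC"]
before = [len(p) for s in scaffolds for p in split_scaftigs(s)]

# Pretend the first gap was filled with four bases taken from a long read.
closed = ["ACGTACGTTTTTACGT", "ACGTACNNACGTACGTAC"]
after = [len(p) for s in closed for p in split_scaftigs(s)]

genome_size = 30  # assumed estimate, for illustration only
print(ng50(before, genome_size))  # 8
print(ng50(after, genome_size))   # 16
```

In this toy case, closing a single gap merges two scaftigs into one and doubles the scaftig NG50, which is the kind of effect the fold-increase figures above summarize.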
TGS-GapCloser can close gaps in large genome assemblies using raw long reads in a fast and cost-effective way. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser
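The accuracy percentages quoted above are paired with Q values. As a quick consistency check, the sketch below applies the standard Phred relation Q = -10 · log10(error rate). The function name is hypothetical, and because the reported percentages are themselves rounded, the results should be read as approximate rather than as the authors' exact calculation.

```python
import math


def phred_quality(accuracy_percent):
    """Convert a per-base accuracy in percent to a Phred-scaled quality value."""
    error_rate = 1.0 - accuracy_percent / 100.0
    return -10.0 * math.log10(error_rate)


for accuracy in (99.998, 99.97):
    print(f"{accuracy}% -> Q{phred_quality(accuracy):.1f}")
# 99.998% -> Q47.0
# 99.97% -> Q35.2
```

Rounded down to whole numbers, these values match the Q46 and Q35 figures given in the abstract.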
Additional details
Read the peer-reviewed publication(s):
(PubMed: 32893860)
Related datasets:
doi:10.5524/100773 Cites doi:10.5524/100613
Additional information:
https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-genome/rel_3_4.md
Github links:
https://github.com/BGI-Qingdao/TGS-GapCloser
Accessions (data generated as part of this study):
CNSA_Project:
CNP0000796
BioProject:
PRJNA656117
Sample ID | Taxonomic ID | Common Name | Genbank Name | Scientific Name | Sample Attributes |
---|---|---|---|---|---|
Pacbio_Ginkgo biloba | 3311 | ginkgo | maidenhair tree | Ginkgo biloba | Description: DNA extracted from the female seed of ... Analyte type: DNA Plant body site: seed [PO:0009010] ... |