Supporting data for "TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads."

Dataset type: Genomic, Software, Bioinformatics
Data released on August 12, 2020

Xu M; Guo L; Gu S; Wang O; Zhang R; Peters BA; Fan G; Liu X; Xu X; Deng L; Zhang Y (2020): Supporting data for "TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads." GigaScience Database. http://dx.doi.org/10.5524/100773

DOI: 10.5524/100773

Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. Recently, single-molecule sequencing techniques generating long read information have become available and enabled substantial improvement in contig length and genome completeness, especially for large genomes (>100 Mb), although bioinformatic tools for these applications are still limited.
We developed TGS-GapCloser, a software tool that closes sequence gaps in genome assemblies using low-depth (~10×) long single-molecule reads. The algorithm extracts reads that bridge gap regions between two contigs within a scaffold, error-corrects only the candidate reads, and assigns the best sequence data to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 value of three human genome assemblies by 24-fold on average with only ~10× coverage of Oxford Nanopore or Pacific Biosciences reads, filling up to 94.8% of gaps with sequence data at 97.7% positive predictive value. These improved assemblies achieve 99.998% (Q46) single-base accuracy, with final inserted sequences having 99.97% (Q35) accuracy, despite the high raw error rate of single-molecule reads. This enables high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as that of ginkgo (~12 Gb), TGS-GapCloser is able to cover 71.6% of gaps with sequence data.
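The accuracy figures above are paired with Phred-scaled quality values (Q46, Q35). As a quick check of that arithmetic, the sketch below converts a percent accuracy to a Phred value using the standard definition Q = -10 * log10(error rate). The function name is hypothetical, and truncating to an integer to match the quoted values is an assumption about the rounding convention, not something stated in the record.

```python
import math


def phred_from_accuracy(accuracy_percent: float) -> float:
    """Convert a per-base accuracy (in percent) to a Phred-scaled quality value."""
    error_rate = 1.0 - accuracy_percent / 100.0
    if error_rate <= 0.0:
        raise ValueError("accuracy must be below 100% for a finite Phred value")
    return -10.0 * math.log10(error_rate)


# Accuracy values quoted in the abstract.
for accuracy in (99.998, 99.97):
    q = phred_from_accuracy(accuracy)
    # int() truncates; assumed here to be how the quoted Q values were rounded.
    print(f"{accuracy}% accuracy -> Q{int(q)} (unrounded {q:.2f})")
```

Under this definition, 99.998% accuracy corresponds to an error rate of 2 in 100,000 bases, and 99.97% to 3 in 10,000.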
TGS-GapCloser can close gaps in large genome assemblies using raw long reads in a fast and cost-effective way. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser
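Contiguity in the summary above is reported as scaftig NG50. The following is a minimal sketch of how such a value can be computed, assuming that scaftigs are the gap-free pieces obtained by splitting scaffolds at runs of N, and that NG50 is the length of the scaftig at which the cumulative length, taken from longest to shortest, first reaches half of the estimated genome size. The function names and toy sequences are illustrative only and are not taken from TGS-GapCloser.

```python
import re


def split_scaftigs(scaffold, min_gap=1):
    """Split a scaffold into scaftigs at runs of at least min_gap N characters."""
    pattern = "[Nn]{%d,}" % min_gap
    return [piece for piece in re.split(pattern, scaffold) if piece]


def ng50(lengths, genome_size):
    """Return the length at which cumulative length reaches half the genome size."""
    target = genome_size / 2
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0  # the assembly covers less than half of the genome size


# Toy example: one gapped scaffold before and after its gap is filled.
before = ["ACGTNNNNACGTACGT", "ACGTAC"]
after = ["ACGTACGTACGTACGT", "ACGTAC"]
genome_size = 20  # hypothetical estimated genome size for the toy data

for label, scaffolds in (("before", before), ("after", after)):
    lengths = [len(s) for scaf in scaffolds for s in split_scaftigs(scaf)]
    print(label, "scaftig NG50 =", ng50(lengths, genome_size))
```

In this toy case, filling the single gap merges two scaftigs into one, so the scaftig NG50 rises from 6 to 16.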

Additional details

Read the peer-reviewed publication(s):

(PubMed: 32893860)

Related datasets:

doi:10.5524/100773 Cites doi:10.5524/100613

Additional information:

https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-genome/rel_3_4.md

Github links:

https://github.com/BGI-Qingdao/TGS-GapCloser

Accessions (data generated as part of this study):

CNSA_Project: CNP0000796
BioProject: PRJNA656117





Samples:

Sample ID: Pacbio_Ginkgo biloba
Taxonomic ID: 3311
Common Name: ginkgo
Genbank Name: maidenhair tree
Scientific Name: Ginkgo biloba
Sample Attributes:
Description: DNA extracted from the female seed of ...
Analyte type: DNA
Plant body site: seed [PO:0009010]
...




Files:

Data Type          File Format  Size       Release Date
Sequence assembly  GZIP         15.66 MB   2020-07-22
Sequence assembly  GZIP         15.7 MB    2020-07-22
Sequence assembly  GZIP         16.45 MB   2020-07-22
Sequence assembly  GZIP         13.4 MB    2020-07-22
Sequence assembly  GZIP         16.17 MB   2020-07-22
Sequence reads     GZIP         165.41 MB  2020-07-22
Sequence reads     GZIP         16.14 MB   2020-07-22
Sequence reads     GZIP         331.33 MB  2020-07-22
Sequence reads     GZIP         82.69 MB   2020-07-22
Sequence reads     GZIP         459.6 MB   2020-07-22
Displaying 1-10 of 52 File(s).

Funding:

Funding body                                            Awardee  Award ID              Comments
Shenzhen Municipal Government of China Peacock Plan     Y Zhang  KQTD2015033017150531
National Key Research and Development Program of China  G Fan    2018YFD0900301-05
Qingdao Applied Basic Research Projects                 M Xu     19-6-2-33-cg

History:

Date             Action
August 12, 2020  Dataset publish
August 27, 2020  Manuscript Link added: 10.1093/gigascience/giaa094
October 7, 2022  Manuscript Link updated: 10.1093/gigascience/giaa094