Data released on October 20, 2015

Software and exemplar data for Rcorrector.

Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads because of the variation in gene expression levels and alternative splicing.
We developed a k-mer-based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which employ a single global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read.
The software as published is available directly from here, but for the most up-to-date version please see the project GitHub repository.
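
The contrast drawn above between one global k-mer threshold and a threshold computed at each read position can be illustrated with a small sketch. This is not Rcorrector's implementation: the function names, the `fraction` parameter, and the rule itself (a k-mer is trusted if its count is at least a fixed fraction of the best-supported k-mer sharing its first k-1 bases) are assumptions chosen only for illustration, and a plain counter stands in for the De Bruijn graph.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads (a stand-in for the k-mer graph)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def global_trusted(read, counts, k, threshold):
    """Flag each k-mer of the read against one fixed, global count threshold."""
    return [counts[read[i:i + k]] >= threshold
            for i in range(len(read) - k + 1)]

def local_trusted(read, counts, k, fraction=0.5):
    """Flag each k-mer of the read against a threshold computed at that position.

    ASSUMPTION (illustrative rule, not Rcorrector's actual formula): the
    threshold at a position is `fraction` times the highest count among the
    k-mers that share the same (k-1)-base prefix, i.e. the competing branches.
    """
    flags = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        own = counts[kmer]
        best = max(counts[kmer[:-1] + base] for base in "ACGT")
        flags.append(own > 0 and own >= fraction * max(best, own))
    return flags

# Toy data: a highly expressed transcript (10 reads), one copy of it with a
# single substitution error, and a lowly expressed transcript (2 reads).
reads = ["ACGTTGCA"] * 10 + ["ACGTAGCA"] + ["GGATCCTA"] * 2
counts = count_kmers(reads, 4)

print(global_trusted("GGATCCTA", counts, 4, threshold=3))
print(local_trusted("GGATCCTA", counts, 4))
print(local_trusted("ACGTAGCA", counts, 4))
```

On this toy input, a global threshold of 3 rejects every k-mer of the lowly expressed read, while the position-wise rule keeps them because nothing better supported competes with them. In the erroneous read, the position-wise rule flags only the k-mer where the error branches away from the well-supported path.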

Keywords: Next generation sequencing, RNA-seq, Error correction, k-mers, Rcorrector

  • Funding body - National Science Foundation
  • Award ID - ABI-1159078
  • Comment - Liliana Florea
  • Funding body - National Science Foundation
  • Award ID - ABI-1356078
  • Comment - Liliana Florea

Sample ID | Taxonomic ID | Common Name | Genbank Name | Scientific Name | Sample Attributes
Geuvadis | 9606 | Human | human | Homo sapiens | Description: a lymphoblastoid cell line sequenced as part of the GEUVADIS population variation project, used in Rcorrector assessment; Alternative names: NA20508; Alternative accession-SRA_file: ERR188021
Lung | 9606 | Human | human | Homo sapiens | Description: a lung cancer cell line (HCC827/R2) used in Rcorrector assessment; Alternative names: HCC827/R2; Alternative accession-SRA_file: SRR1062943
Peach | 3760 |  | peach | Prunus persica | Description: plant RNA-seq data used in Rcorrector assessment; Alternative accession-SRA_file: SRR531865
Simulated | 9606 | Human | human | Homo sapiens | Description: 100 million x 100 bp long paired-end reads were generated with FluxSimulator starting from the human GENCODE v.17 gene annotations. Errors were subsequently introduced with Mason.
Single-cell | 511145 | E.coli |  |  | Description: E. coli K-12, strain MG1655, single-cell sequencing based on the MDA (multiple displacement amplification) method; contains 29,124,078 100 bp reads
Relevant electronic resources:
File Name | Sample ID | File Type | File Format | Size | Release Date
 | Single-cell | Text | TEXT | -0 KB | 2015-09-04
 |  | Software | zip | 978.12 KB | 2015-09-04
 |  | Readme | TEXT | -0 KB | 2015-09-04
 | Simulated | Transcriptome sequence | FASTQ | -0 KB | 2015-09-04
 | Simulated | Transcriptome sequence | FASTQ | -0 KB | 2015-09-04
