Supporting data for "Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes"

Dataset type: Genomic, Transcriptomic, Software
Data released on December 03, 2018

Johnson LK; Alexander H; Brown CT (2018): Supporting data for "Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes." GigaScience Database.


De novo transcriptome assemblies are required prior to analyzing RNA-seq data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or 'pipelines', on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short-read data collected by the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). Transcriptome assemblies generated through this pipeline were evaluated and compared against assemblies that were previously generated with a pipeline developed by the National Center for Genome Resources (NCGR). New transcriptome assemblies contained 70% of the previous contigs as well as new content. On average, 7.8% of the annotated contigs in the new assemblies had novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics, with assemblies from the Dinoflagellata and Ciliophora phyla showing a higher percentage of open reading frames and a higher number of contigs than transcriptomes from other phyla. Given current bioinformatics approaches, there is no single 'best' reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples with revised pipelines. Moreover, automated and programmable pipelines facilitate the comparison of diverse sets of data by ensuring that a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples, in addition to novel and useful products for the community.

Additional details

Read the peer-reviewed publication(s):

Johnson, L. K., Alexander, H., & Brown, C. T. (2018). Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. GigaScience. doi:10.1093/gigascience/giy158

Additional information:

Accessions (data generated as part of this study):

BioProject: PRJNA231566

File Name    Sample ID    Data Type         File Format    Size         Release Date
-            -            GitHub archive    archive        105.52 KB    2018-10-29
-            -            readme            TEXT           2.64 KB      2018-10-29

Funding body                         Awardee     Award ID    Comments
Gordon and Betty Moore Foundation    CT Brown    GBMF4551    -

Date                Action
December 3, 2018    Dataset publish