Supporting data for "Transcriptome annotation in the cloud: complexity, best practices and cost."

Dataset type: Transcriptomic, Bioinformatics
Data released on December 21, 2020

Vera-Alvarez R; Mariño-Ramírez L; Landsman D (2020): Supporting data for "Transcriptome annotation in the cloud: complexity, best practices and cost." GigaScience Database. http://dx.doi.org/10.5524/100847

DOI: 10.5524/100847

The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative provides NIH-funded researchers cost-effective access to industry-leading commercial cloud providers, such as Amazon Web Services (AWS; Seattle, WA, USA) and Google Cloud Platform (GCP; Mountain View, CA, USA). These cloud providers represent an alternative for the execution of large computational biology experiments like transcriptome annotation, which is a complex analytical process that requires the integration of multiple biological databases and several advanced computational tools. The core components of annotation pipelines published since 2012 are BLAST sequence alignments using annotated databases of both nucleotide and protein sequences, run almost exclusively on networked on-premises compute systems.
We present a comparative study of multiple BLAST sequence alignments using two public cloud providers: AWS and GCP. We have prepared several Jupyter Notebooks with all the code required to submit computing jobs to the batch system on each cloud provider. We consider the effect of the number of query transcripts per input file on cost and processing time. We tested compute instances with 16, 32 and 64 vCPUs on each cloud provider. Four classes of timing results were collected: the total run time, the time for transferring the BLAST databases to the instance local solid state disk drive (SSD), the time to execute the Common Workflow Language (CWL) script and the time for the creation, setup and release of an instance. This study aims to establish an estimate of the cost and compute time needed for the execution of multiple BLAST runs in a cloud environment.
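As an illustration of how the number of query transcripts per input file can be varied, the following minimal Python sketch splits a FASTA file into chunks holding a fixed number of records. It is not taken from the study's notebooks; the function name `split_fasta` and the parameter `records_per_chunk` are hypothetical, and the input is assumed to be plain FASTA text with `>` header lines.

```python
from typing import Iterable, Iterator, List


def split_fasta(lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTA lines, each with at most records_per_chunk records.

    Hypothetical helper for illustration only; not the study's code.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk: List[str] = []
    count = 0  # number of records (header lines) in the current chunk
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; emit the current chunk if it is full.
            if count == records_per_chunk:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk  # final, possibly partial, chunk


# Small in-memory example: three transcripts, two per chunk.
demo = [">t1", "ACGT", ">t2", "GG", "CC", ">t3", "TT"]
demo_chunks = list(split_fasta(demo, 2))
```

Each chunk could then be written to its own file and submitted as a separate batch job, so that the chunk size becomes the variable under test.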
We demonstrate that the public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost: ~500,000 transcripts can be processed in less than 2 hours with a compute cost of about 200-250 USD. In our opinion, for BLAST-based workflows, the choice of cloud platform is not dependent on the workflow but, rather, on the specific details and requirements of the cloud provider (e.g. NCBI maintains updated copies of the very large genetic sequence databases, such as nr, RefSeq and SRA, on both GCP and AWS). These choices include the accessibility for institutional use, the technical knowledge required for effective use of the platform services, and the availability of open-source frameworks such as application programming interfaces (APIs) to deploy the workflow.
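The figures quoted above imply a rough per-transcript cost, which the following Python sketch works out. It is back-of-the-envelope arithmetic on the approximate numbers in the abstract (about 500,000 transcripts, 200-250 USD), not a result reported by the study; the function name `cost_per_thousand` is hypothetical.

```python
def cost_per_thousand(total_cost_usd: float, n_transcripts: int) -> float:
    """Return the compute cost in USD per 1,000 transcripts."""
    if n_transcripts <= 0:
        raise ValueError("n_transcripts must be positive")
    return total_cost_usd / n_transcripts * 1000


# Approximate figures taken from the abstract above.
N_TRANSCRIPTS = 500_000
low = cost_per_thousand(200, N_TRANSCRIPTS)   # lower bound of the quoted range
high = cost_per_thousand(250, N_TRANSCRIPTS)  # upper bound of the quoted range

print(f"{low:.2f}-{high:.2f} USD per 1,000 transcripts")
```

Under these assumptions the compute cost comes to roughly 0.40-0.50 USD per 1,000 transcripts.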

Additional details

Read the peer-reviewed publication(s):

(PubMed: 33511996)

GitHub links:

https://github.com/ncbi/cloud-transcriptome-annotation

Accessions (data referenced by this study):

BioSample: SAMN04942738
BioProject: PRJNA320545





File Name | Sample ID | Data Type | File Format | Size | Release Date
 | | GitHub archive | zip | 134.38 MB | 2020-12-13
readme.txt | | | TEXT | 3.94 KB | 2020-12-21
Funding body | Awardee | Award ID | Comments
National Institutes of Health | D Landsman | | Intramural Research Program of the National Library of Medicine
Date Action
December 21, 2020 Dataset publish
December 29, 2020 Manuscript Link added : 10.1093/gigascience/giaa163
December 31, 2020 Updated "Accessions (data generated by this study) BioProject:PRJNA320545" to "Accessions (data referenced by this study) BioProject:PRJNA320545"
December 31, 2020 Updated "Accessions (data generated by this study) BioSample:SAMN04942738" to "Accessions (data referenced by this study) BioSample:SAMN04942738"
November 29, 2021 Manuscript Link updated : 10.1093/gigascience/giaa163