Data released on March 20, 2018

Supporting data for "Field of Genes: Using Apache Kafka as a Bioinformatic Data Repository"

Lawlor, B; Lynch, R; Mac Aogain, M; Walsh, P (2018): Supporting data for "Field of Genes: Using Apache Kafka as a Bioinformatic Data Repository". GigaScience Database. http://dx.doi.org/10.5524/100430

Bioinformatic research is increasingly dependent on large-scale data sets, accessed either from private or public repositories. An example of a public repository is NCBI's RefSeq. These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use, but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low-structure on one hand, and high-performance and scale on the other. To demonstrate this, we present a proof of concept version of NCBI's RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files.
The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files.
Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets like RefSeq, and private clinical and experimental data.
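
As a rough illustration of what "lightly structured data ... amenable to parallel access" might look like, the sketch below splits FASTA text into keyed records and assigns each record to a partition by hashing its key, so that separate consumers could each read one partition. It is not taken from the field-of-genes repository and does not call Kafka itself. The function names (parse_fasta, partition_for, to_partitions) and the use of CRC32 as the hash are assumptions made for this example; Kafka's own default partitioner uses a different hash.

```python
import zlib
from collections import defaultdict


def parse_fasta(text):
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    record_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def partition_for(key, num_partitions):
    """Map a record key to a partition index (hypothetical stand-in:
    CRC32 modulo the partition count, not Kafka's actual partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def to_partitions(text, num_partitions):
    """Group FASTA records by partition, mimicking keyed messages in a topic."""
    partitions = defaultdict(list)
    for record_id, sequence in parse_fasta(text):
        partitions[partition_for(record_id, num_partitions)].append(
            (record_id, sequence)
        )
    return dict(partitions)
```

Each record remains a simple key/value pair (identifier and sequence), so it can still be transformed into a more structured representation downstream, while the partitioning is what would let several readers work in parallel.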

Additional information:

https://github.com/blawlor/field-of-genes

Keywords:

apache kafka, scalability, parallelization, big data

Software

Funding:

  • Funding body - European Commission
  • Award ID - 324365
  • Comment - Seventh Framework Programme
  • Awardee - P Walsh
  • Funding body - Horizon 2020 Framework Programme
  • Award ID - 644186
  • Comment - Marie Skłodowska-Curie
  • Funding body - Irish Research Council (IE)
  • Award ID - EPSPD/2015/32
  • Awardee - Michael Mac Aogain
  • Funding body - Science Foundation Ireland
  • Award ID - 16/IFA/4342
  • Awardee - Paul Walsh

Files: (FTP site)

File Name | Sample ID | Data Type | File Format | Size | Release Date
 | | mixed archive | archive | 92.3 KB | 2018-03-19
 | | Readme | TEXT | 1.86 KB | 2018-03-19
Displaying 1-2 of 2 File(s).
