Supporting data for "Linking big biomedical datasets to modular analysis with Portable Encapsulated Projects"

Dataset type: Metadata, Software
Data released on October 27, 2021

Sheffield NC; Stolarczyk M; Reuter VP; Rendeiro AF (2021): Supporting data for "Linking big biomedical datasets to modular analysis with Portable Encapsulated Projects". GigaScience Database.


Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, metadata formats from a data provider are often incompatible with requirements of a processing tool. There is no broadly accepted standard to organize metadata across biological projects and bioinformatics tools, restricting the portability and reusability of both annotated datasets and analysis software.
To address this, we present Portable Encapsulated Projects (PEP), a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEP includes a schema validator framework, allowing formal definition of required metadata attributes for data analysis broadly. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata. PEP therefore presents an important step toward unifying data annotation and processing tools in data-intensive biological research projects. The formal PEP specification and links to associated tools and documentation are available at
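
As a rough illustration of the ideas in the abstract (a sample table, project-level modifiers applied to every sample, and a check for the attributes a tool requires), the following standard-library Python sketch may help. It does not use the PEP packages and does not follow the PEP specification's actual syntax; the table contents, the `append` and `derive` keys, the path template, and the function names are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical sample table; in practice this would usually be a CSV file.
SAMPLE_TABLE = """sample_name,protocol,file_id
frog_1,RNA-seq,f1
frog_2,ATAC-seq,f2
"""

# Hypothetical project-level settings, loosely inspired by the idea of
# sample modifiers. These keys are assumptions, not the PEP syntax.
PROJECT = {
    "append": {"genome": "hg38"},
    "derive": {"file_path": "/data/{protocol}/{file_id}.fastq.gz"},
}

# Hypothetical list of attributes that a processing tool might require.
REQUIRED = ["sample_name", "protocol", "file_path", "genome"]


def load_samples(table_text, project):
    """Read samples and apply the project-level modifiers to each one."""
    samples = []
    for row in csv.DictReader(io.StringIO(table_text)):
        sample = dict(row)
        # Add constant attributes shared by every sample.
        sample.update(project.get("append", {}))
        # Build new attributes from a template filled with existing ones.
        for attr, template in project.get("derive", {}).items():
            sample[attr] = template.format(**sample)
        samples.append(sample)
    return samples


def missing_attributes(samples, required):
    """Return {sample name: [missing attributes]} for failing samples."""
    problems = {}
    for index, sample in enumerate(samples):
        missing = [attr for attr in required if not sample.get(attr)]
        if missing:
            problems[sample.get("sample_name") or f"row {index}"] = missing
    return problems


samples = load_samples(SAMPLE_TABLE, PROJECT)
problems = missing_attributes(samples, REQUIRED)
```

Keeping the required-attribute list separate from the sample table mirrors the separation the abstract describes: the same annotated dataset could, in principle, be checked against the requirements of several different tools.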

File Name  Sample ID  Data Type       File Format  Size       Release Date
                      GitHub archive  zip          101.43 KB  2021-10-02
                      GitHub archive  zip          738.75 KB  2021-10-02
                      GitHub archive  zip          120.04 KB  2021-10-02
                      GitHub archive  zip          384.23 KB  2021-10-02
                      Readme          TEXT         3.27 KB    2021-10-02
Funding body                   Awardee       Award ID     Comments
National Institutes of Health  NC Sheffield  R35GM128636  Institute for General Medical Sciences (NIGMS)
Date               Action
October 27, 2021   Dataset publish
November 15, 2021  Manuscript Link added: 10.1093/gigascience/giab077
March 28, 2022     Manuscript Link updated: 10.1093/gigascience/giab077