Help

Introduction

GigaDB search

Submission guidelines

Controlled vocabulary

Application Programming Interface


Introduction

The GigaDB website allows any user to browse, search and view datasets and to access data files. If you want to submit a dataset, save searches or be alerted to new content of interest, we ask that you create an account.

A 'Latest news' section announces any updates or new features to the database, and the RSS feed automatically announces each new dataset release.

The GigaDB homepage allows you to browse datasets by type eg Genomic, Metagenomic, Transcriptomic. Clicking on the DOI (digital object identifier) or image will take you directly to the webpage for the dataset of interest.

Alternatively you can use the search functions to find datasets, samples or files of interest.


GigaDB search

Search Operation

To search across all Dataset, Sample and File records in GigaDB, simply enter a search term in the search bar found at the top of all GigaDB pages.

The search is case-insensitive, so uppercase and lowercase keywords return the same results.

Search results

The search results are grouped by GigaDB Datasets (G), Samples (S) and Files (F).

For each dataset result, the author names and DOI are displayed. Hovering over the dataset name shows the dataset description. Dataset and sample names are linked to the DOI page for those data, and file links are provided for download.

For each sample result, the sample name, species name and species ID are displayed with links to the NCBI taxonomy page for the species and to the GigaDB dataset page.

For each file result, the file name, file type and file size are displayed with a direct link to the FTP server location of that file.

Only objects that directly match the search term are displayed in the search results, i.e. the only Files displayed are those that match the search term; all other files within the same dataset will NOT be displayed.

For example, searching for the term “Potato” will return the dataset with the title “Genomic data from the potato”, which contains 17 files. However, the search results table will display only 3 of those 17 files, because only 3 contain the search term “potato”. To find all data associated with a dataset, you must follow the link to the dataset page.


Filtering results

On the left of the search results you have the option to further refine the results by using the filters. By default all filters are disabled, allowing you to see all search results for your keyword. If you want to hide some results based on some criteria, choose the filter for your criteria, and select the options that match what you want to see.

Filter options for Datasets:

Filter options for Samples:

Filter options for Files:

Click the 'Apply Filters' button to see your refined results table.


Submission guidelines

GigaDB is an open-access database. As such, all data submitted to GigaDB must be fully consented for public release (for more information about our data policies, please see our Terms of use page).

All sequence, assembly, variation, and microarray data must be deposited in a public database at NCBI, EBI, or DDBJ before you submit them to GigaDB. In the cases where you would like GigaDB to host files associated with genomic data not fully consented for public release, you must first submit the non-public data to dbGaP or EGA.

Step 1 - Create an account or log in to GigaDB

Step 2 - Download and complete the Excel template file. Completed example files for the E. coli (10.5524/100001) and Sorghum (10.5524/100012) datasets are available.

The template file contains:

Mandatory fields are highlighted in yellow.

Study

Required information includes submitter name, email and affiliation, upload status [can we publish this dataset immediately after review (Publish) or should it be held until publication (HUP)], author list, dataset type(s) (selected from a controlled vocabulary list), dataset title and description, estimated total size of the files that will be submitted and dataset image information.

Optional information includes links to additional resources and related manuscripts, accessions for data in other databases (prefixes are found in the Links tab), and relationship (if any) to a previously published GigaDB dataset (selected from a controlled vocabulary list).

Samples

Required information includes a sample ID or name (please use an NCBI BioSample ID when possible), species NCBI taxonomy ID, and species common name.

Optional information includes sample attributes (these are automatically populated in GigaDB if an NCBI BioSample ID is provided).

Files

Required information includes a file name or path relative to your home directory and file type (selected from a controlled vocabulary list). A readme file must be provided.

Optional information includes a file description and a sample ID or name.

Step 3 - Confirm you have read our Terms of use page and upload the completed Excel template file.

You can expect a response from the GigaDB team within 5 days to verify the information in your submission and to arrange upload of your files to our FTP site.

If you have any questions, please contact us at database@gigasciencejournal.com.


Controlled vocabulary

Dataset types

Genomic - includes all genetic and genomic data eg sequence, assemblies, alignments, genotypes, variation and annotation.
Minimal requirements: DNA sequence data eg next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files).

Epigenomic - includes methylation and histone modification data.
Minimal requirements: Details on methylation sites/status eg qmap files OR details on histone modification sites/status.

Metagenomic - includes all genetic and genomic data eg sequence, assemblies, alignments, genotypes, variation and annotation from environmental samples.
Minimal requirements: Environmental DNA sequence data eg next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files).

Proteomic - includes all mass spec data.
Minimal requirements: Peptide/protein data eg mass spec.

Transcriptomic - includes all data relating to mRNA.
Minimal requirements: RNA sequence data eg next-gen raw reads (fastq files) OR transcript statistics eg RNA coverage/depth.

Additional dataset types can be added, upon review, as new submissions are received.


File types

File types and examples of associated file extensions:

Alignments: .bam, .chain, .maf, .net, .sam

Allele frequencies: .frq

Annotation: .gff, .ipr, .kegg, .wego

Coding sequence: .cds, .fa

InDels: .gff, .txt, .vcf

ISA-Tab: see ISA tools

Genome assembly: .agp, .contig, .depth, .fa, .length, .scafseq

Genome sequence: .fastq, .fq

Haplotypes: .haplotype

Methylome data: .fa, .qmap, .rpm, .txt

Protein sequence: .fa, .pep

Readme: .pdf, .txt

SNPs: .annotation, .gff, .txt, .vcf

SVs: .gff, .txt, .vcf

Transcriptome data: .depth, .rpkm, .wig

Other: .xls, .pdf, .txt

Additional file types can be added, upon review, as new submissions are received.
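
Several extensions in the list above appear under more than one file type, so an extension alone does not always determine the type. The following is a minimal Python sketch of a lookup built only from that list; the names FILE_TYPES and candidate_types are illustrative and not part of GigaDB, and compressed files (e.g. .gz) are not handled.

```python
# Example extensions per file type, copied from the list above.
# ISA-Tab is omitted because no extension is listed for it.
FILE_TYPES = {
    "Alignments": [".bam", ".chain", ".maf", ".net", ".sam"],
    "Allele frequencies": [".frq"],
    "Annotation": [".gff", ".ipr", ".kegg", ".wego"],
    "Coding sequence": [".cds", ".fa"],
    "InDels": [".gff", ".txt", ".vcf"],
    "Genome assembly": [".agp", ".contig", ".depth", ".fa", ".length", ".scafseq"],
    "Genome sequence": [".fastq", ".fq"],
    "Haplotypes": [".haplotype"],
    "Methylome data": [".fa", ".qmap", ".rpm", ".txt"],
    "Protein sequence": [".fa", ".pep"],
    "Readme": [".pdf", ".txt"],
    "SNPs": [".annotation", ".gff", ".txt", ".vcf"],
    "SVs": [".gff", ".txt", ".vcf"],
    "Transcriptome data": [".depth", ".rpkm", ".wig"],
    "Other": [".xls", ".pdf", ".txt"],
}


def candidate_types(filename):
    """Return every file type whose example extensions match the file name."""
    name = filename.lower()
    return [
        file_type
        for file_type, extensions in FILE_TYPES.items()
        if any(name.endswith(ext) for ext in extensions)
    ]


print(candidate_types("potato.gff"))
print(candidate_types("reads_1.fq"))
```

A .gff file is a candidate for four types, which is why the file type has to be chosen explicitly in the template rather than inferred.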


File formats

AGP (.agp) - the Accessioned Golden Path (AGP) file describes the assembly of a larger sequence object from smaller objects:

chr1 1 1972671 0 W scaffold43 1 1972671 m
chr1 1972672 3061819 1 W scaffold8 1 1089148 p
chr1 3061820 3181505 2 W scaffold548 1 119686 m
chr1 3181506 4176151 3 W scaffold313 1 994646 m

The large object can be a contig, a scaffold (supercontig), or a chromosome.
See AGP Specification v2.0
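
As a rough illustration of how the four example lines can be read, the Python sketch below parses each line into the nine columns named in the AGP specification and checks that the span on the object equals the span on the component. It handles only component lines like those shown (not gap lines), splits on any whitespace, and leaves the orientation codes uninterpreted; the names AgpRow, parse_agp_line and spans_match are illustrative.

```python
from typing import NamedTuple


class AgpRow(NamedTuple):
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    component_id: str
    comp_beg: int
    comp_end: int
    orientation: str


def parse_agp_line(line):
    """Parse one nine-column component line (gap lines are not handled)."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return AgpRow(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                  f[5], int(f[6]), int(f[7]), f[8])


def spans_match(row):
    """True if the object span and the component span have the same length."""
    return row.obj_end - row.obj_beg == row.comp_end - row.comp_beg


EXAMPLE = """\
chr1 1 1972671 0 W scaffold43 1 1972671 m
chr1 1972672 3061819 1 W scaffold8 1 1089148 p
chr1 3061820 3181505 2 W scaffold548 1 119686 m
chr1 3181506 4176151 3 W scaffold313 1 994646 m
"""

rows = [parse_agp_line(line) for line in EXAMPLE.splitlines() if line.strip()]
for row in rows:
    print(row.component_id, spans_match(row))
```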

BAM (.bam) - the Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments.

BIGWIG (.bw) - the BIGWIG format is for storing dense, continuous data (such as GC percent, probability scores, and transcriptome data) that will be displayed in the UCSC Genome Browser as a graph. BIGWIG files are created initially from wiggle (WIG) type files, using the program wigToBigWig.

CHAIN (.chain) - the CHAIN format describes a pairwise alignment that allows gaps in both sequences simultaneously and is used by the UCSC Genome Browser.

CONTIG (.contig) - the CONTIG format is a direct output from the SOAPdenovo assembly program:

>1 length 32 cvg_0.0_tip_0
GAGAACGGCGAAGCCTGCTCGGGCCCGTTATA
>3 length 32 cvg_23.0_tip_0
TAGCAGCGATTTGATCAAACTCAATCTTACCG
>5 length 32 cvg_40.0_tip_0
GGTAAGATTGAGTTTGATCAAATCGCTGCTAT

EXCEL (.xls, .xlsx) - Microsoft Office spreadsheet files

FASTA (.fasta, .fa, .seq, .cds, .pep, .scafseq [SOAPdenovo output file - sequence of each scaffold]) - FASTA is a text-based format for representing either nucleotide sequences or peptide sequences.

FASTQ (.fq, .fastq) - the FASTQ format stores sequences (usually nucleotide sequence) and Phred qualities in a single file.

GFF (.gff) - The General Feature Format (GFF) is used for describing genes and other features of DNA, RNA and protein sequences.

IPR (.ipr) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the IPR (InterPro) ID(s):

CR_ENSP00000334840
CR_ENSMMUP00000018123 IPR000504 IPR003954
CR_ENSP00000333725 IPR001781 IPR015880 IPR007087 IPR001909

See WEGO: a web tool for plotting GO annotations

KEGG (.kegg) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the KEGG (Kyoto Encyclopedia of Genes and Genomes) ID(s):

CR_ENSMMUP00000031408 ko03010
CR_ENSP00000364815 ko00970 ko00290
CR_ENSP00000414605 ko05146 ko04510 ko04512

See WEGO: a web tool for plotting GO annotations

MAF (.maf) - the Multiple Alignment Format (MAF) stores a series of multiple alignments at the DNA level between entire genomes.

NET (.net) - the NET file format is used to describe the axtNet data that underlie the net alignment annotations in the UCSC Genome Browser.

PDF (.pdf) - portable document format

PNG (.png) - portable network graphics

QMAP (.qmap) - QMAP files are generated for methylation data from an internal BGI pipeline.

QUAL (.qual) - the QUAL file format stores base quality scores for next-gen sequencing data (similar in format to FASTA).

RPKM (.rpkm) - Gene expression levels are calculated as Reads Per Kilobase per Million mapped reads (RPKM), eg a 1 kb transcript with 1000 alignments in a sample of 10 million reads (of which 8 million can be mapped) will have RPKM = 1000/(1 * 8) = 125:

ENSP00000379387 15.5651433366423 6002951 289 3093
ENSP00000349977 24.7483107230444 6002951 398 2679
ENSP00000368887 24.6477413647837 6002951 174 1176
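
The calculation can be sketched in Python as below. The function reproduces the worked example (1000 alignments, 1 kb transcript, 8 million mapped reads). The second call assumes that the columns in the example rows are ID, RPKM, total mapped reads, read count and transcript length in bp; that column order is an inference from the numbers and is not stated by GigaDB. The function name rpkm is illustrative.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


# Worked example from the text: 1000 / (1 * 8)
print(rpkm(1000, 1000, 8_000_000))

# First example row, under the assumed column order:
# read count 289, transcript length 3093 bp, 6002951 mapped reads
print(rpkm(289, 3093, 6002951))
```

Under that assumption the second value agrees with the RPKM listed in the first example row.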

SAM (.sam) - the Sequence Alignment/Map (SAM) format is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Most often it is generated as a human readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form. Currently, most SAM format data is output from aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome. In the future, SAM will also be used to archive unaligned sequence data generated directly from sequencing machines.
See The Sequence Alignment/Map format and SAMtools

TAR (.tar) - an archive containing other files

TEXT (.doc, .readme, .text, .txt) - a text file

VCF (.vcf) - the Variant Call Format (VCF) is a text file format for representing eg SNPs, InDels, CNVs, SVs, microsatellites, genotypes.

WEGO (.wego) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the GO ID(s):

Bmb015379_2_IPR001092
Bmb003749_1_IPR006329 GO:0009168 GO:0003876
Bmb006173_1_IPR000909 GO:0007165 GO:0004629 GO:0007242

See WEGO: a web tool for plotting GO annotations
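
The .ipr, .kegg and .wego examples above share one layout: a protein ID followed by zero or more annotation IDs. A minimal Python sketch for reading one such line is shown below. It assumes the columns are separated by whitespace (the real files may use tabs, which this also accepts), and the function name parse_annotation_line is illustrative.

```python
def parse_annotation_line(line):
    """Split a line into (protein_id, [annotation_id, ...])."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    return fields[0], fields[1:]


examples = [
    "CR_ENSMMUP00000018123 IPR000504 IPR003954",
    "CR_ENSP00000364815 ko00970 ko00290",
    "Bmb015379_2_IPR001092",  # a protein with no annotation IDs
]
for line in examples:
    protein_id, annotation_ids = parse_annotation_line(line)
    print(protein_id, annotation_ids)
```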

WIG (.wig) - the output file from TopHat is a UCSC wigglegram of alignment coverage.

UNKNOWN - any file format not in this list

XML (.xml) - eXtensible Markup Language


Upload status

Publish: this dataset is fully consented for immediate release upon GigaDB approval

HUP: this dataset should be Held Until Publication (HUP)


DOI relationship

The DOI relationship vocabulary is taken from the DataCite 'relationType' schema property (ID=12.2).

Definition: Description of the relationship of the resource being registered (A) and the related resource (B).

IsSupplementTo: indicates that A is a supplement to B

IsSupplementedBy: indicates that B is a supplement to A

IsNewVersionOf: indicates A is a new edition of B, where the new edition has been modified or updated

IsPreviousVersionOf: indicates A is a previous edition of B

IsPartOf: indicates A is a portion of B; may be used for elements of a series

HasPart: indicates A includes the part B

References: indicates B is used as a source of information for A

IsReferencedBy: indicates A is used as a source of information by B
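
The eight terms above form four pairs in which each term is the inverse of the other: if A IsPartOf B, then B HasPart A. A small Python sketch of that pairing follows; the names INVERSE_RELATION and inverse are illustrative and not part of GigaDB or DataCite.

```python
# One direction of each pair, as defined in the list above.
INVERSE_RELATION = {
    "IsSupplementTo": "IsSupplementedBy",
    "IsNewVersionOf": "IsPreviousVersionOf",
    "IsPartOf": "HasPart",
    "References": "IsReferencedBy",
}
# Add the reverse direction so the lookup works both ways.
INVERSE_RELATION.update({b: a for a, b in list(INVERSE_RELATION.items())})


def inverse(relation):
    """Return the term describing the same link seen from the other resource."""
    return INVERSE_RELATION[relation]


print(inverse("IsPartOf"))
print(inverse("IsReferencedBy"))
```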

Missing Value reporting

For attributes (sample, dataset or files) that have some or all values missing please use the following controlled value terms to describe the exact reason for the missing value.

not applicable: information is inappropriate to report; often this attribute can be removed entirely

restricted access: information exists but cannot be released openly because of privacy concerns

not provided: information is not available at the time of submission; a value may be provided at a later stage

not collected: information was not collected and will therefore never be available
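
Before submission, attribute values could be checked against these four terms. The Python sketch below is one hedged way to do so. The four controlled terms come from the list above, but the set of ad hoc placeholders it rejects (empty string, "na", "n/a", "none") is an assumption made for illustration, as are the names MISSING_VALUE_TERMS and classify.

```python
MISSING_VALUE_TERMS = {
    "not applicable",
    "restricted access",
    "not provided",
    "not collected",
}

# Assumed examples of placeholders that should be replaced by a controlled term.
AD_HOC_PLACEHOLDERS = {"", "na", "n/a", "none"}


def classify(value):
    """Return 'missing', 'invalid' or 'value' for one attribute value."""
    text = value.strip()
    if text in MISSING_VALUE_TERMS:
        return "missing"
    if text.lower() in AD_HOC_PLACEHOLDERS:
        return "invalid"
    return "value"


for example in ["not collected", "N/A", "42 cm"]:
    print(repr(example), classify(example))
```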

Application Programming Interface

Availability

The current API version is available on our main production database. This version will be periodically updated with additional functionality. Whenever possible we will maintain backwards compatibility, but occasionally this may not be possible. For this reason we recommend regularly checking and updating your usage of our API.
The basic functionality of the API is to retrieve dataset metadata held in GigaDB. The actual data files still need to be pulled by FTP, but you can gather the exact FTP locations from the metadata using the API, then use them to pull only the files you actually need.
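
One way to gather FTP locations from returned metadata, without relying on any particular element names, is to scan every element and attribute of the XML for values that start with ftp://. The Python sketch below does this. The sample XML is made up for illustration and is NOT the GigaDB v3 schema; the function name ftp_locations is also illustrative.

```python
import xml.etree.ElementTree as ET


def ftp_locations(xml_text):
    """Return the unique ftp:// values found in element text or attributes."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        candidates = [element.text or ""] + list(element.attrib.values())
        for text in candidates:
            text = text.strip()
            if text.startswith("ftp://") and text not in found:
                found.append(text)
    return found


# Hypothetical response body, for demonstration only.
SAMPLE = """<record>
  <item location="ftp://example.org/pub/a.fa"/>
  <item><link>ftp://example.org/pub/readme.txt</link></item>
  <item><link>http://example.org/page</link></item>
</record>"""

print(ftp_locations(SAMPLE))
```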

The search function is based on the web-search function and will therefore give the same results.

Comments and Bug reporting

The GigaScience GitHub issue for the API work is here:
https://github.com/gigascience/gigadb-website/issues/27
Please add feedback, comments or questions to that issue.

Summary

It is currently possible to search "all" fields, or to specify one of a select few fields to search.

It is possible to have results return all metadata for each dataset with "hits" to the search term, or to specify a particular portion of the metadata. These portions are currently "dataset", "sample" and "file", in line with the same functionality in the web-search tool. The default is to return results as GigaDB v3 XML.

It is planned that we will have the option to specify the format to be GigaDBv3-JSON or ISA2.0-JSON in the future, but that has not been implemented yet.

Terminology

To specify the exact fields to return data from, use the terms dataset?=, sample?=, file?= (or experiment?=*)
* - experiment will be implemented in the future

To search for datasets without knowing their IDs, use the term search?keyword=

To search by specific attributes use search?<attribute_name>=
Available attribute_name values to search include:
taxno = Taxonomic ID (NCBI)
taxname = species name (NB must be the exact spelling; synonyms are not searched)
author = restricts search to the author table
datasettype = restricts search to the types of datasets, e.g. metagenomic, genomic, transcriptomic etc.
manuscript = restricts search to the manuscript ID associated with GigaDB dataset(s), e.g. search?manuscript=10.1186/2047-217X-3-21
project = restricts search to the project name, e.g. Genome 10K
e.g.
.../search?taxno=9606

To restrict the results returned to ONLY a particular level of data, add &result=dataset, &result=sample or &result=file to the query:
e.g.
http://gigadb.org/api/search?project=Genome%2010K&result=sample
NB - the search still looks everywhere, but the results returned are only those samples that are in datasets that are found by the search.
Default results are "dataset" only.
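
The search terms described above can be assembled into a URL programmatically. The Python sketch below builds such URLs with the standard library, percent-encoding spaces as %20 to match the example above. The function name search_url and the constants are illustrative; the field and result names are those listed in this section, and no request is sent.

```python
from urllib.parse import quote, urlencode

API_BASE = "http://gigadb.org/api"
SEARCH_FIELDS = {"keyword", "taxno", "taxname", "author",
                 "datasettype", "manuscript", "project"}
RESULT_LEVELS = {"dataset", "sample", "file"}


def search_url(field, term, result=None):
    """Build a search URL for one field, optionally limiting the result level."""
    if field not in SEARCH_FIELDS:
        raise ValueError(f"unknown search field: {field}")
    params = {field: term}
    if result is not None:
        if result not in RESULT_LEVELS:
            raise ValueError(f"unknown result level: {result}")
        params["result"] = result
    # quote_via=quote encodes spaces as %20; '/' and ':' are left as-is.
    return f"{API_BASE}/search?" + urlencode(params, quote_via=quote, safe="/:")


print(search_url("project", "Genome 10K", "sample"))
print(search_url("taxno", 9606))
```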

Examples


1. retrieve known datasets by doi
http://gigadb.org/api/dataset?doi=100051

2. retrieve samples from a known DOI
http://gigadb.org/api/sample?doi=100051

3. retrieve file information from a known DOI
http://gigadb.org/api/file?doi=100051

4. Search all GigaDB by keyword, return only the top level dataset metadata
http://gigadb.org/api/search?keyword=chimp&result=dataset

5. Search all GigaDB by keyword, return only the sample level metadata
http://gigadb.org/api/search?keyword=chimp&result=sample

6. Search all GigaDB by keyword, return only the file level metadata
http://gigadb.org/api/search?keyword=chimp&result=file

7. refine search to just the title of the dataset
http://gigadb.org/api/search?keyword=title:human&result=dataset

8. refine search to the descriptions of datasets
http://gigadb.org/api/search?keyword=description:human&result=dataset

9. refine search to NCBI taxonomic ID
http://gigadb.org/api/search?taxno=9606&result=dataset

10. refine search to taxonomic names
http://gigadb.org/api/search?taxname=Homo%20sapiens&result=dataset

11. refine search to Authors
http://gigadb.org/api/search?author=Wang%20Jun

12. refine search to linked manuscript IDs
http://gigadb.org/api/search?manuscript=10.1371/journal.pone.0005795

13. refine search to dataset types
http://gigadb.org/api/search?datasettype=Genomic

14. refine search to project names
http://gigadb.org/api/search?project=Genome%2010K&result=sample

15. list all dataset doi
http://gigadb.org/api/list

16. dump the database
http://gigadb.org/api/dump

Command line usage

You can also use curl on the command line to retrieve metadata:

e.g.

curl "http://gigadb.org/api/dataset?doi=100051"


If you want to check whether a query will work you can use the -I flag, which returns only the HTTP response headers:

curl -I "http://gigadb.org/api/dataset?doi=100051"

HTTP/1.1 200 OK

or

HTTP/1.1 404 Not Found / HTTP/1.1 500 Internal Server Error
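
The same check can be sketched in Python with the standard library. The example below builds a HEAD request, which is what curl -I sends, and returns the status code, treating 404 and 500 responses as codes rather than exceptions. The function names head_request and status_of are illustrative. Only the request object is built when the block runs; calling status_of performs a real network request.

```python
import urllib.error
import urllib.request


def head_request(url):
    """Build (but do not send) a HEAD request for the given URL."""
    return urllib.request.Request(url, method="HEAD")


def status_of(url, timeout=10):
    """Send the HEAD request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(head_request(url), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 or 500


request = head_request("http://gigadb.org/api/dataset?doi=100051")
print(request.get_method(), request.full_url)
```

To run the check itself, call status_of("http://gigadb.org/api/dataset?doi=100051") from a machine with network access.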