- Discover common biological databases
- Understand common bioinformatics file formats
There are many sources available for acquiring biological data. A partial, and by no means exhaustive, list includes:
- NCBI (National Center for Biotechnology Information)
- GenBank
- RefSeq
- SRA (Sequence Read Archive)
- Ensembl
- IMG (Integrated Microbial Genomes)
- DDBJ (DNA Data Bank of Japan)
- FungiDB
Note that many of these databases cross-reference one-another, with the same depositions found in multiple sources.
- NCBI - NCBI houses many major databases, including GenBank and RefSeq, which house genes and genome assemblies, and the SRA, which houses sequencing data. It's also home to databases like PubMed.
NCBI Taxonomy is one particularly useful database. The Taxonomy Browser displays a hierarchical view of the taxonomy, as well as more detailed taxon-specific pages that highlight the internal links to other databases, and LinkOut links to external resources and other taxon-specific data. It is particularly useful for finding genomes and assemblies.
- Go to the NCBI Taxonomy page and search for "Saccharomyces cerevisiae"
- Click on one of the
1
s in theGenome
row in the table on the right side to launch a new page. This page contains links to download the genome sequence and annotation files.
NCBI's Sequence Read Archive (SRA) stores sequence and quality data (fastq files) in aligned or unaligned formats from NextGen sequencing platforms.
- Open the NCBI SRA page and search for "Saccharomyces cerevisiae"
- Click on any of the listed results to show project information
- Under "Project Data" click on links to "SRA Experiments"
- In, this page, you can find:
- Experimental Design
- Sample Information
- Library Preparation Information
- Sequencing "Run" information with an identifier usually starting with "SRR"
Most journals require researchers to publish their sequencing data in a public repository upon manuscript publication. There's often a section somewhere in the publication that indicates where data were deposited, and their corresponding accession numbers. For example, the dataset we're using in these lessons comes from this paper. The text says the following:
DATA DEPOSITION
The data sets supporting the results of this article are available in the European Nucleotide Archive repository (ENA) (PRJEB5348, http://www.ebi.ac.uk/ena/data/view/ERX425102). All the code for this work is publicly available (https://github.com/bartongroup/profDGE48).
The ENA and SRA mirror each other, so we can view this accession from either location. It is often easier to download single or a handful of accessions from the ENA. In our R lesson, we will work with the ENA table from this dataset.
There are a lot of different file formats in bioinformatics! We've listed a few common ones below (and their common extensions):
- FASTA – plain sequences (.fa, .fasta, .fna, .fnn, if amino acids often .faa)
- FASTQ – sequencing reads (.fq, .fastq)
- GFF – general feature format (.gff)
- GTF - variation of GFF (.gtf)
- VCF – sequence variants (.vcf)
- SAM – sequence alignments (.sam)
- BAM – alignments in binary (a compressed version of SAM, .bam)
Ensembl is another useful database if you're working with a model organism. It provides a browser for vertebrate (and sometimes other) genomes and supports research in comparative genomics, evolution, sequence variation, and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function, and collects disease data. Ensembl tools include BLAST, BLAT, BioMart, and the Variant Effect Predictor (VEP) for all supported species. Below we demonstrate how to find annotated genes using Ensemble.
- Open Ensemble webpage and search for "Saccharomyces cerevisiae"
- Under the "Gene Annotation" section you can download FASTA files for genes, cDNA, ncRNA, proteins, and annotations for the same
- This is the ftp site of ensemble