Defining gene to category annotations
This page describes how to retrieve and process data from an ontology, such as GO biological processes, in the form of gene-to-category annotations. These categories define the units on which enrichment is assessed.
Note that this step can be skipped by downloading pre-processed results from figshare [computed on data as of 2019-04-17].
This involves the following steps:
- Retrieve and process the GO hierarchy data.
- Retrieve and process the annotations of genes to GO Terms.
- Iteratively propagate gene-to-Term annotations from child to parent up the GO hierarchy.
There are several ways to download the GO Term hierarchy. We used the termdb MySQL database dump, and linked to this database from Matlab using a MySQL Java connector. Code for achieving this (e.g., in the Matlab_mySQL repository) is a dependency for this package.
Note: the data is also provided in raw form as go-basic.obo (the basic file ensures that annotations can be propagated), and you can also download the data as a database.
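For readers working from the raw go-basic.obo file rather than the database, the child-to-parent is_a links can be read directly from its [Term] stanzas. The following is a minimal Python sketch, not part of this package: the function name `parse_obo_is_a` and the in-memory example stanzas are illustrative assumptions, and only the `id`, `namespace`, `is_a`, and `is_obsolete` tags are handled.

```python
from collections import defaultdict


def parse_obo_is_a(lines, namespace="biological_process"):
    """Return {term_id: set(parent_ids)} built from is_a tags in [Term] stanzas.

    Hypothetical helper for illustration; obsolete terms and terms from
    other namespaces are skipped.
    """
    parents = defaultdict(set)

    def store(term):
        # Keep only non-obsolete terms from the requested namespace.
        if term.get("id") and term.get("namespace") == namespace and not term.get("obsolete"):
            parents[term["id"]].update(term.get("is_a", []))

    term, in_term = {}, False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):          # a new stanza begins
            if in_term:
                store(term)
            in_term = line == "[Term]"
            term = {}
        elif in_term and ": " in line:
            key, value = line.split(": ", 1)
            if key == "is_a":
                # "is_a: GO:0048308 ! organelle inheritance" -> keep the ID only
                term.setdefault("is_a", []).append(value.split()[0])
            elif key == "is_obsolete":
                term["obsolete"] = value == "true"
            elif key in ("id", "namespace"):
                term[key] = value
    if in_term:
        store(term)
    return dict(parents)


# Small in-memory example in OBO syntax (abbreviated for illustration).
example_obo = """\
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000005
namespace: molecular_function
is_obsolete: true

[Typedef]
id: part_of
"""

hierarchy = parse_obo_is_a(example_obo.splitlines())
```

With a real file, `parse_obo_is_a(open(path))` would be the equivalent call, where `path` points to the downloaded go-basic.obo.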
- Set up the downloaded termdb MySQL database, and put connection details in ConnectMeDatabase.
- Retrieve Biological Process GO Terms, and save the filtered set of terms to a .mat file:
GOTerms = GetGOTerms('biological_process',true);
Saves out to ProcessedData/GOTerms_BP.mat.
Now that we have the GO Terms in Matlab format, we next need data on which genes are annotated to which GO Terms.
Annotation files should be downloaded directly from the GO website.
- For Mus musculus, the annotation file is mgi.gaf.
- For Homo sapiens, the annotation file is goa_human.gaf.
The appropriate annotation file(s) should be placed in the RawData directory.
Each line in the annotation file represents an association between a gene product and a GO term with a certain evidence code, and the reference to support the association.
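To make the line structure concrete, here is a hedged Python sketch that splits one GAF line into its main fields. It is not the package's Matlab reader: `GafAnnotation` and `parse_gaf_line` are hypothetical names, the column positions follow the tab-delimited GAF 2.x layout, and the example record is synthetic.

```python
from typing import NamedTuple, Optional


class GafAnnotation(NamedTuple):
    db: str          # column 1: source database, e.g. MGI
    object_id: str   # column 2: gene product identifier
    symbol: str      # column 3: gene symbol
    qualifier: str   # column 4: e.g. NOT (may be empty in older files)
    go_id: str       # column 5: GO term identifier
    reference: str   # column 6: supporting reference
    evidence: str    # column 7: evidence code, e.g. IDA
    aspect: str      # column 9: P, F, or C


def parse_gaf_line(line: str) -> Optional[GafAnnotation]:
    """Parse one GAF line; return None for comment ('!') or blank lines."""
    if not line.strip() or line.startswith("!"):
        return None
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 15:
        raise ValueError(f"expected at least 15 tab-separated columns, got {len(cols)}")
    return GafAnnotation(cols[0], cols[1], cols[2], cols[3],
                         cols[4], cols[5], cols[6], cols[8])


# Synthetic record for illustration only (identifiers are placeholders).
example_fields = ["MGI", "MGI:0000001", "GeneA", "", "GO:0000001", "PMID:0000000",
                  "IDA", "", "P", "example gene", "", "gene", "taxon:10090",
                  "20190417", "MGI", "", ""]
record = parse_gaf_line("\t".join(example_fields) + "\n")
```

Header lines in GAF files start with `!`, which is why the sketch skips them before splitting on tabs.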
The ReadDirectAnnotationFile function reads in all of this raw data, and processes it into a Matlab table, with a row for each GO Category, including information about the category and the genes that are annotated to it.
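The reshaping from one-row-per-association to one-row-per-category can be sketched in Python as below. This is an illustration of the idea only, not the behaviour of ReadDirectAnnotationFile: the function name `group_genes_by_category`, the tuple layout, and the choice to skip `NOT`-qualified associations are all assumptions.

```python
from collections import defaultdict


def group_genes_by_category(records, aspect="P", skip_negated=True):
    """Collect the unique genes annotated to each GO category.

    records: iterable of (gene_id, qualifier, go_id, aspect) tuples.
    Returns {go_id: sorted list of gene_ids}.
    """
    categories = defaultdict(set)
    for gene_id, qualifier, go_id, record_aspect in records:
        if record_aspect != aspect:
            continue  # keep a single ontology aspect, e.g. P for biological process
        if skip_negated and "NOT" in qualifier.split("|"):
            continue  # assumption: negated associations are not counted
        categories[go_id].add(gene_id)
    return {go_id: sorted(genes) for go_id, genes in categories.items()}


# Synthetic associations for illustration.
example_records = [
    ("g1", "", "GO:0000001", "P"),
    ("g2", "", "GO:0000001", "P"),
    ("g1", "", "GO:0000001", "P"),     # duplicate association, counted once
    ("g3", "NOT", "GO:0000001", "P"),  # negated, skipped
    ("g4", "", "GO:0000002", "F"),     # different aspect, skipped
]
direct = group_genes_by_category(example_records)
```

Using sets while collecting means that a gene supported by several evidence lines for the same term appears only once in that category.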
Before ReadDirectAnnotationFile can be run, it requires a mapping from MGI gene identifiers to NCBI Entrez gene identifiers. For mouse, this mapping is obtained from MouseMine:
python3 MGI_NCBI_downloadall.py
This saves the required gene identifier mapping to ALL_MGI_ID_NCBI.csv.
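As an illustration of how such a mapping file might be applied, the Python sketch below loads a two-column CSV and translates a list of MGI identifiers. The column names `MGI_ID` and `NCBI_ID`, both function names, and the example rows are assumptions made for this sketch; the actual layout of ALL_MGI_ID_NCBI.csv should be checked before adapting it.

```python
import csv
import io


def load_id_mapping(handle, source_column="MGI_ID", target_column="NCBI_ID"):
    """Read a CSV with a header row into {source_id: target_id}.

    Column names are hypothetical; rows missing either identifier are skipped.
    """
    mapping = {}
    for row in csv.DictReader(handle):
        source = (row.get(source_column) or "").strip()
        target = (row.get(target_column) or "").strip()
        if source and target:
            mapping[source] = target
    return mapping


def map_gene_ids(gene_ids, mapping):
    """Return (mapped target IDs, source IDs without a mapping)."""
    mapped = [mapping[g] for g in gene_ids if g in mapping]
    unmapped = [g for g in gene_ids if g not in mapping]
    return mapped, unmapped


# Synthetic CSV content for illustration (placeholder identifiers).
example_csv = io.StringIO("MGI_ID,NCBI_ID\nMGI:0000001,1001\nMGI:0000002,\n")
id_mapping = load_id_mapping(example_csv)
mapped, unmapped = map_gene_ids(["MGI:0000001", "MGI:0000003"], id_mapping)
```

Returning the unmapped identifiers separately makes it easy to see how many annotated genes are lost at the mapping step.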
In the case of human data, we mapped onto gene symbols from processed gene-expression data from the Allen Human Brain Atlas.
ReadDirectAnnotationFile('mouse')
Saves processed data as GOAnnotationDirect-mouse.mat (or GOAnnotationDirect-human.mat), in the ProcessedData directory.
Note (NOT RECOMMENDED): Annotations processed from GEMMA can alternatively be read using ReadGEMMAAnnotationFile.
Annotations are made at the most specific (lowest) applicable level of the GO term hierarchy.
An annotation to a term at a lower level of the hierarchy also applies to all of that term's parent terms.
For performing enrichment, we therefore need to iteratively propagate direct annotations up the hierarchy, using is_a (child-parent) relationships from the term2term table from the GO Term database.
For mouse biological processes, this is achieved using:
propagateHierarchy('mouse','biological_process');
The code takes processed data (e.g., GOAnnotationDirect-mouse.mat) and saves propagated output as GOAnnotationDirect-mouse-biological_process-Prop.mat.
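The effect of propagation can be illustrated with a short Python sketch. It is not the propagateHierarchy implementation: rather than looping level by level, it collects every ancestor of each directly annotated term through is_a links and adds the term's genes to each of them, which should give the same final gene sets. The names `ancestors_of` and `propagate_annotations` and the toy hierarchy are assumptions.

```python
def ancestors_of(term, parents):
    """Return all ancestors of `term`, following child -> parent links.

    parents: {child_id: iterable of parent_ids} (is_a relationships).
    """
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate_annotations(direct, parents):
    """Add each term's directly annotated genes to all of its ancestors.

    direct: {term_id: set of gene_ids}. Returns a new dict; inputs are not modified.
    """
    propagated = {term: set(genes) for term, genes in direct.items()}
    for term, genes in direct.items():
        for ancestor in ancestors_of(term, parents):
            propagated.setdefault(ancestor, set()).update(genes)
    return propagated


# Toy hierarchy: C is_a B, B is_a A, D is_a A.
toy_parents = {"C": {"B"}, "B": {"A"}, "D": {"A"}}
toy_direct = {"C": {"g1"}, "B": {"g2"}, "D": {"g3"}}
toy_propagated = propagate_annotations(toy_direct, toy_parents)
```

In the toy example, the root term A has no direct annotations but ends up with all three genes, while the leaf terms C and D keep only their direct genes.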
These propagated annotations can then be used for enrichment analysis.