A MATLAB library for protein function prediction.
Matlab functions in this package uses "pfp_
" (aka. Protein Function Prediction) as the file name prefix.
-
ONT
, the ontology structure, seepfp_ontbuild.m
. This structure has the following fields:- required fields:
term
: a list of term structures (id
,name
).rel_code
: a list of relationship codes, e.g. {is_a
,part_of
}.DAG
: the relationship matrix,DAG(i, j) = k (k>0)
means term(i) has k-th relationship ofrel_code
with term(j).ont_type
: the ontology type, e.g.molecular_function
.date
: the date that this structure is built.
- optional fields:
alt_list
: the alternate term ID list.
- required fields:
-
OA
, the ontology annotation structure, seepfp_oabuild.m
. This structure has the following fields:- required fields:
object
: the object (sequence) list.ontology
: the associated ontology structure.annotation
: the annotation matrix,annotation(i, j) = 1
means object(i) is annotated with term(j).date
: the date that this structure is built.
- required fields:
-
PRED
, the prediction structure, see (built by any prediction methods, similar toOA
). This structure has the following fields:- required fields:
object
: the object (sequence) list.ontology
: the associated ontology structure.score
: the predicted score matrix,score(i,j)
is the predicted score for the association of object(i) and term(j).date
: the date that this prediction is made.
- optional fields: (see CAFA rules for details of these fields)
author
: the author of this predictor.model
: the model number.keywords
: the keywords of this predictor.tag
: some additional information.
- required fields:
-
NET
, the network structure, seepfp_netbuild.m
. This structure has the following fields:- required fields:
object
: the object (node) list.ADJ
: the adjacency matrix.date
: the date that this network is built.
- required fields:
- "Training" data
-
Sequences in FASTA format.
-
Annotations (MFO terms for exmaple) for each of these sequences. This data needs to be prepared ahead of time as a two-column CSV file (delimited by TAB)
<sequence ID>\t<GO term ID>
where
<sequence ID>
would be of any system (e.g., UniProt accession number), as long as they are consistant with those used in the FASTA file.
-
- NCBI BLAST tool (used 2.2.29+ for this document)
- Query sequences in FASTA format.
-
STEP 1: Load annotations of training sequences.
oa = pfp_oabuild(ont, 'annotation.dat');
where
ont
is a MATLAB structure of ontology which can be built from and OBO file (say, 'ontology.obo') asont = pfp_ontbuild('ontology.obo');
Note that a typical gene ontology OBO file contains all three GO ontologies (i.e., MFO, BPO, and CCO), therefore,
pfp_ontbuild
returns a cell of THREE ontology strcutures instead:onts = pfp_ontbuild('go.obo')
By default, they are ordered as BPO, CCO, MFO, alphabetically. You can also double check the
.ont_type
field of each returning structure. -
STEP 2: Prepare BLAST results
-
Run
blastp
on the query sequences against the "training" sequences by setting output format to be the following:blastp ... -outfmt "6 qseqid sseqid evalue length pident nident" -out blastp.out
-
Load the tabular output file (
blastp.out
as shown above) into MATLAB:B = pfp_importblastp('blastp.out');
-
-
STEP 3: Build the BLAST predictor
blast = pfp_blast(qseqid, B, oa);
where
qseqid
is a cell list of query sequences on which you need scores. Note that it can be just a subset of all those you BLAST'ed.B
is the structure imported step 2, whileoa
is the ontology annotation structure loaded in step 1.Also, extra options can be specified as additional arguments to this function. See the documentation of
pfp_blast.m
for more details. Thus,blast
will be the BLAST predictor in MATLAB for evaluation.
GOtcha predictor can be build in the similar way of BLAST predictor.
gotcha = pfp_gotcha(qseqid, B, oa);
To build a naive predictor, all you need is the ontology annotation structure oa
that you have as in the step 1 of making a BLAST predictor. Then run the following in MATLAB:
naive = pfp_naive(qseqid, oa);