P2Rank conservation
What you will learn here:
- use a model with conservation
- train and evaluate a new p2rank model with conservation
This tutorial provides a general introduction to using evolutionary conservation as a feature in P2Rank. It was mostly written before the HMMER-based conservation calculation pipeline (which is currently utilized by PrankWeb) was developed. While most of the information provided here is still relevant, specific commands may differ when the HMMER-based pipeline is used (particularly commands related to training a model). In any case, users wishing to train and use conservation-aware models are advised to check out the updated tutorial for training models utilizing the HMMER-based pipeline which may present more up-to-date information.
You should have p2rank 2.3 installed and have relevant datasets downloaded (described in the setup guide). All commands must be run from the directory created in the setup guide. You should also know how to train and evaluate a model without conservation.
p2rank can use conservation information from files produced by, for example, the sequence conservation pipeline.
An example of the structure of a conservation file follows (lines starting with # are treated as comments):
# /tmp/msa8880677536751254144.fasta -- js_divergence - window_size: 3 - background: blosum62 - seq. weighting: True - gap penalty: 1 - normalized: False
# align_column_number score column
0 -1000.00000 -KT-T-KTSTT----E-TTNKDT-D-K-NTT-EDTT-D-TSD--TS---------TTNS---TTTNNSKTTTT-TTK-DT---TTNTTT
1 0.32670 TILFL-ILILL----V-TLLLIVFILI-LII-LIIIIIITIIFVIV-MMMFFL-LLLIL-I-TLVLLIILITL-LFV-IL--ILLLVTI
2 0.43770 FFFLFFFFAFFF-V-L-FFVVVFLVVF-VFFFIVYFVVVFVVLLFV-FLYVVV-VFFVFYV-FFFYVAFFFVF-FFV-VFF-VFFVVFY
3 0.52723 VVVVVVVVMVVV-I-V-VVIVVVVVVVVIVVVVVVVVIIVVVVLVL-IIVVVV-VVVIVIV-VIVLIIVTVLV-VVV-VVV-VIVVLVV
4 0.58142 AAAAAAAAAAAAAA-A-AAAAAAAAAAAAAAAAAGAAAAAAAAAAA-GAAAAA-AAAAAAA-AAAAAAAAAAA-AAA-AAA-AAAAAAG
p2rank extracts only the position, the score, and the amino acid (AA) code. The list of AA codes on the i-th line corresponds to the i-th column of the MSA from which the conservation is computed, and the code taken is the first one in that list. In this example, p2rank uses the following values:
index | score | letter |
---|---|---|
1 | 0.32670 | T |
2 | 0.43770 | F |
3 | 0.52723 | V |
4 | 0.58142 | A |
As you can see, the first row, with score -1000.00000, is not used. p2rank still loads it and replaces the negative value with zero, but the row is then ignored because it corresponds to a gap, represented by '-' in the file above.
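The extraction rules described above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not P2Rank's actual parser: the function name `parse_conservation` is hypothetical, and taking the first character of the column string as the letter is an assumption based on the table above. The sample columns are shortened for readability.

```python
# Shortened sample in the same layout as the file above (columns truncated).
SAMPLE = """\
# align_column_number score column
0 -1000.00000 -KT-T-KTSTT
1 0.32670 TILFL-ILILL
2 0.43770 FFFLFFFFAFF
"""


def parse_conservation(text):
    """Return (index, score, letter) tuples for the non-gap rows.

    Assumption: the letter is the first character of the column string.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines carry no scores
        index, score, column = line.split()[:3]
        score = max(float(score), 0.0)  # negative scores are replaced with zero
        letter = column[0]
        if letter == "-":
            continue  # the row corresponds to a gap and is ignored
        rows.append((int(index), score, letter))
    return rows


for row in parse_conservation(SAMPLE):
    print(row)
```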
As you can see, the conservation file carries no information about the chain. For this reason, a separate conservation file must be provided for each chain, and the chain is encoded in the file name. For example, the file 1a0qH.pdb.seq.fasta.hom.gz corresponds to PDB record 1a0q and chain H.
p2rank comes with a pre-trained conservation-aware model. We can use the following command to evaluate this model on the coach420 dataset.
.\p2rank\prank.bat eval-predict .\datasets\coach420.ds -threads 4 -label default-conservation -c .\p2rank\config\conservation -conservation_dirs .\coach420\conservation\e5i1\scores
We use a custom label, default-conservation (-label default-conservation), to identify the result files. This is needed when multiple experiments are run, so that the results of one experiment do not overwrite the results of a previous one. The label becomes part of the name of the results directory. The -c argument specifies the JSON configuration file for the conservation. Finally, we need to provide the path to the directory with the computed conservation scores (-conservation_dirs).
The conservation path is relative to the dataset definition file.
You may also want to check ./p2rank/test_output/eval_predict_coach420_default-conservation/run.log for any conservation-related errors. The log file also includes information about loading the conservation:
[INFO] ConservationScore - Loading conservation scores from file [.\datasets\.\coach420\conservation\e5i1\scores\1afkAA.pdb.seq.fasta.hom.gz]
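The log line above shows the conservation directory joined onto the directory of the dataset file. A minimal sketch of that resolution, assuming a plain path join (the helper name `resolve_conservation_dir` is hypothetical and this is not P2Rank's code), looks like this:

```python
from pathlib import PureWindowsPath


def resolve_conservation_dir(dataset_file, conservation_dir):
    """Join a -conservation_dirs value onto the directory holding the dataset file.

    Assumption: the value is a relative path, as in the commands in this tutorial.
    """
    return PureWindowsPath(dataset_file).parent / conservation_dir


# Windows-style paths are used because the commands in this tutorial are.
print(resolve_conservation_dir(r".\datasets\coach420.ds",
                               r".\coach420\conservation\e5i1\scores"))
```

PureWindowsPath only manipulates the path text, so the sketch runs on any operating system and does not touch the file system.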
We use results from the Editing Model Training and Evaluation tutorial to estimate the impact of conservation on the results. Keep in mind that training is not a completely deterministic process, so your results may vary slightly.
DCA (4.0) | n | n + 2 |
---|---|---|
p2rank default | 71.6 | 77.1 |
our model | 70.5 | 76.5 |
p2rank default conservation | 73.4 | 78.1 |
As we can see, the performance is similar to the default model: the conservation-aware model gains 1.8 points for n and 1.0 points for n + 2. However, on visual inspection of the predicted pockets you often observe an improvement in their shape, as they seem to be more compact.
p2rank can load conservation files only from one directory. Because training uses both a training dataset (chen11) and an evaluation dataset (joined), we need to merge their conservation files into a single directory. This can be done in a few steps:
- Create a new directory conservation
- Copy the content of chen11/conservation/e5i1/scores into the conservation directory
- Copy the content of joined/conservation/e5i1/scores into the conservation directory
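The copy steps above can also be scripted. The following is a minimal sketch, not part of P2Rank: the function name `merge_conservation` is hypothetical, and the paths in the commented example call are assumptions that must be adjusted to where your datasets actually live.

```python
import shutil
from pathlib import Path


def merge_conservation(source_dirs, target_dir):
    """Copy every file from each source directory into target_dir.

    Returns the number of files copied. Files with the same name are
    overwritten by the later source directory.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # step 1: create the directory
    copied = 0
    for source in source_dirs:
        for path in sorted(Path(source).iterdir()):
            if path.is_file():
                shutil.copy2(path, target / path.name)  # steps 2 and 3: copy
                copied += 1
    return copied


# Example call (assumed locations, shown commented out on purpose):
# merge_conservation(
#     ["datasets/chen11/conservation/e5i1/scores",
#      "datasets/joined/conservation/e5i1/scores"],
#     "conservation",
# )
```

Since each chain has its own file named after the structure and chain, name collisions between the two datasets should only occur for proteins present in both.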
Now we are ready to train a new model. p2rank can be configured using a wide range of parameters and options. To make things easier, we are going to re-use the default conservation configuration (.\p2rank\config\conservation).
A new model can be trained using the following command:
.\p2rank\prank.bat traineval -t .\datasets\chen11.ds -e .\datasets\joined.ds -threads 4 -rf_trees 200 -delete_models 0 -loop 1 -seed 42 -c .\p2rank\config\conservation -label conservation -conservation_dirs .\..\conservation
The conservation path is relative to the dataset definition file. On i7-3632QM it takes about 30 minutes to finish.
Next, we can evaluate our newly trained model on the coach420 dataset using the following command:
.\p2rank\prank.bat eval-predict .\datasets\coach420.ds -threads 4 -label conservation -model .\p2rank\test_output\traineval_chen11_joined_conservation\runs\seed.42\FastRandomForest.model -c .\p2rank\config\conservation -conservation_dirs .\coach420\conservation\e5i1\scores
As you can see, the command is almost the same as the one for running the default conservation model; the only difference is that we specify a custom model file.
For my run, I got the following results:
DCA (4.0) | n | n + 2 |
---|---|---|
p2rank default | 71.6 | 77.1 |
our model | 70.5 | 76.5 |
p2rank default conservation | 73.4 | 78.1 |
our conservation | 72.8 | 76.7 |
P2Rank loads conservation files from the directory specified with the -conservation_dirs path/to/dir argument.
You can specify multiple conservation directories using the -conservation_dirs "(path/to/dir1, path/to/dir2)" syntax.
Alternatively, if no directory is set, P2Rank will look for the conservation files in the same directory where the structure files are located.
For each chain in each protein, P2Rank will look in all conservation directories for a conservation score file named {base_protein_file_name}(_){chain_code}.(***).hom(.gz).
Examples of valid conservation score file names for the PDB file 1a0q.pdb and chain H:
1a0q_H.hom
1a0q_H.hom.gz
1a0q_H.whatever.hom
1a0q_H.whatever.hom.gz
1a0qH.whatever.hom
1a0qH.hom
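One way to read the naming pattern above is as a regular expression. The sketch below is an interpretation of that pattern, not P2Rank's actual matching code, and the helper names are hypothetical. It assumes the underscore, the middle part, and the .gz suffix are all optional, as the parentheses in the pattern suggest.

```python
import re


def conservation_file_pattern(base_name, chain):
    """Build a regex for {base}(_){chain}.(***).hom(.gz) as read from the text."""
    return re.compile(
        rf"^{re.escape(base_name)}_?{re.escape(chain)}(\..+)?\.hom(\.gz)?$"
    )


def is_conservation_file(file_name, base_name, chain):
    """Return True if file_name fits the pattern for the given protein and chain."""
    return conservation_file_pattern(base_name, chain).match(file_name) is not None


# Check the example names listed above for 1a0q.pdb, chain H.
for name in ["1a0q_H.hom", "1a0q_H.hom.gz", "1a0q_H.whatever.hom",
             "1a0q_H.whatever.hom.gz", "1a0qH.whatever.hom", "1a0qH.hom"]:
    print(name, is_conservation_file(name, "1a0q", "H"))
```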