Please execute the following workflows depending on your preferences:
- installs the required conda environment if necessary
- generates a network
- optimizes the network and predicts missing molecular properties
- output contains the predictions and the graph in separate files
- installs the required conda environment if necessary
- predicts missing molecular properties using a specified machine learning script
- generates a network containing the formerly made predictions
- optimizes the network including the annotated predictions
- output contains the predictions with optimization suggestions and the graph in separate files
- simply execute the shell scripts provided in this directory
- for the usage of Network Balance Scaling as a predictor, please try and modify workflow_predict.sh
- for the usage of Network Balance Scaling to optimize existing predictions, please try and modify workflow_optimize_predictions.sh
- this workflow contains a precalculated model
- new models can be created with our inhouse predictor based on deepchem using the script predidictors/bace_predictor_train.sh
- to generate predictions with an existing model independent from the workflow, please try the script predictors/bace_predictor_predict.sh
- please note, that our workflows only include the BACE data set of moleculenet due to a huge RAM requirement for larger data sets of this benchmark
- if you are interested in the other data sets of this benchmark, feel free to construct your own pipeline and try our helper scripts in the utils directory
- you can find precalculated benchmark scenarios and the original CSV files from the moleculenet benchmark in the data directory
- note that all scripts should be startet from main directory in order of the path management
- to evaluate the results you achieved, please checkout our script utils/evaluate_RMSD.py
- python is needed to run the scripts
- if you use an anaconda environment:
- the workflows will automatically create and activate the required dependencies in extra environments
- if the activation fails, please add the following line insinde the workflows above the activate command:
- source ~/anaconda3/etc/profile.d/conda.sh
- or add this line to your .bashrc
- if you use another python environment, please make sure to provide the below listed dependencies
- python-igraph=0.8.3=py37h340e831_2
- cvxpy=1.1.10=py37h89c1867_0
- cvxpy-base=1.1.10=py37hdc94413_0
- rdkit=2020.09.4=py37he53b9e1_0
A script for creating networks with or without existing predictions at red vertices.
usage: python3 generate_network.py INDEX.csv MOLECULES.sdf PROPERTY [-o]
Create an MMS network in .GraphML format
positional arguments:
INDEX.csv Indexed output from mmpdb containing the MMS
relations.
INPUT.smi Input file in smiles format containing the molecules.
Row format: {SMILES\t ID\t PROPERTY}.
PROPERTY Property of interest, annotated in the input SD file.
optional arguments:
-h, --help show this help message and exit
-o OUTNAME, --outname OUTNAME
Specify name of output graph file.
-p PREDICTION.smi, --prediction PREDICTION.smi
Predictions in smiles format of INPUT.smi
A script for optimizing networks. It can be used to either predict unknwon properties or to optimize existing
predictions depending on the option -p
usage: python3 optimize_network.py FILE [-t, -s, -o, -l]
Optimizes an MMS network in GraphML format.
positional arguments:
FILE.GraphML Network to be optimized in GraphML format.
optional arguments:
-h, --help show this help message and exit
-t TRAFO_WEIGHT, --traformations TRAFO_WEIGHT
Decimal number, which indicates how strongly the
number of transformations at an edge influence the sum
of squares
-s STDDEV, --standarddeviation STDDEV
Decimal number, which indicates how strongly the
standard deviation of the transformation at an edge
influence the sum of squares
-o OUTDIR, --outdir OUTDIR
Output directory to store optimized network.
-l, --logarithmic If this action is chosen, the number of
transformations is considered logarithmically.
-f FORMAT, --format FORMAT
Type of result file containing the predictions.
Default is SMILES.
-p, --prediction Optimize existing predictions
The integrated predictor is based on DeepChem (https://github.com/deepchem/). Thus, the corresponding python scripts predictor/build_model.py and predictor/predictor.py additionally underly the MIT License. Our integrated predictor comprises:
- a trained graph convolution regression model for the BACE data set
- a trained message passing neural network model for the BACE data set
- the Python script predictor/build_model.py which can be used to create a model
- the Python script predictor/predictor.py which can be used to create predictions using an existing model
- the bash script predictor/bace_predictor_train.sh which trains a model based on the BACE data set
- the bash script predictor/bace_predictor_predict.sh which predicts properties for the BACE test set
- training, validation and test set for BACE can be found in our data
- for the other data sets, please use the scripts provided in utils to create your own data splits and convert files with respective formats
- deepchem=2.3.0=py_2
- rdkit=2020.09.2=py37h713bca6_0
- tensorflow=1.14.0=h4531e10_0
To incorporate existing predictions into your network, please mind the following:
- the resulting predictions need to be written to a SMILES file
- the SMILES file needs to follow the format SMILES\tID\tPROPERTY
- using workflow_optimize_predictions.sh you need to replace the call of our default predictor based on deepchem (./predictor/bace_predictor_predict.sh)
- please remember to adapt the training, validation and query set at the beginning of the bash script
- note: Please remember, that all molecules need to be included into the mmpdb index. Using the network generation with existing predictions independently from workflow_optimize_predictions requires the molecules with known properties (training and validation sets) in a first SMILES file, your predictions in a second SMILES file and the index including all (training, validation and query) molecules.
- the workflows are constructed to run with SD files as input
- the SD files are converted during the workflow using the scripts provided at utils/
- for directly using SMILES files, please note that the required format consists of three tab-separated columns:
- SMILES\tID\tPROPERTY
- the SMILES format is also a requirement for the input files used in mmpdb's MMP analysis