PyTorch Implementation of Neural Graph Fingerprint

forked from: https://github.com/XuhanLiu/NGFP

Required Packages:

PyTorch, RDKit, tqdm, NumPy

> conda create -n nfp_env # [optional]
> source activate nfp_env # [optional]
> conda install -c rdkit rdkit
> conda install -c yihui.ren torch_nfp

Applying NFP on canonical CSV files

The input CSV file must follow this convention:

It contains no header.
It has three columns: Dataset Name, Molecule Name, SMILES.
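
A minimal sketch of checking a file against this convention, using only the Python standard library. The helper name `read_canonical_csv` and the sample identifiers are hypothetical and not part of this repository.

```python
import csv
import io


def read_canonical_csv(handle):
    """Yield (dataset_name, molecule_name, smiles) from a header-less CSV.

    Hypothetical helper for illustration; not part of this repository.
    """
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
        dataset, molecule, smiles = (field.strip() for field in row)
        yield dataset, molecule, smiles


# Toy in-memory file; the identifiers below are made up for illustration.
sample = io.StringIO("DUD,mol_0001,CCO\nDUD,mol_0002,c1ccccc1\n")
records = list(read_canonical_csv(sample))
print(records)
```

A row with a different number of columns raises a ValueError that names the offending line, which makes malformed inputs easy to locate before running the generator.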

The code takes three required arguments: 1) input file, 2) output directory, 3) pretrained model

for example:

python generate_nfp.py \
    -i ./dataset/canonical_ANL/DUD_sample.csv `# input file`\
    -o ./output/DUD_sample/ `# output directory`\
    --model ./pretrained/MPro_mergedmulti_class.pkg \
    --chunk_size 100 `# small value for demo purpose`\
    --tqdm \
    --dataset_name DUD_sample # if not defined, will derive from input

If the number of molecules is larger than chunk_size, the output is a series of CSV files named <dataset_name>-<starting_id>-<ending_id>.csv. The range in each file name always spans a full chunk_size, even for the last chunk. Some datasets contain invalid SMILES. To handle them, a missing subdirectory is created, and its files follow the same naming convention as the output files. If there are invalid SMILES within the range [i*chunk_size, (i+1)*chunk_size), the number of SMILES in the corresponding output file plus the number in the corresponding missing file equals chunk_size.
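
The naming rule can be sketched as follows, assuming it works exactly as described above. The function `chunk_file_names` is a hypothetical illustration and may differ from what generate_nfp.py does internally.

```python
import math


def chunk_file_names(dataset_name, num_molecules, chunk_size):
    """Return one file name per chunk; every range spans a full chunk_size."""
    num_chunks = math.ceil(num_molecules / chunk_size)
    return [
        f"{dataset_name}-{i * chunk_size}-{(i + 1) * chunk_size}.csv"
        for i in range(num_chunks)
    ]


# 205 molecules with chunk_size 100 give three files; the last range is
# still written as 200-300 although it holds only 5 molecules.
names = chunk_file_names("DUD_sample", 205, 100)
print(names)
```

Under this rule the same list of names would apply to the files in the missing subdirectory.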

Running the example command, or bash scripts/run_gen_nfp.sh in the current folder, produces output like the following. Note that the third chunk (200-300) contains fewer molecules.

$ tree -h ./output/DUD_sample
./output/DUD_sample
├── [132K]  DUD_sample-0-100.csv
├── [132K]  DUD_sample-100-200.csv
├── [6.6K]  DUD_sample-200-300.csv
└── [4.0K]  missing
    ├── [   0]  DUD_sample-0-100.csv
    ├── [   0]  DUD_sample-100-200.csv
    └── [   0]  DUD_sample-200-300.csv

Now, we can run the similarity search based on the generated NFPs of this sampled data.

python compute_tanimoto.py -i ../output/DUD_sample/ --query "CCCCCC1=CC(O)=C(C\C=C(/C)CCC=C(C)C)C(O)=C1" --top_k 20

The output will look like the following.

among total 205 molecules
top-20 similar smiles to C#CCOc3nc(c1ccccc1)nc4sc2CCCc2c34
| smiles                                        |    score |
|-----------------------------------------------|----------|
| C#CCOc3nc(c1ccccc1)nc4sc2CCCc2c34             | 1        |
| C#CCOc3nc(c1ccccc1)nc4sc2CCCCc2c34            | 0.953463 |
| Nc4nc2nn(CCc1ccccc1)cc2c5nc(c3ccco3)nn45      | 0.822863 |
| Nc3nc(c1ccco1)cc(c2cccs2)c3C#N                | 0.81952  |
| Nc4nc2nn(Cc1ccc(F)cc1)cc2c5nc(c3ccco3)nn45    | 0.812759 |
| Nc4nc2nn(CCCc1ccccc1)cc2c5nc(c3ccco3)nn45     | 0.804355 |
| Nc3nc(NCCc1ccc(O)cc1)cc4nc(c2ccco2)nn34       | 0.790788 |
| CC(C)c4ccc3Cc2c(c1ccc(Br)o1)nc(N)nc2c3c4      | 0.784824 |
| S=c4sc2c(ncn3nc(c1ccco1)nc23)n4CCc5ccccc5     | 0.783747 |
| Nc5nc(c1ccccc1)c4c(=O)c3cccc(CN2CCCC2)c3c4n5  | 0.781628 |
| CC(C)CCn4cc2c(nc(N)n3nc(c1ccco1)nc23)n4       | 0.77557  |
| Nc4nc(c1ccccc1)c3c(=O)c2ccccc2c3n4            | 0.774032 |
| Cc1ccccc1CNC(=O)c3cc(c2ccco2)nc(N)n3          | 0.773349 |
| Nc4nc2nn(Cc1ccc(F)cc1)nc2c5nc(c3ccco3)nn45    | 0.763374 |
| Nc3nc(NCCc1ccc(O)cc1)nc4nc(c2ccco2)nn34       | 0.762838 |
| OC[C@@H]1CCCN1c4nc(c2nccs2)c3sccc3n4          | 0.761923 |
| Nc3nc(c1ccco1)cc(c2ccco2)c3C#N                | 0.760984 |
| Nc3nc(C(=O)NCCc1ccc(O)cc1)cn4nc(c2ccco2)nc34  | 0.760341 |
| Nc5nc(c1ccccc1)c4c(=O)c3cccc(CN2CCOCC2)c3c4n5 | 0.755121 |
| Nc3nc(C(=O)NCc1ccccc1)cn4nc(c2ccco2)nc34      | 0.754376 |

Using NFP for similarity measure

ρ: Spearman rank correlation between score similarity and fingerprint similarity.
mean score: average score of the top-k most similar fingerprints.
recall rate: fraction of the top-k most similar fingerprints that are also in the top-k scores.
For all three metrics, higher is better.
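
The three metrics can be sketched in plain Python as below. This is an illustration under stated assumptions, not the evaluation code behind the tables: all function names are hypothetical, molecule 0 is treated as the query, and "score similarity" is taken to be the negative absolute score difference to the query.

```python
def top_k_indices(values, k):
    """Indices of the k largest values (ties broken by original order)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    return order[:k]


def mean_score(similarity, scores, k):
    """Average score of the k molecules most similar to the query."""
    return sum(scores[i] for i in top_k_indices(similarity, k)) / k


def recall_rate(similarity, scores, k):
    """Fraction of the top-k by similarity that are also top-k by score."""
    by_sim = set(top_k_indices(similarity, k))
    by_score = set(top_k_indices(scores, k))
    return len(by_sim & by_score) / k


def average_ranks(values):
    """Ranks starting at 0; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data: fingerprint similarity to the query and one score per molecule.
fp_similarity = [0.9, 0.8, 0.4, 0.3, 0.1]
scores = [7.0, 9.0, 8.0, 2.0, 1.0]
# Assumption: score similarity is the negative absolute difference to molecule 0.
score_similarity = [-abs(s - scores[0]) for s in scores]

avg = mean_score(fp_similarity, scores, k=2)
rcl = recall_rate(fp_similarity, scores, k=2)
rho = spearman(fp_similarity, score_similarity)
print(avg, rcl, rho)
```

Tied values receive the mean of their ranks, which is why fractional ranks can appear when many fingerprints share the same similarity.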

Results on 10k sampled SMILES, using a trained NFP model, for the 6vww protein over 23 pockets.

pocket	nfp mean score	cfp mean score	nfp recall	cfp recall	nfp ρ	cfp ρ
6vww_pocket1	8.30084	5.21324	0.18	0.02	0.383276	-0.19282
6vww_pocket100	11.0752	8.48871	0.29	0.01	0.419817	-0.071672
6vww_pocket108	7.12053	5.40831	0.08	0.03	0.433887	-0.0220411
6vww_pocket11	9.11185	7.79543	0.13	0.06	0.526309	-0.103178
6vww_pocket13	8.81503	6.93905	0.08	0.02	0.202461	-0.081619
6vww_pocket130	7.07989	5.35894	0.1	0.02	0.485588	-0.154715
6vww_pocket135	4.69959	1.67261	0.27	0.02	0.653972	-0.0853283
6vww_pocket143	7.00099	5.22176	0.08	0.02	0.411739	-0.0453596
6vww_pocket154	5.75149	3.22001	0.17	0.01	0.667301	-0.206292
6vww_pocket156	5.37498	2.7611	0.12	0.01	0.554091	-0.288988
6vww_pocket157	7.9011	3.77018	0.17	0.02	0.512664	-0.279959
6vww_pocket17	9.30961	6.59216	0.12	0.03	0.458592	-0.0679879
6vww_pocket18	8.67098	7.42478	0.03	0.04	0.382344	-0.103151
6vww_pocket22	8.05067	6.17218	0.14	0.02	0.510678	-0.141086
6vww_pocket23	7.45153	6.13385	0.05	0.02	0.504353	-0.101424
6vww_pocket3	7.46564	4.3675	0.09	0.01	0.288429	-0.278291
6vww_pocket37	7.39818	5.11127	0.1	0.01	0.625352	-0.195684
6vww_pocket57	10.5995	8.54198	0.08	0.02	0.45416	-0.0771552
6vww_pocket6	9.9263	8.06751	0.1	0.02	0.385074	-0.128674
6vww_pocket62	11.2105	8.78469	0.28	0.01	0.469667	-0.0329453
6vww_pocket71	6.80595	5.25062	0.04	0.02	0.265327	-0.091839
6vww_pocket8	8.37131	5.35022	0.18	0.03	0.39115	-0.179408
6vww_pocket9	8.12616	6.84118	0.04	0.01	0.28962	-0.0810846

averages	nfp	cfp
avg top-100 score mean	8.07034	5.84727
avg top-100 score recall	0.126957	0.0208696
avg spearman corr.	0.446776	-0.1309

DEMO on the last target (6vww_pocket8)

Spearman correlation

The demo shows the top-20 most similar SMILES strings (by absolute score difference). ρ NFP: 0.3911501585258828, ρ CFP: -0.17940844415167148

smiles	nfp	rank	cfp	rank	Δscr
Cc1[nH]c2ccccc2c1C(c1cccnc1)N1CCCCC1	1	0	1	0	0
O=C(CSc1n[nH]c(=O)n1CCc1ccccc1)c1ccc(O)cc1O	0.6141	2409	0.3478	4739.5	0.2069
NC(=O)c1ccc(CNc2ccc3nnc(-c4ccccc4F)n3n2)cc1	0.5754	4154	0.3777	2527.5	0.5468
O=c1[nH]c2ccc(Nc3ncnc4sc5c(c34)CCC5)cc2[nH]1	0.4957	8081	0.3333	5833.5	0.6809
Cc1cc(C)c(-c2n[nH]c3c2C(c2ccccc2F)N(CCO)C3=O)c(O)c1	0.7577	67	0.3655	3386	0.7164
O=C(NCCc1nc(-c2ccccc2)n[nH]1)c1cc[nH]n1	0.5051	7697	0.3033	7838.5	0.7834
CCc1cccc2c(C(=O)N3CCCC(n4cccn4)C3)c[nH]c12	0.7988	12	0.4137	799.5	0.8133
CCC(CNC(=O)Cn1ccc(=O)[nH]c1=O)N1CCc2ccccc2C1	0.6257	1965	0.3777	2527.5	0.8340
CC(c1cccs1)N(C)C(=O)c1ccc2[nH]nnc2c1	0.6943	432	0.3139	7190.5	0.8513
Cn1nc(CCNC(=O)N[C@H]2CCC@HCC2)c2ccccc21	0.5884	3502	0.3370	5528.5	0.9264
NC(=O)CC1(O)CCCN(C(=O)c2c[nH]c(C(F)(F)F)c2)C1	0.7615	63	0.3780	2493	0.9590
c1ccc(-c2n[nH]cc2CNC2CCCN(c3cccnn3)C2)cc1	0.6913	474	0.4235	537.5	1.0342
Cc1cc(NC(=O)NCc2cccnc2)ccn1	0.4940	8144	0.3658	3345	1.0489
O=C(NCc1n[nH]c(=O)[nH]1)C1(c2ccc(Cl)cc2)CC1	0.6703	752	0.2592	9407	1.0552
COc1ccc2[nH]c(SC(C)c3nc4ccccc4c(=O)[nH]3)nc2c1	0.6065	2709	0.3777	2527.5	1.0637
CCc1oc2ccccc2c1CC(=O)N1CCC(c2nnc[nH]2)C1	0.7710	48	0.4069	1034.5	1.0671
O=C(Cn1[nH]cc2c(=O)ncnc1-2)NC1(c2ccccc2)CCC1	0.7725	45	0.3333	5833.5	1.0963
Cc1nc(Nc2ccc(C(N)=O)cc2)c2c(-c3ccccc3)csc2n1	0.6139	2425	0.3563	4064.5	1.1502
CC(C)N(Cc1nc2c(cnn2C)c(=O)[nH]1)Cc1cccs1	0.6321	1730	0.2333	9753.5	1.1806
Cc1ncc(COP(=O)(O)O)c(CNc2co[nH]c2=O)c1O	0.5240	6853	0.3452	4890	1.3226

Spearman rank correlation (ρ)

nfp avg: 0.44677611521262073, cfp avg: -0.13090014039680933

pocket	nfp ρ	cfp ρ
6vww_pocket108	0.433887	-0.0220411
6vww_pocket9	0.28962	-0.0810846
6vww_pocket100	0.419817	-0.071672
6vww_pocket22	0.510678	-0.141086
6vww_pocket154	0.667301	-0.206292
6vww_pocket18	0.382344	-0.103151
6vww_pocket57	0.45416	-0.0771552
6vww_pocket143	0.411739	-0.0453596
6vww_pocket3	0.288429	-0.278291
6vww_pocket157	0.512664	-0.279959
6vww_pocket37	0.625352	-0.195684
6vww_pocket1	0.383276	-0.19282
6vww_pocket13	0.202461	-0.081619
6vww_pocket6	0.385074	-0.128674
6vww_pocket17	0.458592	-0.0679879
6vww_pocket130	0.485588	-0.154715
6vww_pocket23	0.504353	-0.101424
6vww_pocket71	0.265327	-0.091839
6vww_pocket62	0.469667	-0.0329453
6vww_pocket135	0.653972	-0.0853283
6vww_pocket11	0.526309	-0.103178
6vww_pocket156	0.554091	-0.288988
6vww_pocket8	0.39115	-0.179408

MultiTask vs SingleTask across various subsamples

The subsets of data are randomly selected with sizes of 5k, 10k and 20k. All experiments use the same fixed selections, and the train, test, validation split is fixed. Results are averaged over 3 runs.

settings	avg. mse
single5k	0.844739
single10k	0.692700
single20k	0.660374
multi5k	0.839220
multi10k	0.737363
multi20k	0.709497

Examples:

Generate Neural Fingerprint (NFP) using a trained model

python examples/generate_nfp.py --datafile <datafile.smi> \
                                --model <saved_trained_model> \
                                --output <output_nfp.npy> 
python examples/generate_nfp.py --datafile ./dataset/zinc/zinc_sample.smi \
                                --model ./output/best_efficacy.pkl.pkg \
                                --output ./output/example_nfp_output.npy

Each line in <datafile.smi> contains a SMILES string and additional information. We assume that the columns are space (or tab) separated and that the first column holds the SMILES string. If that is not the case, pass the delimiter and the column index of the SMILES string, for example --delimiter "," --column_index 2. (See the function line_parser() in generate_nfp.py for more details.)
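
The parsing behaviour described above might look roughly like the sketch below. `parse_smiles_line` is a hypothetical stand-in, not the repository's line_parser(), and treating the column index as zero-based is an assumption of this sketch; check generate_nfp.py for the actual convention.

```python
def parse_smiles_line(line, delimiter=None, column_index=0):
    """Return the SMILES field of one input line.

    Hypothetical stand-in for line_parser(). The zero-based column_index
    is an assumption. delimiter=None splits on any run of spaces or tabs.
    """
    fields = line.strip().split(delimiter)
    return fields[column_index].strip()


# Default: whitespace-separated, SMILES in the first column.
# The identifiers in these sample lines are made up.
first = parse_smiles_line("CCO ZINC_example_id\n")

# Comma-separated, SMILES in the third column (index 2 if zero-based).
third = parse_smiles_line("id42,ethanol,CCO\n", delimiter=",", column_index=2)
print(first, third)
```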

Unlike Morgan fingerprints, which are bit vectors, the NFP is a vector of non-negative real values. The length of the NFP is set by the trained model's hidden dimension (128 in the example). To change the NFP length, one needs to redefine the NFP network and retrain the model. (See reproduce_main_results.py for more details.)

Compute continuous Tanimoto Similarity

> python examples/compute_tanimoto.py -h

Compute the continuous Tanimoto similarity, defined in the NFP paper:

$\sum_i \min(X_i, Y_i) / \sum_i \max(X_i, Y_i)$

The function tanimoto_similarity(x,y) is defined in NeuralGraph/util.py. It takes two variables x and y: x must be a single fingerprint of length L, and y can be either one fingerprint (L,) or an array of M fingerprints, (M,L).
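
A minimal NumPy sketch of the formula above, written to accept the same shapes. It is an illustration only and not the code in NeuralGraph/util.py; the name `continuous_tanimoto` is hypothetical, and the inputs are assumed to be non-negative and not all zero.

```python
import numpy as np


def continuous_tanimoto(x, y):
    """sum_i min(x_i, y_i) / sum_i max(x_i, y_i) along the last axis.

    x has shape (L,); y has shape (L,) or (M, L). Broadcasting compares
    x against every row of y. Assumes non-negative, not-all-zero inputs.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    numerator = np.minimum(x, y).sum(axis=-1)
    denominator = np.maximum(x, y).sum(axis=-1)
    return numerator / denominator


x = np.array([1.0, 0.0, 2.0])
ys = np.array([[1.0, 0.0, 2.0],
               [0.0, 1.0, 2.0]])

self_similarity = continuous_tanimoto(x, x)   # scalar, identical vectors
batch_similarity = continuous_tanimoto(x, ys)  # one value per row of ys
print(self_similarity, batch_similarity)
```

For the second row, the element-wise minima sum to 2 and the maxima sum to 4, so the similarity is 0.5.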

Here is an example output from the fingerprints we generated in the previous example.

> python examples/compute_tanimoto.py --datafile dataset/zinc/zinc_sample.smi \
                                      --fingerprint output/example_nfp_output.npy \
                                      --top_k 10 --anchor_smile_idx 15

top-10 similar smiles to C[C@H](CCc1ccccc1)[NH2+]C[C@H](c2ccc(c(c2)C(=O)N)[O-])O
smiles                                                       score
--------------------------------------------------------  --------
C[C@H](CCc1ccccc1)[NH2+]C[C@@H](c2ccc(c(c2)C(=O)N)[O-])O  1
COc1ccc(c(c1)[O-])C(=O)NC[C@@H]2c3ccccc3CCO2              0.692871
COc1ccc(c(c1)[O-])C(=O)NC[C@H]2c3ccccc3CCO2               0.692871
Cc1cc(nc2c1cccc2)N3C[C@@H](O[C@H](C3)C)C                  0.6875
Cc1cc(nc2c1cccc2)N3C[C@H](O[C@@H](C3)C)C                  0.6875
c1ccc2c(c1)CC(=C2)N3CCN(CC3)c4cccc(c4)C(F)(F)F            0.644531
Cc1cc(ccc1OC)CCCC(=O)N/N=C/c2ccccc2[O-]                   0.64209
COc1ccc(c(c1)[O-])C(=O)Cc2cnn(c2)c3ccccc3                 0.634766
CCC(=O)c1ccc2c(c1)N(c3ccccc3S2)C[C@H](C)N(C)C             0.634766
c1ccc(cc1)COCc2cnc([nH]c2=O)[O-]                          0.633789

Calculate similarity between two SMILES strings

smile_similarity.py takes two SMILES strings, computes their fingerprints, and calculates their similarity. Two fingerprinting methods are implemented: "morgan" and "nfp" (neural fingerprint). If a model pkg file is not provided, "nfp" uses large random weights, as described in the original paper. The similarity is defined as one minus the continuous Tanimoto distance.

Here is an example:

#!/bin/bash
s1="C1OC(O)C(O)C(O)C1O"
s2="CC(C)=CCCC(C)=CC(=O)"
python smile_similarity.py "$s1" "$s2" -m morgan
python smile_similarity.py "$s1" "$s2" -m nfp
python smile_similarity.py "$s1" "$s2" -m nfp --model './output/best_delaney.pkl.pkg'

Reproduce results in the original paper

Measured in mean squared error (lower is better).

Dataset	Solubility	Drug Efficacy	Photovoltaic
This repo (NFP+MLP)	0.34(0.02)	1.07(0.10)	1.08(0.06)
NGF Paper	0.52(0.07)	1.16(0.03)	1.43(0.09)
This repo (CFP+MLP)	1.35(0.18)	1.13(0.03)	1.84(0.10)
NGF Paper	1.40(0.13)	1.36(0.10)	2.00(0.09)

To reproduce these results:

python reproduce_main_results.py <experiment_name> <method_name>

where <experiment_name> is one of "solubility", "drug_efficacy", "photovoltaic", and <method_name> is either "morgan" or "nfp".

Convolutional Neural Graph Fingerprint

PyTorch-based Neural Graph Fingerprint for Organic Molecule Representations

This repository is an implementation of Convolutional Networks on Graphs for Learning Molecular Fingerprints in PyTorch.

It includes a preprocessing function that converts molecules from SMILES representation into molecule tensors.

Related work

There are several implementations of this paper publicly available:

by HIPS using autograd
by debbiemarkslab using theano
by GUR9000 using keras
by ericmjl using autograd
by DeepChem using tensorflow
by keiserlab using Keras

The closest implementations are those by GUR9000 and keiserlab in Keras. However, this repository represents molecules in a fundamentally different way. The consequences are described in the sections below.

Molecule Representation

Atom, bond and edge tensors

This codebase uses tensor matrices to represent molecules. Each molecule is described by a combination of the following three tensors:

atom matrix, size: (max_atoms, num_atom_features). This matrix defines the atom features.

Each row in the atom matrix is the feature vector of the atom at that row index.

edge matrix, size: (max_atoms, max_degree). This matrix defines the connectivity between atoms.

Each row in the edge matrix lists the neighbours of one atom. Each neighbour is encoded as an integer: the index of its feature vector in the atom matrix.

As atoms can have a variable number of neighbours, not every entry in a row holds a neighbour index. The unused entries are filled with the masking value -1. (This explicit masking value is important for the layers to work.)

bond tensor, size: (max_atoms, max_degree, num_bond_features). This tensor defines the bond features.

The first two dimensions of this tensor mirror the edge matrix: the vector at position (i, j) in the bond tensor holds the features of the bond between atom i and the neighbour stored at position (i, j) in the edge matrix.

Bonds that are unused are masked with 0 vectors.
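
To make the layout concrete, here is a small NumPy sketch that fills the three tensors by hand for a toy three-atom graph. The features are made up (a two-element one-hot for the atom type and a single bond feature) and are far simpler than what the preprocessing code extracts with RDKit.

```python
import numpy as np

max_atoms, max_degree = 4, 2
num_atom_features, num_bond_features = 2, 1

# Toy heavy-atom graph of ethanol, C0-C1-O2, with made-up features.
elements = ["C", "C", "O"]
neighbours = {0: [1], 1: [0, 2], 2: [1]}

atoms = np.zeros((max_atoms, num_atom_features))
edges = np.full((max_atoms, max_degree), -1, dtype=int)  # -1 marks "no neighbour"
bonds = np.zeros((max_atoms, max_degree, num_bond_features))

for i, element in enumerate(elements):
    atoms[i] = [element == "C", element == "O"]  # one-hot element type
    for slot, j in enumerate(neighbours[i]):
        edges[i, slot] = j       # index of the neighbour's entry in `atoms`
        bonds[i, slot] = [1.0]   # every bond in this toy graph is a single bond

print(edges)
```

The fourth entry along the atom axis stays fully masked (zeros in atoms and bonds, -1 in edges) because the toy molecule has only three atoms while max_atoms is 4.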

Batch representations

The code processes molecules in batches. An extra dimension is added as the first axis of each of the three tensors. Their respective sizes become:

atom matrix, size: (num_molecules, max_atoms, num_atom_features)
edge matrix, size: (num_molecules, max_atoms, max_degree)
bond tensor, size: (num_molecules, max_atoms, max_degree, num_bond_features)

As molecules have different numbers of atoms, max_atoms needs to be defined for the entire dataset. Unused atom entries are masked by 0 vectors.
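
Padding and stacking can be sketched with NumPy as below. The helpers `pad_molecule` and `toy_molecule` are hypothetical, and the toy tensors carry placeholder values rather than real chemistry; the repository's preprocessing may organise this step differently.

```python
import numpy as np


def pad_molecule(atoms, edges, bonds, max_atoms):
    """Pad one molecule's tensors along the atom axis up to max_atoms."""
    extra = max_atoms - atoms.shape[0]
    atoms = np.pad(atoms, ((0, extra), (0, 0)), mode="constant")
    edges = np.pad(edges, ((0, extra), (0, 0)), mode="constant",
                   constant_values=-1)  # masked neighbour slots
    bonds = np.pad(bonds, ((0, extra), (0, 0), (0, 0)), mode="constant")
    return atoms, edges, bonds


def toy_molecule(num_atoms, max_degree=2, num_atom_features=2,
                 num_bond_features=1):
    """Unpadded placeholder tensors (values are not real chemistry)."""
    atoms = np.ones((num_atoms, num_atom_features))
    edges = np.full((num_atoms, max_degree), -1, dtype=int)
    bonds = np.zeros((num_atoms, max_degree, num_bond_features))
    return atoms, edges, bonds


molecules = [toy_molecule(2), toy_molecule(3)]
max_atoms = max(m[0].shape[0] for m in molecules)  # defined over the whole set
padded = [pad_molecule(*m, max_atoms) for m in molecules]

# Stack along a new first axis: one slice per molecule.
batch_atoms, batch_edges, batch_bonds = (np.stack(t) for t in zip(*padded))
print(batch_atoms.shape, batch_edges.shape, batch_bonds.shape)
```

The padded atom entry of the two-atom molecule is all zeros, and its edge entry is all -1, matching the masking values described above.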

Dependencies

RDKit: needed to convert molecules into tensor representations. Once this step is done, the new data can be stored, and RDKit is no longer a dependency.
PyTorch: requires PyTorch >= 1.0
NumPy: requires NumPy >= 0.19
Pandas: optional, for the examples

Acknowledgements

Implementation is based on Duvenaud et al., 2015.
Feature extraction scripts were implemented from the original implementation
Data preprocessing scripts were rewritten from keiserlab
Graphpool layer adopted from Han, et al., 2016
