The chemical space relevant to drug discovery is vast and discrete. Screening this space for molecules that satisfy biological and pharmacokinetic requirements such as stability, solubility, efficacy, affinity, and permeability is a highly complex multi-objective optimization problem. Our Transformer model, with a modified encoder architecture, is well suited for translating the information contained in high-throughput biological data into instances in this chemical space.
- We show that attention-based sequential prediction performs better and converges faster by attending to both the previously predicted outputs and the encoded gene expression signature (a minimal sketch follows this list).
- Moreover, the model automatically learns structural and chemical characteristics during training, as is evident from visually inspecting the common scaffolds shared by the generated and the actual compounds.
- By incorporating biological information in the form of altered gene expression, our model outperforms other deep-learning-based molecular generators in terms of validity, uniqueness, and metrics such as Synthetic Accessibility score and Tanimoto similarity to the known compound.
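As a rough sketch (not the exact architecture used in this work), the TensorFlow snippet below illustrates the idea behind the first point: a decoder block that attends both to previously generated SMILES tokens (causal self-attention) and to an encoded gene expression signature (cross-attention). All names and sizes here (`GeneEncoder`, `DecoderBlock`, `d_model`, the 978-gene input) are hypothetical placeholders.

```python
import tensorflow as tf

d_model, num_heads = 256, 8  # hypothetical sizes

# Hypothetical encoder: projects a gene expression vector into a short
# sequence of d_model-dimensional "memory" tokens for the decoder.
class GeneEncoder(tf.keras.layers.Layer):
    def __init__(self, num_tokens=16):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = tf.keras.layers.Dense(num_tokens * d_model, activation="relu")

    def call(self, expr):                      # expr: (batch, num_genes)
        x = self.proj(expr)                    # (batch, num_tokens * d_model)
        return tf.reshape(x, (-1, self.num_tokens, d_model))

# One decoder block: causal self-attention over the SMILES tokens
# generated so far, then cross-attention over the encoded signature.
class DecoderBlock(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.cross_attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(4 * d_model, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

    def call(self, tokens, memory):
        # use_causal_mask requires TF >= 2.10; older versions need an explicit mask.
        x = self.norm1(tokens + self.self_attn(tokens, tokens, use_causal_mask=True))
        x = self.norm2(x + self.cross_attn(x, memory))
        return self.norm3(x + self.ffn(x))

# Smoke test with random data.
expr = tf.random.normal((2, 978))             # e.g. an L1000-style signature
tokens = tf.random.normal((2, 30, d_model))   # embedded SMILES prefix
out = DecoderBlock()(tokens, GeneEncoder()(expr))
print(out.shape)                              # (2, 30, 256)
```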
Altogether, our method can not only help accelerate the early stages of drug discovery but can also aid in drug repurposing. This work, done under the guidance of Prof. Manikandan Narayanan, was accepted as a poster in the 'ML for Computational Biology' track at ISMB 2022.
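For reference, the evaluation metrics above can be computed with RDKit roughly as in the sketch below. The SMILES strings are hypothetical stand-ins for model outputs; the Synthetic Accessibility score (not shown) can be computed with the `sascorer` module from RDKit's contrib directory.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

generated = ["CCO", "c1ccccc1O", "not_a_smiles"]  # hypothetical model outputs
reference = Chem.MolFromSmiles("c1ccccc1O")       # hypothetical known compound

# Validity: fraction of generated SMILES that RDKit can parse.
mols = [Chem.MolFromSmiles(s) for s in generated]
valid = [m for m in mols if m is not None]
validity = len(valid) / len(generated)

# Uniqueness: fraction of distinct canonical SMILES among the valid ones.
canonical = {Chem.MolToSmiles(m) for m in valid}
uniqueness = len(canonical) / len(valid)

# Tanimoto similarity to the known compound, on Morgan fingerprints.
ref_fp = AllChem.GetMorganFingerprintAsBitVect(reference, 2, nBits=2048)
sims = [DataStructs.TanimotoSimilarity(
            ref_fp, AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048))
        for m in valid]

print(f"validity={validity:.2f} uniqueness={uniqueness:.2f} "
      f"max Tanimoto={max(sims):.2f}")
```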
- RDKit (see the RDKit installation guide)
- Python 3.6+
- TensorFlow 2.1+
- Numba 0.52
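One possible environment setup (assuming conda, which is RDKit's recommended install route; adjust versions to your platform):

```bash
conda install -c conda-forge rdkit
pip install "tensorflow>=2.1" "numba==0.52"
```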
A single Jupyter notebook, `Modified_Transformer.ipynb`, downloads the dataset and the evaluation toolkit (RDKit), then builds, trains, and evaluates the Transformer model. Its parameters can be easily modified, and the whole setup can be ported to run on a public cloud such as GCP or AWS, or on Google Colab.
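The actual parameter names and defaults are defined inside the notebook; as a hypothetical illustration, the tunable settings for a model like this typically look like a small dictionary near the top:

```python
# Hypothetical hyperparameters; the real names and values live in
# Modified_Transformer.ipynb.
params = {
    "num_layers": 4,    # decoder blocks
    "d_model": 256,     # embedding / attention width
    "num_heads": 8,     # attention heads
    "dropout": 0.1,
    "batch_size": 64,
    "epochs": 20,
}
```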