CardioGenAI: A Machine Learning-Based Framework for Re-Engineering Drugs for Reduced hERG Liability

batistagroup/CardioGenAI

 
 

Code style: black. License: MIT.

Summary

The link between in vitro hERG ion channel inhibition and subsequent in vivo QT interval prolongation, a critical risk factor for the development of arrhythmias such as Torsade de Pointes, is so well established that in vitro hERG activity alone is often sufficient to end the development of an otherwise promising drug candidate. It is therefore of tremendous interest to develop advanced methods to identify hERG-active compounds in the early stages of drug development, as well as to redesign compounds for reduced hERG liability while maintaining their on-target potency. In this work, we present CardioGenAI, a machine learning-based framework for re-engineering both developmental and commercially available drugs for reduced hERG activity while preserving their pharmacological activity. The framework incorporates novel state-of-the-art discriminative models for predicting hERG channel activity, as well as activity against the voltage-gated NaV1.5 and CaV1.2 channels, because these channels can modulate the arrhythmogenic potential induced by hERG channel blockade. We applied the complete framework to pimozide, an FDA-approved antipsychotic agent that demonstrates high affinity to the hERG channel, and generated 100 refined candidates. Remarkably, among the candidates is fluspirilene, a compound of the same drug class (diphenylmethanes) as pimozide that therefore has similar pharmacological activity, yet exhibits over 700-fold weaker binding to hERG. We envision that this method can be applied to developmental compounds exhibiting hERG liabilities to provide a means of rescuing drug development programs that have stalled due to hERG-related safety concerns. The discriminative models can also serve independently as effective components of a virtual screening pipeline. We have made all of our software open-source to facilitate integration of the CardioGenAI framework for molecular hypothesis generation into drug discovery workflows.

High-Level Technical Overview of the Framework

The CardioGenAI framework combines generative and discriminative ML models to re-engineer hERG-active compounds for reduced hERG channel inhibition while preserving their pharmacological activity.

An autoregressive transformer is trained on a dataset that we previously curated, containing approximately 5 million unique and valid SMILES strings derived from the ChEMBL 33, GuacaMol v1, MOSES, and BindingDB datasets. The model is trained autoregressively, receiving a sequence of SMILES tokens as context as well as the corresponding molecular scaffold and physicochemical properties, and iteratively predicting each subsequent token in the sequence. Once trained, this model is able to generate valid molecules conditioned on a specified molecular scaffold along with a set of physicochemical properties. For an input hERG-active compound, the generation is conditioned on the scaffold and physicochemical properties of this compound.

Each generated compound is subject to filtering based on activity against the hERG, NaV1.5 and CaV1.2 channels. Depending on the desired activity against each channel, the framework employs either classification models to include predicted non-blockers (i.e., pIC50 value < 5.0) or regression models to include compounds within a specified range of predicted pIC50 values. Both the classification and regression models utilize the same architecture, and are trained using three feature representations of each molecule: a feature vector that is extracted from a bidirectional transformer trained on SMILES strings, a molecular fingerprint, and a graph.

For each molecule in the filtered generated ensemble and the input hERG-active molecule, a feature vector is constructed from the 209 chemical descriptors available through the RDKit Descriptors module. The redundant descriptors are then removed according to pairwise mutual information calculated for every possible pair of descriptors. Cosine similarity is then calculated between the processed descriptor vector of the input molecule and the descriptor vectors of every generated molecule to identify the molecules most chemically similar to the input molecule but with desired activity against each of the cardiac ion channels.
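
The final ranking step can be illustrated with a minimal, standard-library sketch. It is not the repository's implementation: the function names (cosine_similarity, rank_by_similarity) and the three-element toy vectors are assumptions made for illustration, and a real run would use the retained RDKit descriptor values instead. Whether descriptors are rescaled before the comparison is not stated in this overview, so the sketch compares the vectors as given.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(reference, candidates):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(reference, vector))
              for name, vector in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy descriptor vectors (hypothetical values, not real RDKit descriptors).
reference = [1.0, 2.0, 3.0]
candidates = {
    "cand_a": [2.0, 4.0, 6.0],   # parallel to the reference vector
    "cand_b": [3.0, 2.0, 1.0],
    "cand_c": [-1.0, 0.0, 1.0],
}
ranking = rank_by_similarity(reference, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because cosine similarity depends only on the direction of the vectors, cand_a scores highest here even though its magnitudes are twice those of the reference.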

Installation and Setup

Follow these instructions to install and set up CardioGenAI on your local machine:

Cloning the Repository

Clone the CardioGenAI repository to your local environment using the following command:

git clone https://github.com/gregory-kyro/CardioGenAI.git

After cloning, navigate to the CardioGenAI project directory:

cd CardioGenAI

Setting Up the Conda Environment

Create a Conda environment from the environment.yml file provided in the repository, which contains all of the necessary dependencies:

conda env create -f environment.yml

Activate the newly created environment:

conda activate cardiogenai_env

Downloading Necessary Files

Some essential files are not hosted directly in the GitHub repository due to their sizes. Please download the following files from the provided Google Drive links:

After downloading, place these files in the specified directories within the CardioGenAI project:

  • Autoregressive_Transformer_parameters.pt → model_parameters/transformer_model_parameters
  • prepared_transformer_data.csv → data/prepared_transformer_datasets
  • raw_transformer_data.csv → data/raw_transformer_datasets
  • train_hERG.h5 → data/prepared_cardiac_datasets/
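
Before opening the notebook, it can help to confirm that the downloaded files ended up in the right places. The following is a small sketch using only the Python standard library; the helper name missing_files is hypothetical and not part of CardioGenAI, and the file-to-directory mapping is copied from the list above.

```python
from pathlib import Path

# File name -> destination directory, copied from the list above.
EXPECTED_FILES = {
    "Autoregressive_Transformer_parameters.pt": "model_parameters/transformer_model_parameters",
    "prepared_transformer_data.csv": "data/prepared_transformer_datasets",
    "raw_transformer_data.csv": "data/raw_transformer_datasets",
    "train_hERG.h5": "data/prepared_cardiac_datasets",
}


def missing_files(project_root="."):
    """Return the expected file paths that do not exist under project_root."""
    root = Path(project_root)
    expected = [root / directory / name
                for name, directory in EXPECTED_FILES.items()]
    return [path for path in expected if not path.is_file()]
```

Calling missing_files() from the CardioGenAI project directory returns an empty list once all four files are in place; any paths it returns still need to be downloaded or moved.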

Running the Software

Running the complete CardioGenAI framework, performing inference with the discriminative models, and reproducing the figures in the manuscript can easily be achieved with the Jupyter notebook provided with this repository. Simply navigate to the CardioGenAI project directory, open the _run.ipynb notebook, and select the cardiogenai_env environment as the kernel. Usage instructions are below.

Running the CardioGenAI Framework

To optimize a cardiotoxic compound with CardioGenAI, utilize the optimize_cardiotoxic_drug function from the Optimization_Framework module:

from src.Optimization_Framework import optimize_cardiotoxic_drug

optimize_cardiotoxic_drug(input_smiles,
                          herg_activity,
                          nav_activity,
                          cav_activity,
                          n_generations,
                          device)

  • input_smiles (str): The input SMILES string of the compound that you seek to optimize for reduced cardiac ion channel activity.
  • herg_activity (tuple or str): hERG activity for which to filter. If the entry is a string, it must be either 'blockers' or 'non-blockers'. If it is a tuple, it must indicate a range of activity values.
  • nav_activity (tuple or str): NaV1.5 activity for which to filter. If the entry is a string, it must be either 'blockers' or 'non-blockers'. If it is a tuple, it must indicate a range of activity values.
  • cav_activity (tuple or str): CaV1.2 activity for which to filter. If the entry is a string, it must be either 'blockers' or 'non-blockers'. If it is a tuple, it must indicate a range of activity values.
  • n_generations (int): The number of optimized drug candidates to generate. Default is 100.
  • device (str): The device to use for the optimization. Must be either 'gpu' or 'cpu'. Default is 'gpu'.
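
As an illustration of the argument rules above, the sketch below checks a set of arguments before they are handed to optimize_cardiotoxic_drug. The helpers check_activity and build_optimization_arguments are hypothetical and not part of CardioGenAI. The sketch also assumes that a tuple is ordered (low, high) and expressed in predicted pIC50 units, that the function accepts keyword arguments named as in the signature shown, and that the pimozide SMILES (written out from its IUPAC name) is verified before real use. The numeric range is an arbitrary example.

```python
VALID_LABELS = ("blockers", "non-blockers")
VALID_DEVICES = ("gpu", "cpu")


def check_activity(name, value):
    """Validate one activity filter: a label string or a (low, high) tuple."""
    if isinstance(value, str):
        if value not in VALID_LABELS:
            raise ValueError(f"{name} must be one of {VALID_LABELS}, got {value!r}")
        return value
    if isinstance(value, tuple) and len(value) == 2:
        low, high = value
        if low > high:
            raise ValueError(f"{name} range must be (low, high), got {value!r}")
        return (float(low), float(high))
    raise TypeError(f"{name} must be a str or a 2-tuple, got {type(value).__name__}")


def build_optimization_arguments(input_smiles, herg_activity, nav_activity,
                                 cav_activity, n_generations=100, device="gpu"):
    """Return a validated keyword-argument dict for the optimization call."""
    if not isinstance(input_smiles, str) or not input_smiles:
        raise ValueError("input_smiles must be a non-empty SMILES string")
    if not isinstance(n_generations, int) or n_generations < 1:
        raise ValueError("n_generations must be a positive integer")
    if device not in VALID_DEVICES:
        raise ValueError(f"device must be one of {VALID_DEVICES}, got {device!r}")
    return {
        "input_smiles": input_smiles,
        "herg_activity": check_activity("herg_activity", herg_activity),
        "nav_activity": check_activity("nav_activity", nav_activity),
        "cav_activity": check_activity("cav_activity", cav_activity),
        "n_generations": n_generations,
        "device": device,
    }


# Pimozide, written out from its IUPAC name for illustration; verify before use.
PIMOZIDE = "O=C1Nc2ccccc2N1C1CCN(CCCC(c2ccc(F)cc2)c2ccc(F)cc2)CC1"

arguments = build_optimization_arguments(
    PIMOZIDE,
    herg_activity="non-blockers",
    nav_activity="non-blockers",
    cav_activity=(3.0, 5.5),  # arbitrary example range, assumed to be pIC50
    device="cpu",
)

# Inside the CardioGenAI environment the call would then be:
# from src.Optimization_Framework import optimize_cardiotoxic_drug
# optimize_cardiotoxic_drug(**arguments)
```

Validating up front means that a misspelled label such as 'nonblockers' fails immediately rather than after generation has started.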

Performing Inference with the Discriminative Models

To predict activity against the hERG, NaV1.5 and CaV1.2 channels, utilize the predict_cardiac_ion_channel_activity function from the Discriminator module:

from src.Discriminator import predict_cardiac_ion_channel_activity

predict_cardiac_ion_channel_activity(input_data,
                                     prediction_type,
                                     predict_hERG,
                                     predict_Nav,
                                     predict_Cav,
                                     device)

  • input_data (str or list): The input data that the discriminative models will process. If the entry is a string, it must be either a SMILES string or a path to a prepared h5 file. If it is a list, it must be a list of SMILES strings.
  • prediction_type (str): Either 'regression' or 'classification'. Default is 'regression'.
  • predict_hERG (bool): Whether to predict hERG activity. Default is True.
  • predict_Nav (bool): Whether to predict NaV1.5 activity. Default is False.
  • predict_Cav (bool): Whether to predict CaV1.2 activity. Default is False.
  • device (str): The device to use for the inference computations. Must be either 'gpu' or 'cpu'. Default is 'gpu'.
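
The three accepted forms of input_data and the documented defaults can be summarized in a short sketch. The helpers classify_input and build_prediction_arguments are hypothetical and not part of CardioGenAI. Recognizing a prepared file by its .h5 extension and requiring at least one channel to be selected are assumptions of this sketch, not documented behavior.

```python
VALID_PREDICTION_TYPES = ("regression", "classification")
VALID_DEVICES = ("gpu", "cpu")


def classify_input(input_data):
    """Label input_data as 'smiles', 'h5_path', or 'smiles_list'."""
    if isinstance(input_data, list):
        if not input_data or not all(isinstance(s, str) and s for s in input_data):
            raise ValueError("a list input must contain non-empty SMILES strings")
        return "smiles_list"
    if isinstance(input_data, str):
        if not input_data:
            raise ValueError("input_data must not be empty")
        # Assumption: a prepared file is recognized by its .h5 extension.
        return "h5_path" if input_data.endswith(".h5") else "smiles"
    raise TypeError("input_data must be a str or a list of str")


def build_prediction_arguments(input_data, prediction_type="regression",
                               predict_hERG=True, predict_Nav=False,
                               predict_Cav=False, device="gpu"):
    """Return a validated keyword-argument dict using the documented defaults."""
    classify_input(input_data)  # raises on unsupported input
    if prediction_type not in VALID_PREDICTION_TYPES:
        raise ValueError(f"prediction_type must be one of {VALID_PREDICTION_TYPES}")
    if device not in VALID_DEVICES:
        raise ValueError(f"device must be one of {VALID_DEVICES}")
    if not (predict_hERG or predict_Nav or predict_Cav):
        raise ValueError("select at least one channel to predict")
    return {
        "input_data": input_data,
        "prediction_type": prediction_type,
        "predict_hERG": predict_hERG,
        "predict_Nav": predict_Nav,
        "predict_Cav": predict_Cav,
        "device": device,
    }


# Ethanol and benzene serve as placeholder SMILES strings.
arguments = build_prediction_arguments(["CCO", "c1ccccc1"],
                                       prediction_type="classification",
                                       predict_Nav=True, device="cpu")

# Inside the CardioGenAI environment the call would then be:
# from src.Discriminator import predict_cardiac_ion_channel_activity
# predict_cardiac_ion_channel_activity(**arguments)
```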

Reproducing the Figures in the Manuscript

To reproduce the results presented in the manuscript, utilize the get_figures function from the Figures module:

from src.Figures import get_figures

get_figures()
