Skip to content

This project addresses the task of predicting a molecule's biological activity or toxicity based on its chemical structure. Using a Quantitative Structure-Activity Relationship (QSAR) approach, we aim to train a machine learning model capable of accurately classifying molecules as active, inactive, or unknown.

Notifications You must be signed in to change notification settings

AdamAdonyi/Deep-Learning-Neural-network-based-Biological-Activity-prediction-from-Chemical-Structure-QSAR

Repository files navigation

QSAR-Challenge-Predicting-Biological-Activity-from-Chemical-Structure

This project addresses the task of predicting a molecule's biological activity or toxicity based on its chemical structure. Using a Quantitative Structure-Activity Relationship (QSAR) approach, we aim to train a machine learning model capable of accurately classifying molecules as active, inactive, or unknown.

Data:

The dataset consists of molecules (12 000) represented in SMILES format, along with their corresponding activity labels of 11 separated category (+1: active, 0: unknown, -1: inactive). The data is super inbalanced but see more details in the EDA. Test set has 5896 SMILES without labels in all 11 categories to evaluate the model. Challenge host provided server evaluating the model based on calculated AUC for prediction of the test set.

Data preparation had 3 pillars: Morgan Fingerprint (1024 added features), MACCSKeys (167 newly generated features), RDKit Desc. (210 new features) based on the SMILES.

Generated training data: (12000x1385)

Model:

I employed Deep Learning Neural network models. The models were trained on a portion of the dataset and evaluated on a separate testing set. Since the dataset was inbalanced, I created a model separately for each task. Each task-specific-model has a unique, distribution-specified class_weight assigned. The models have 6 layers with ReLu as activation function and with instituted drop out layers (0.2). As Final output layer with 3 neurons, the activation was softmax. The used loss functin was CategoricalCrossentropy, the optimizer was Adamax with 0.001 learning rate and 0.9 momentum. Early stopping was tested (code sometimes contains it) but not used since I used 35 epochs with 32 batch size.

Evaluation:

The performance was assessed using the auc on the evaluation server, which calculates the mean Area Under the ROC Curve (AUC) while considering the presence of unknown labels. During the training I modified hyperparameters based on the measured loss, accuracy, recall and F1 score.

Results:

I reached 0.752 AUC which considered a good score in case of the unbalanced data

Future Work:

It would be nice to improve the performance:

  • generating more data (Morgan Fingerprint)
  • introducing more QC step focusing on distributions of descriptions
  • testing optimizers and fine tune hyperparameter (gridsearch)

About

This project addresses the task of predicting a molecule's biological activity or toxicity based on its chemical structure. Using a Quantitative Structure-Activity Relationship (QSAR) approach, we aim to train a machine learning model capable of accurately classifying molecules as active, inactive, or unknown.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published