QSAR-Challenge-Predicting-Biological-Activity-from-Chemical-Structure

This project addresses the task of predicting a molecule's biological activity or toxicity based on its chemical structure. Using a Quantitative Structure-Activity Relationship (QSAR) approach, we aim to train a machine learning model capable of accurately classifying molecules as active, inactive, or unknown.

Data:

The dataset consists of 12,000 molecules represented in SMILES format, each with activity labels for 11 separate categories (+1: active, 0: unknown, -1: inactive). The data is highly imbalanced; see the EDA for details. The test set contains 5,896 SMILES without labels for any of the 11 categories and is used to evaluate the model. The challenge host provided a server that scores test-set predictions by AUC.

Data preparation rested on three pillars, all derived from the SMILES strings: Morgan fingerprints (1024 features), MACCS keys (167 features), and RDKit descriptors (210 features).
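The three feature blocks can be sketched with RDKit as follows. This is a minimal illustration, not the notebook's actual code: the function name `featurize`, the Morgan radius of 2, and the way invalid SMILES are handled are assumptions. The number of RDKit descriptors depends on the installed RDKit version, so the resulting width may differ from the 1385 columns reported here.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, MACCSkeys, rdFingerprintGenerator

# Assumption: radius 2 is a common default; the README does not state the radius used.
_morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)


def _to_array(bitvect):
    """Convert an RDKit bit vector into a 1-D NumPy array of 0/1 values."""
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(bitvect, arr)
    return arr


def featurize(smiles):
    """Return one feature vector per molecule, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    morgan = _to_array(_morgan_gen.GetFingerprint(mol))    # 1024 bits
    maccs = _to_array(MACCSkeys.GenMACCSKeys(mol))          # 167 bits
    # One value per descriptor shipped with the installed RDKit version.
    desc = np.array([fn(mol) for _, fn in Descriptors.descList], dtype=float)
    return np.concatenate([morgan, maccs, desc])
```

Calling `featurize` on each row of the SMILES column and stacking the results gives a matrix with one row per molecule; any column filtering or scaling applied afterwards is not covered by this sketch.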

Generated training data: (12000x1385)

Model:

I used deep neural network models. The models were trained on a portion of the dataset and evaluated on a separate testing set. Because the dataset was imbalanced, I built a separate model for each task, and each task-specific model was assigned its own class_weight derived from that task's label distribution. The models have 6 layers with ReLU activations and dropout layers (rate 0.2). The final output layer has 3 neurons with softmax activation. The loss function was CategoricalCrossentropy, and the optimizer was Adamax with a learning rate of 0.001 and 0.9 momentum. Early stopping was tested (some versions of the code still contain it) but was not used in the end; training ran for a fixed 35 epochs with a batch size of 32.
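A per-task setup along these lines could look like the sketch below. Several details are assumptions rather than facts from the notebook: the helper names `class_weights` and `build_model`, the inverse-frequency weighting formula, the mapping of labels (-1, 0, +1) to class indices (0, 1, 2), the hidden layer widths, and the reading of "0.9 momentum" as Adamax's `beta_1`. TensorFlow is imported inside `build_model` so the weight helper can be used on its own.

```python
from collections import Counter

CLASSES = (-1, 0, 1)  # inactive, unknown, active -> class indices 0, 1, 2 (assumed mapping)


def class_weights(labels):
    """Inverse-frequency weights per class index: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    weights = {}
    for idx, cls in enumerate(CLASSES):
        if counts[cls]:  # skip classes that do not occur in this task
            weights[idx] = n_samples / (len(CLASSES) * counts[cls])
    return weights


def build_model(n_features=1385, hidden=(512, 256, 128, 64, 32)):
    """Five hidden ReLU layers with dropout plus a 3-way softmax output (widths are assumed)."""
    from tensorflow import keras  # lazy import: only needed when a model is built

    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for units in hidden:
        model.add(keras.layers.Dense(units, activation="relu"))
        model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Dense(3, activation="softmax"))
    model.compile(
        # Assumption: "0.9 momentum" corresponds to beta_1, which is also the Keras default.
        optimizer=keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9),
        loss=keras.losses.CategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model


# Hypothetical usage for one task, with X the feature matrix and y_onehot the encoded labels:
# model = build_model(n_features=X.shape[1])
# model.fit(X, y_onehot, epochs=35, batch_size=32, class_weight=class_weights(task_labels))
```

With this weighting, a class that makes up a small share of a task's labels receives a proportionally larger weight, which is one common way to counter imbalance.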

Evaluation:

Performance was assessed with the AUC reported by the evaluation server, which calculates the mean area under the ROC curve (AUC) while accounting for unknown labels. During training, I tuned hyperparameters based on the measured loss, accuracy, recall, and F1 score.
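For local validation, a comparable metric can be approximated as shown below. The server's exact procedure is not documented here, so this sketch assumes that molecules with an unknown label (0) are dropped per task, that AUC is computed on active versus inactive only, and that the per-task values are averaged. The function names `binary_auc` and `mean_masked_auc` are illustrative.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a random active outranks a random inactive.

    Labels of 0 (unknown) are ignored; ties between scores count as half a win.
    Returns None when a task has no actives or no inactives.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == -1]
    if not pos or not neg:
        return None
    # Pairwise comparison is quadratic but easy to verify; fine for a sketch.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_masked_auc(labels_by_task, scores_by_task):
    """Average the per-task AUCs, skipping tasks where AUC is undefined."""
    aucs = []
    for labels, scores in zip(labels_by_task, scores_by_task):
        auc = binary_auc(labels, scores)
        if auc is not None:
            aucs.append(auc)
    if not aucs:
        raise ValueError("no task has both active and inactive labels")
    return sum(aucs) / len(aucs)
```

Here `scores` is whatever ranking score the model produces for the active class, for example the softmax probability of the active output neuron.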

Results:

I reached an AUC of 0.752, which is considered a good score given the imbalanced data.

Future Work:

Performance could be improved by:

generating more data (Morgan fingerprints)
introducing more QC steps focused on the distributions of the descriptors
testing other optimizers and fine-tuning hyperparameters (grid search)


AdamAdonyi/Deep-Learning-Neural-network-based-Biological-Activity-prediction-from-Chemical-Structure-QSAR

Folders and files

Latest commit

History

Repository files navigation

QSAR-Challenge-Predicting-Biological-Activity-from-Chemical-Structure

Data:

Generated training data: (12000x1385)

Model:

Evaluation:

Results:

Future Work:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages