This is our submission for the ChemBio AI hackathon, conducted by ChemBio AI under Prometeo'24.
Report- Report Link
Slides- Slides Link
Build a sequence-to-sequence (seq2seq) model that predicts the reaction conditions for a reaction, i.e., given a reaction SMILES as input, the model should predict the corresponding conditions.
To give an overview of the problem statement: we are given a dataset containing organic reactions in SMILES format. The conditions to be predicted for each given organic reaction are the Reagent, the Solvent, and the Catalyst.
- canonic_rxn: This column contains the canonical representation of chemical reactions using SMILES (Simplified Molecular Input Line Entry System) strings. Each row represents a distinct chemical transformation, capturing the reactants and products involved.
- rxnmapper_aam: This column encodes atom-to-atom mappings (AAM) for the reaction. AAM is a technique used to establish correspondence between atoms in reactants and products, facilitating the tracking of atom transformations during a reaction.
- Reagent: This column details the reagents utilized in the chemical reactions. Reagents are chemical substances introduced into a reaction to initiate or facilitate the transformation.
- Solvent: This column lists the solvent used in each reaction. Solvents strongly influence reaction conditions and affect both reaction rates and outcomes.
- Catalyst: This column lists the catalyst used in each reaction. Catalysts accelerate reactions without being permanently changed themselves, and knowing the catalyst gives insight into the efficiency and mechanism of a reaction.
- Yield: This column records the yield of each reaction. Yield is an essential metric for assessing how well the chosen reaction conditions performed and for measuring the success of the reaction.
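As a quick illustration, the columns described above can be inspected with pandas. This is a minimal sketch; the file name `reaction_conditions.csv` is a placeholder for the actual dataset file.

```python
import pandas as pd

# Placeholder file name; substitute the actual path to the hackathon dataset
df = pd.read_csv("reaction_conditions.csv")

# Quick look at the columns described above
print(df.columns.tolist())
print(df[["canonic_rxn", "Reagent", "Solvent", "Catalyst", "Yield"]].head())
```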
For this we used a graph-based seq2seq network, with a graph-fused encoder on the encoder side and a transformer decoder on the decoder side. The input is the reaction SMILES string, and the output consists of the three different conditions (reagent, solvent, and catalyst). Here is an overview of the model architecture:
- Each SMILES string encodes a full reaction, so the first task is to separate the reactants from the products (see the sketch after this list).
- The reactants and products are each passed through the Graph Encoder, which first extracts graph features and then applies an attention encoder to them.
- A decoder is then attached, the outputs of both branches are fused into a single representation, and this representation is projected through three separate heads to give our three different outputs.
- Tokenization and padding are based on CountVectorizer and spaCy (also illustrated in the sketch below).
- The output comes out as a 1-D tensor, which can later be decoded back into the exact condition strings.
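A minimal sketch of the reactant/product split and the tokenization step, assuming the `reactants>>products` SMILES convention and a character-level scikit-learn `CountVectorizer`; the exact tokenizer settings in the notebooks may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_reaction(rxn_smiles: str):
    """Split a reaction SMILES of the form 'reactants>>products'."""
    reactants, products = rxn_smiles.split(">>")
    return reactants, products

# Character-level vocabulary over the SMILES strings (one possible choice)
vectorizer = CountVectorizer(analyzer="char", lowercase=False)
vectorizer.fit(["CCO>>CC=O", "c1ccccc1Br>>c1ccccc1N"])

def encode(smiles: str, max_len: int = 64, pad_id: int = 0):
    """Map a SMILES string to padded integer token ids."""
    vocab = vectorizer.vocabulary_
    ids = [vocab[ch] + 1 for ch in smiles if ch in vocab]  # +1 reserves 0 for padding
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))

reactants, products = split_reaction("CCO>>CC=O")
print(encode(reactants)[:10])
```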
Here is an overview of the architecture and the flow of variables through it.
Given below is the flow of variables: how data moves through the network, what the inputs are, and what their respective outputs are.
Given below is the detailed architecture.
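To make the flow concrete, here is a simplified PyTorch sketch of the two-branch encoder, the fusing decoder, and the three output heads. The class names, layer sizes, and the use of standard `nn.TransformerEncoder`/`nn.TransformerDecoder` layers are assumptions for illustration and stand in for the actual graph-fused encoder used in the notebooks.

```python
import torch
import torch.nn as nn

class GraphAttentionEncoder(nn.Module):
    """Simplified stand-in for the graph-fused encoder: embeds tokenized
    SMILES and runs self-attention over the embeddings."""
    def __init__(self, vocab_size: int, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        return self.encoder(self.embed(tokens))     # (batch, seq_len, d_model)

class ConditionPredictor(nn.Module):
    """Two encoder branches (reactants, products), a transformer decoder that
    fuses them, and three output heads for reagent, solvent, and catalyst."""
    def __init__(self, vocab_size, n_reagent, n_solvent, n_catalyst, d_model=128):
        super().__init__()
        self.reactant_enc = GraphAttentionEncoder(vocab_size, d_model)
        self.product_enc = GraphAttentionEncoder(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))   # learned decoder query
        self.reagent_head = nn.Linear(d_model, n_reagent)
        self.solvent_head = nn.Linear(d_model, n_solvent)
        self.catalyst_head = nn.Linear(d_model, n_catalyst)

    def forward(self, reactant_tokens, product_tokens):
        # Encode both sides and fuse them into a single memory sequence
        memory = torch.cat([self.reactant_enc(reactant_tokens),
                            self.product_enc(product_tokens)], dim=1)
        query = self.query.expand(reactant_tokens.size(0), -1, -1)
        fused = self.decoder(query, memory).squeeze(1)           # (batch, d_model)
        return (self.reagent_head(fused),
                self.solvent_head(fused),
                self.catalyst_head(fused))
```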
For training, I used Adam as the optimizer and MSE for the loss calculation. A loss is computed for each output, the total loss is obtained by adding the individual losses, and this total loss is backpropagated. I trained the model for 10 epochs; the loss curve is shown here. The combined loss starts in the 300 range and keeps decreasing with each epoch.
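A minimal sketch of the training loop as described above (Adam, per-output MSE losses summed into one total loss, 10 epochs), using the `ConditionPredictor` sketch from the previous section. The vocabulary and class sizes are placeholders, and the random tensors stand in for the real DataLoader built from the dataset.

```python
import torch
import torch.nn as nn

# Placeholder sizes; the real values come from the tokenizer vocabularies
model = ConditionPredictor(vocab_size=64, n_reagent=50, n_solvent=30, n_catalyst=40)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Dummy batch standing in for the real DataLoader built from the dataset
reactants = torch.randint(1, 64, (8, 32))
products = torch.randint(1, 64, (8, 32))
reagent_t, solvent_t, catalyst_t = torch.rand(8, 50), torch.rand(8, 30), torch.rand(8, 40)

for epoch in range(10):                            # trained for 10 epochs
    optimizer.zero_grad()
    reagent_p, solvent_p, catalyst_p = model(reactants, products)
    # One MSE loss per output head, summed into a single total loss
    total_loss = (criterion(reagent_p, reagent_t)
                  + criterion(solvent_p, solvent_t)
                  + criterion(catalyst_p, catalyst_t))
    total_loss.backward()                          # backpropagate the combined loss
    optimizer.step()
    print(f"epoch {epoch + 1}: total loss {total_loss.item():.2f}")
```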
For inference, I initialize the target sequences as zeros and compute the output probabilities, which I then decode using the dictionary built from the tokenizer, i.e., by looking up the string corresponding to each output.
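A simplified view of the decoding step, continuing from the sketches above. The index-to-string dictionaries and their entries are illustrative placeholders for the mappings built from the tokenizer, and the full pipeline additionally initializes the decoder target sequence with zeros before the forward pass.

```python
import torch

# Placeholder id -> string lookups built from the tokenizer vocabularies
id_to_reagent = {0: "Pd(PPh3)4", 1: "K2CO3"}       # illustrative entries only
id_to_solvent = {0: "THF", 1: "DMF"}
id_to_catalyst = {0: "CuI", 1: "Pd/C"}

model.eval()
with torch.no_grad():
    reagent_p, solvent_p, catalyst_p = model(reactants[:1], products[:1])

# Convert raw scores to probabilities and pick the most likely class per head
reagent = id_to_reagent.get(int(reagent_p.softmax(-1).argmax(-1)), "<unk>")
solvent = id_to_solvent.get(int(solvent_p.softmax(-1).argmax(-1)), "<unk>")
catalyst = id_to_catalyst.get(int(catalyst_p.softmax(-1).argmax(-1)), "<unk>")
print(reagent, solvent, catalyst)
```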
Notebooks- This folder contains all the experiments, preprocessing, and sample architectures. The full Training Loop.ipynb is our main file, in which all of the architectures and pipelines are compiled into one and the training is carried out. The rest are Python files for each of the different components, for running the training on a local machine if needed.