The preprint for this work is posted on arXiv.
- Data/process.py scripts to preprocess the PaRoutes dataset and create the training and evaluation partitions.
- Models/Architecture.py definitions of the Encoder, the Decoder, and the Seq2Seq module that combines them.
- Models/Training.py definition of the Lightning training class.
- Models/Configure.py definition of the model config.
- Models/Generation.py implementation of beam search using python lists
- Models/TensorGen.py implementation of beam search using torch.Tensors to maximize GPU efficiency. Warning: the current algorithm works properly only with batch_size=1 inputs (PRs welcome).
- Utils/Dataset.py definition of custom torch Datasets used for training and evaluation.
- Utils/PreProcess.py all functions related to preprocessing of the PaRoutes dataset (used by Data/process.py)
- Utils/PostProcess.py all functions needed to postprocess results of beam search and run evaluations
- Utils/Visualize.py function that draws the synthesis tree as a PDF.
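To illustrate what a list-based beam search like the one in Models/Generation.py does conceptually, here is a minimal, generic sketch. It is not the repository's implementation: the function `beam_search`, the callback `next_log_probs`, and the toy scoring model are all hypothetical names introduced for this example, and a real decoder would supply the per-token log-probabilities.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[int], float]  # (token ids, cumulative log-probability)


def beam_search(
    next_log_probs: Callable[[List[int]], Dict[int, float]],  # hypothetical model callback
    bos: int,
    eos: int,
    beam_size: int = 3,
    max_steps: int = 10,
) -> List[Hypothesis]:
    """Return up to beam_size hypotheses, best first, using plain Python lists."""
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_steps):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))  # hypothesis is complete
            else:
                beams.append((seq, score))     # hypothesis stays live
        if not beams:
            break
    finished.extend(beams)  # hypotheses still unfinished when max_steps ran out
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


def toy_model(prefix: List[int]) -> Dict[int, float]:
    """Hypothetical stand-in for a decoder over the vocabulary {1, 2}, where 2 is EOS."""
    if len(prefix) == 1:
        return {1: math.log(0.6), 2: math.log(0.4)}
    return {1: math.log(0.1), 2: math.log(0.9)}


results = beam_search(toy_model, bos=0, eos=2, beam_size=2, max_steps=4)
for tokens, log_prob in results:
    print(tokens, round(math.exp(log_prob), 2))
```

A tensor-based variant would perform the same expand, rank, and prune steps on batched tensors instead of Python lists, which is where handling batch sizes greater than 1 becomes more involved.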
For training see:
- train_nosm.py - without starting materials (SM) provided to the encoder
- train_wsm.py - with starting materials (SM) provided to the encoder
Once everything is set up, it suffices to run python train_wsm.py.
Run bash download_ckpts.sh to download our checkpoints from the file storage.
Finally, we provide assess_single.py, which lets you run our model on a single target compound.
To use the tutorials, simply move/copy them to the root directory. This is necessary because the notebooks use relative imports.
- Tutorials/Basic_Usage.ipynb walks you through how to input your compounds, steps, and starting materials. Visualization of routes in PDF is shown.
- Tutorials/Route_Separation.ipynb reproduces the route separation results from the paper.
- Tutorials/Pharma_Compounds.ipynb reproduces the three FDA-approved drug results from the paper.
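If you would rather not move the notebooks, one possible workaround is to put the repository root on `sys.path` from the first notebook cell. The sketch below assumes the notebooks import the top-level Models and Utils packages; the helper names `find_repo_root` and `add_repo_root_to_path` are hypothetical and not part of this repository. Any relative file paths used inside the notebooks (for example, to checkpoints) would still resolve against the notebook's working directory, so moving or copying the notebooks remains the supported route.

```python
import sys
from pathlib import Path
from typing import Optional, Tuple


def find_repo_root(start: Path, markers: Tuple[str, ...] = ("Models", "Utils")) -> Optional[Path]:
    """Walk upward from start until a directory containing all marker folders is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if all((candidate / marker).is_dir() for marker in markers):
            return candidate
    return None  # no enclosing directory looks like the repository root


def add_repo_root_to_path(start: Optional[Path] = None) -> Optional[Path]:
    """Prepend the detected repository root to sys.path so top-level packages import."""
    root = find_repo_root(start or Path.cwd())
    if root is not None and str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# Typical use in the first cell of a notebook kept inside Tutorials/:
# add_repo_root_to_path()
```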
All code is licensed under MIT License. The content of the pre-print on arXiv is licensed under CC-BY 4.0.
- Bring code coverage to 80% or higher.
- Revise Models/TensorGen.py so that it can work with batch size greater than 1.