Materials for the Computer-aided molecular design course at THM Gießen WS2019/2020
In this practical exercise, which takes place between the two lecture blocks, you will become a computational chemist. You will collect affinity data for a protein kinase target. If you do not know which protein kinase target to work on, please drop me an email.
Please perform the following steps:
- Go to UniProt and find the human UniProt identifier for your protein kinase.
- Select the IC50 assays for your target in ChEMBL and remove PAINS compounds.
- Create a pandas DataFrame containing the IC50 data, and keep the values that carry an operator (such as > or <).
- Average the IC50 values that have no operator. If a compound has several IC50 values with operators measured at different ligand concentrations, keep only one, the one with the most information content: given >10 and >1, choose >10. If a compound has IC50 values both with and without an operator, consider only the values without an operator.
- Create training and test datasets (20% test set).
- Build five different classification ML models predicting kinase activity using different scikit-learn learners. Use 1 µM as the activity threshold.
- Analyse the models in terms of accuracy, sensitivity and specificity using cross-validation. Check your final, best model on the test dataset.
- Select one model with good recall and one model with good precision
- Create another training/test split and build a regression model. Select the best model based on AUC.
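
The curation, labelling and evaluation steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference solution: the function names, the `(compound_id, operator, value_um)` record layout and the compound IDs are hypothetical, the arithmetic mean is assumed for averaging, "active" is assumed to mean IC50 <= 1 µM, and compounds that carry both > and < values but no exact value are skipped because the steps above do not cover that case. In your own solution the same logic would typically be expressed with pandas.

```python
from collections import defaultdict
from statistics import mean


def curate_ic50(records):
    """Collapse several IC50 measurements into one entry per compound.

    records: iterable of (compound_id, operator, value_um) tuples, with
    operator one of '=', '>' or '<' (hypothetical layout).
    Returns a dict mapping compound_id -> (operator, value_um).
    """
    by_compound = defaultdict(list)
    for compound_id, operator, value_um in records:
        by_compound[compound_id].append((operator, value_um))

    curated = {}
    for compound_id, measurements in by_compound.items():
        exact = [v for op, v in measurements if op == "="]
        greater = [v for op, v in measurements if op == ">"]
        less = [v for op, v in measurements if op == "<"]
        if exact:
            # Exact values win over qualified ones and are averaged.
            curated[compound_id] = ("=", mean(exact))
        elif greater and not less:
            # '>10' carries more information than '>1': keep the largest bound.
            curated[compound_id] = (">", max(greater))
        elif less and not greater:
            # By symmetry, keep the smallest upper bound.
            curated[compound_id] = ("<", min(less))
        # Assumption: mixed '>' and '<' without exact values is skipped.
    return curated


def label_activity(operator, value_um, threshold_um=1.0):
    """Return True (active), False (inactive) or None (cannot be decided)."""
    if operator == "=":
        return value_um <= threshold_um
    if operator == ">":
        # A lower bound at or above the threshold proves inactivity.
        return False if value_um >= threshold_um else None
    if operator == "<":
        # An upper bound at or below the threshold proves activity.
        return True if value_um <= threshold_um else None
    raise ValueError(f"unsupported operator: {operator!r}")


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall) and specificity from boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Made-up example data, values in µM.
example_records = [
    ("cpd_A", "=", 0.25), ("cpd_A", "=", 0.75), ("cpd_A", ">", 10.0),
    ("cpd_B", ">", 1.0), ("cpd_B", ">", 10.0),
    ("cpd_C", "<", 0.5),
]
curated = curate_ic50(example_records)
labels = {cid: label_activity(op, value) for cid, (op, value) in curated.items()}
```

Here `cpd_A` keeps the mean of its two exact values, `cpd_B` keeps the more informative `>10` entry and is labelled inactive, and `cpd_C` is labelled active. Qualified values that do not decide activity (for example `>0.1` with a 1 µM threshold) are returned as `None` so that they can be excluded from the classification set.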
Materials used
For each step, please revisit the excellent TeachOpenCADD talktorials (Jupyter notebooks) that we discussed during the lecture.
You can work with Jupyter notebooks, but in the end please provide a Python script (it can be based on the notebooks) that can be executed from the command line and performs every analysis step. Make PAINS filtering optional.
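
One possible way to structure the command-line entry point, with PAINS filtering as an optional switch, is sketched below using the standard-library `argparse` module. The argument names (`uniprot_id`, `--pains-filter`, `--test-fraction`, `--threshold-um`) and the step descriptions are illustrative assumptions, not a required interface; the actual analysis functions are left out.

```python
import argparse


def build_parser():
    """Create the command-line parser (all option names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Kinase IC50 modelling pipeline (sketch)."
    )
    parser.add_argument("uniprot_id", help="human UniProt identifier of the kinase")
    parser.add_argument(
        "--pains-filter",
        action="store_true",
        help="remove PAINS compounds before modelling (off by default)",
    )
    parser.add_argument(
        "--test-fraction", type=float, default=0.2,
        help="fraction of the data held out as test set",
    )
    parser.add_argument(
        "--threshold-um", type=float, default=1.0,
        help="IC50 activity threshold in µM",
    )
    return parser


def plan_steps(args):
    """Return the ordered list of analysis steps implied by the arguments."""
    steps = [f"fetch IC50 data for {args.uniprot_id}"]
    if args.pains_filter:
        steps.append("remove PAINS compounds")
    steps += [
        "curate IC50 values",
        f"split data (test fraction {args.test_fraction})",
        f"build classification models (threshold {args.threshold_um} µM)",
        "build regression model",
    ]
    return steps


def main(argv=None):
    """Parse the arguments and print the planned steps, one per line."""
    args = build_parser().parse_args(argv)
    for step in plan_steps(args):
        print(step)
```

In a real script, `main()` would be called under the usual `if __name__ == "__main__":` guard, and each planned step would invoke the corresponding analysis function. A call such as `python script.py P00533 --pains-filter` (P00533 being the UniProt identifier of human EGFR, used here only as an example) would then run the pipeline with PAINS filtering enabled.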
Thanks and enjoy!