Some problems are so difficult that no single researcher, research group, research institute, or multinational company can make meaningful progress alone. They take a worldwide effort and collaboration between industry and academia. Drug discovery is one such area. Advances in data science (also known as artificial intelligence or machine learning) are being applied to data sets from high-throughput experimental techniques and to historical databases of biomedical literature that are publicly available to the world community.
The process of small-molecule drug development involves gradually narrowing tens of thousands of small molecules down to a drug candidate that is eventually given to patients in clinical trials. This is a long process (decades, often the whole career of a researcher) and a costly one, and it engages all corners of our interconnected economy (scientists, physicians, entrepreneurs, investors, pharmaceutical companies, government officials). These real-world constraints push researchers away from risky questions and leave many diseases untreated. Computational methods that have become popular within the past decade can help make data-driven decisions earlier in the process, so that drugs can be developed better, faster, and cheaper. At this workshop you will get hands-on experience solving the types of problems that keep our researchers up at night.
- Input data: a precomputed and relatively clean data set of ~1000 drug-like molecules by ~100 chemical features
- Goal: Your job is to group drug-like molecules into a smaller, diverse, and representative set. This is a real-world unsupervised learning (clustering) problem encountered in a biotech startup. There is underlying structure in this data set; we have solved it one way and are curious to see how you solve it.
- Hints: you will be given clues about the structure of the data at the event, but for now it's top secret! We have prepared Python code snippets (pandas, numpy, scikit-learn) for a solution using k-means clustering to move you toward the goal within the time constraints of the event.
- This Jupyter notebook is here to help facilitate the workshop
from IPython.display import Image
Image("Screen Shot 2016-10-27 at 3.29.58 PM.png")
- If you don't have pandas, numpy, scikit-learn, matplotlib, etc. installed, then install them with
pip install pandas numpy scikit-learn matplotlib
- You can check which libraries you have installed with
pip freeze
import pandas as pd
inputfile = 'chemicalDataForStudents20161027-110104.csv'
df = pd.read_csv(inputfile, sep=',')
# take a peek at the data
print(df.shape)
print(df.tail(3))
(1650, 111)
LabuteASA MaxAbsEStateIndex MaxAbsPartialCharge MaxEStateIndex \
1514 164.703793 14.880596 0.496768 14.880596
161 148.584776 5.798308 0.493601 5.798308
1220 138.515751 12.549085 0.347020 12.549085
MaxPartialCharge MinAbsEStateIndex MinAbsPartialCharge \
1514 0.350866 0.002799 0.350866
161 0.215753 0.686138 0.215753
1220 0.244674 0.081001 0.244674
MinEStateIndex MinPartialCharge MolLogP ... \
1514 -0.875036 -0.496768 3.48928 ...
161 0.686138 -0.493601 4.33182 ...
1220 -0.668981 -0.347020 1.25950 ...
fr_term_acetylene fr_tetrazole fr_thiazole fr_thiophene \
1514 0 0 0 0
161 0 0 0 0
1220 0 0 0 0
fr_unbrch_alkane fr_urea \
1514 0 0
161 5 0
1220 0 0
smiles \
1514 COc1ccc(cc1)c2ccc(c(c2)F)N\3C(=O)CS/C3=C(/C#N)...
161 Cc1cc(on1)CCCCCCCOc2ccc(cc2)C3=NCCO3
1220 CCCC[C@H](CN(C=O)O)C(=O)N[C@H](C(=O)N(C)C)C(C)...
codeName y_pred clusterSize_y_pred
1514 JamesWatt-JeanBaptisteLamarck 1390 1
161 DanielBernoulli-CharlesDarwin 1391 1
1220 Empedocles-CharlesAugustindeCoulomb 1392 1
[3 rows x 111 columns]
# each chemical can be represented by a SMILES string
print(df.head().smiles)
# each compound has a codeName
# the codeNames are how we will refer to compounds after the analysis (rather than by row number or smiles)
print(df.head().codeName)
525 Cc1cc(nc(c1)N)COC[C@H](CN)OCc2cc(cc(n2)N)C
526 Cc1cc(nc(c1)N)COCC[C@@H](CN)OCc2cc(cc(n2)N)C
527 Cc1cc(nc(c1)N)COC[C@@H]([C@H](C)OCc2cc(cc(n2)N...
528 Cc1cc(nc(c1)N)COC[C@@H](CN)OCc2cc(cc(n2)N)C
415 CC(C)(C)NC(=O)[C@@H](c1ccccc1)NC(=O)N(C)Cc2ccc...
Name: smiles, dtype: object
525 JamesClerkMaxwell-ErnstMayr
526 BillNye-FrankHornby
527 CharlesLyell-ErwinSchrodinger
528 Empedocles-GustavKirchoff
415 CharlesAugustindeCoulomb-FrancisCrick
Name: codeName, dtype: object
- Real data is messy. Data sanitization involves
- removing features that didn't compute for all samples, or samples that are missing features
- removing outliers that you suspect are artefacts or that will wildly bias the predictions that come from the data
- The data provided has been filtered a bit, but be warned that this is an important part of the process and can take a long time
- The features need to be treated equally. Changing a feature's units from grams to kilograms changes its numbers by a factor of 1000, but that does not make the differences between samples any more or less important
- There are various ways to standardize data. You may have read about standard scores (z-scores). In the end each feature should be centred around the same value and span a comparable range.
- The way this is done should preserve the variation in each feature. So remember your numerical methods class and beware of floating-point pitfalls such as cancellation error when subtracting nearly equal numbers.
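The sanitization and standardization steps above can be sketched on a toy table. This is a minimal illustration, not the workshop's own pipeline: the function name `clean_and_scale` and the toy column names are hypothetical, and the choice to drop constant columns is an assumption made here because a feature with zero range would otherwise cause a division by zero.

```python
import numpy as np
import pandas as pd


def clean_and_scale(features: pd.DataFrame, method: str = "range") -> pd.DataFrame:
    """Drop unusable columns, then put every remaining feature on a common scale."""
    # Remove features that did not compute for every sample (NaN or infinite values).
    finite = features.replace([np.inf, -np.inf], np.nan).dropna(axis=1)
    # Remove constant features: they carry no information and have zero range.
    varying = finite.loc[:, finite.max() > finite.min()]
    centered = varying - varying.mean()
    if method == "zscore":
        # Standard score: zero mean, unit (population) standard deviation.
        return centered / varying.std(ddof=0)
    # Default: zero mean, range (max - min) of 1.
    return centered / (varying.max() - varying.min())


# Hypothetical toy data: the same mass in two units, a constant, and a failed feature.
toy = pd.DataFrame({
    "mass_g": [1000.0, 2000.0, 3000.0],
    "mass_kg": [1.0, 2.0, 3.0],
    "constant": [5.0, 5.0, 5.0],
    "failed": [0.1, np.nan, 0.3],
})
scaled = clean_and_scale(toy)
print(scaled)
```

After scaling, the gram and kilogram columns are identical, which is the sense in which a change of units should not change how much a feature counts.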
# just get features from data, remove labels
df_un = df.drop(['codeName', 'smiles'], axis=1)
# normalize: centre each feature on its mean and divide by its range
import numpy as np
df_norm = (df_un - df_un.mean()) / (df_un.max() - df_un.min())
X = np.array(df_norm)
# each column of X now has mean 0 and a range (max - min) of 1, so all values lie between -1 and 1
# you can uncomment this to check
# print('mean', np.mean(X, 0))
# print('max', np.max(X, 0))
# print('min', np.min(X, 0))
- Read these links
- K-means clustering comes up with labels for unlabelled data. It takes the data and a parameter (we call it k here) that fixes the number of clusters
- Try out different values of k using the code below
- The key line of code below is
y_pred = KMeans(n_clusters=k, random_state=random_state).fit_predict(X)
- It takes the normalized data and assigns cluster labels to it, such that there are k unique clusters.
- Properties of k
- k is an integer, since clusters are countable
- k is at least 1. This would be one big cluster
- k is at most the number of samples (the rows of X). This would treat every sample as its own cluster (a singleton)
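One possible way to try out different values of k is to score each clustering and compare. The sketch below uses the silhouette score from scikit-learn on synthetic data; the choice of silhouette as the criterion, the synthetic blobs, and the name `X_demo` are assumptions for illustration and are not part of the workshop solution.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X: three well-separated blobs of 50 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
    print(k, round(scores[k], 3))

# Pick the k whose clusters are, on average, most compact and best separated.
best_k = max(scores, key=scores.get)
print('best k by silhouette:', best_k)
```

On real chemical data the maximum is usually less clear-cut than on these blobs, so treat the score as one clue among several rather than a final answer.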
# cluster by kmeans
from sklearn.cluster import KMeans
import random
random.seed(0)
k = int(random.uniform(1, len(X))) # set k without any prior knowledge... any number between 1 and the number of samples
print('k', k)
random_state = 0
y_pred = KMeans(n_clusters=k, random_state=random_state).fit_predict(X)
df['y_pred'] = y_pred # plot and analyze unnormalized data with labels
k 1393
- Now that the clustering is done we can look at the sizes of the clusters. The function
np.histogram
- outputs two arrays: the number of clusters in each size bin, and the bin edges (here, the distinct cluster sizes plus one final right-hand edge)
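To make the two arrays concrete, here is a small sketch with made-up labels (the `labels` array is hypothetical, not workshop data). The same reading applies to the workshop output: counts of 1161, 208, 23, and 1 for sizes 1, 2, 3, and 4 add up to 1393 clusters covering 1650 molecules.

```python
import numpy as np

# Hypothetical cluster labels for 9 samples.
labels = np.array([0, 1, 1, 2, 2, 2, 3, 4, 4])
sizes = np.bincount(labels)  # members per cluster: [1, 2, 3, 1, 2]

# One bin per distinct cluster size, plus a final right-hand edge.
edges = np.append(np.unique(sizes), sizes.max() + 1)
counts, edges = np.histogram(sizes, bins=edges)
print(counts)  # how many clusters have each size
print(edges)   # the sizes, followed by the closing edge

# Sanity check: (number of clusters of a size) * (that size) sums to the sample count.
total = int((counts * edges[:-1]).sum())
print(total)
```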
# look at cluster size
cluster_sizes = df.groupby('y_pred').size()
# one bin per distinct cluster size, plus a final right-hand edge
bins = np.append(np.unique(cluster_sizes), cluster_sizes.max() + 1)
print(np.histogram(cluster_sizes, bins=bins))
# add in cluster size to df
df = pd.merge(df, pd.DataFrame({'clusterSize_y_pred': cluster_sizes}).reset_index(), on='y_pred')
print(df.tail())
(array([1161, 208, 23, 1]), array([1, 2, 3, 4, 5]))
LabuteASA MaxAbsEStateIndex MaxAbsPartialCharge MaxEStateIndex \
1645 147.806545 14.525346 0.378511 14.525346
1646 149.812648 12.574862 0.477880 12.574862
1647 181.229439 14.001653 0.460949 14.001653
1648 203.683877 11.743598 0.438042 11.743598
1649 124.973421 11.101663 0.507823 11.101663
MaxPartialCharge MinAbsEStateIndex MinAbsPartialCharge \
1645 0.154401 0.031220 0.154401
1646 0.330899 0.050841 0.330899
1647 0.258894 0.164579 0.258894
1648 0.233112 0.053390 0.233112
1649 0.230804 0.042424 0.230804
MinEStateIndex MinPartialCharge MolLogP ... \
1645 -0.910887 -0.378511 2.73290 ...
1646 -1.397395 -0.477880 1.01617 ...
1647 -0.569145 -0.460949 1.87380 ...
1648 -0.053390 -0.438042 4.59410 ...
1649 -1.291383 -0.507823 1.83460 ...
fr_term_acetylene fr_tetrazole fr_thiazole fr_thiophene \
1645 0 0 0 0
1646 0 0 0 0
1647 0 0 0 0
1648 0 0 0 0
1649 0 0 0 0
fr_unbrch_alkane fr_urea \
1645 0 0
1646 0 0
1647 0 0
1648 0 0
1649 0 0
smiles \
1645 Cn1cc(cn1)[C@H]2C[C@H]3CSC(=N[C@]3(CO2)c4ccc(c...
1646 [H]/N=C/1\NC(=O)[C@]2(S1)C=C(C[C@H]([C@@H]2NC(...
1647 c1cc(oc1)c2nc3nc(nc(n3n2)N)NCCN4CCN(CC4)c5ccc(...
1648 CCC(=O)Nc1cccc(c1)Oc2c3cc[nH]c3nc(n2)Nc4ccc(cc...
1649 c1cc2c(cc1O)OC[C@]3([C@@H]2Oc4c3cc5c(c4)OCO5)O
codeName y_pred clusterSize_y_pred
1645 MichaelFaraday-GalileoGalilei 1317 1
1646 CarlBosch-RobertHooke 404 1
1647 FrancisGalton-Anaximander 402 1
1648 BenjaminThompson-KonradLorenz 1332 1
1649 RobertKoch-AndreMarieAmpere 268 1
[5 rows x 111 columns]
# get top clusters
topClusters = df[['y_pred', 'clusterSize_y_pred']].drop_duplicates().sort_values(by='clusterSize_y_pred', ascending=False).head()
print(topClusters)
y_pred clusterSize_y_pred
525 142 4
1103 199 3
341 57 3
36 1233 3
546 218 3
- Do the compounds in the same cluster look the same?
- use this webtool to check https://cactus.nci.nih.gov/gifcreator/
- Since we know the real clusters by another method we can compare yours to ours
- Output your final list of y_pred classes with the codeNames and smiles and we can go back and check if they are the same as our classes
- The code below outputs a csv file. Details of how to submit will be given at the workshop
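How the submitted classes will actually be compared with the reference classes is not specified here. One common option, shown purely as an assumption, is the adjusted Rand index from scikit-learn, which compares two groupings of the same samples and ignores what the cluster labels are called. The label lists below are made up.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for six compounds.
reference = [0, 0, 0, 1, 1, 2]
same_grouping = [5, 5, 5, 9, 9, 7]  # same partition, different label names
different = [0, 1, 0, 1, 2, 2]      # a genuinely different partition

score_same = adjusted_rand_score(reference, same_grouping)
score_diff = adjusted_rand_score(reference, different)
print(score_same)  # identical partitions score 1.0
print(score_diff)  # unrelated partitions score near 0
```

Because k-means numbers its clusters arbitrarily, a label-invariant score like this is more meaningful than checking whether the y_pred integers match.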
# sort data
df = df.sort_values(by=['clusterSize_y_pred', 'y_pred'], ascending=[False, True])
# output data
import time
timestr = time.strftime("%Y%m%d-%H%M%S")
initials='gw'
output = 'predictedClasses' + initials + timestr +'.csv'
df.to_csv(output, sep=',', index=False)
df.head(50)
index | LabuteASA | MaxAbsEStateIndex | MaxAbsPartialCharge | MaxEStateIndex | MaxPartialCharge | MinAbsEStateIndex | MinAbsPartialCharge | MinEStateIndex | MinPartialCharge | MolLogP | ... | fr_term_acetylene | fr_tetrazole | fr_thiazole | fr_thiophene | fr_unbrch_alkane | fr_urea | smiles | codeName | y_pred | clusterSize_y_pred |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
525 | 141.729151 | 5.759574 | 0.383683 | 5.759574 | 0.123477 | 0.226210 | 0.123477 | -0.226210 | -0.383683 | 1.31854 | ... | 0 | 0 | 0 | 0 | 0 | 0 | Cc1cc(nc(c1)N)COC[C@H](CN)OCc2cc(cc(n2)N)C | JamesClerkMaxwell-ErnstMayr | 142 | 4 |
526 | 148.094093 | 5.819164 | 0.383683 | 5.819164 | 0.123477 | 0.098079 | 0.123477 | -0.098079 | -0.383683 | 1.70864 | ... | 0 | 0 | 0 | 0 | 1 | 0 | Cc1cc(nc(c1)N)COCC[C@@H](CN)OCc2cc(cc(n2)N)C | BillNye-FrankHornby | 142 | 4 |
527 | 148.094093 | 6.124918 | 0.383683 | 6.124918 | 0.123477 | 0.181412 | 0.123477 | -0.259896 | -0.383683 | 1.70704 | ... | 0 | 0 | 0 | 0 | 0 | 0 | Cc1cc(nc(c1)N)COC[C@@H]([C@H](C)OCc2cc(cc(n2)N... | CharlesLyell-ErwinSchrodinger | 142 | 4 |
528 | 141.729151 | 5.759574 | 0.383683 | 5.759574 | 0.123477 | 0.226210 | 0.123477 | -0.226210 | -0.383683 | 1.31854 | ... | 0 | 0 | 0 | 0 | 0 | 0 | Cc1cc(nc(c1)N)COC[C@@H](CN)OCc2cc(cc(n2)N)C | Empedocles-GustavKirchoff | 142 | 4 |
415 | 185.862478 | 12.947534 | 0.477530 | 12.947534 | 0.339488 | 0.007434 | 0.339488 | -1.177893 | -0.477530 | 2.91090 | ... | 0 | 0 | 0 | 0 | 0 | 1 | CC(C)(C)NC(=O)[C@@H](c1ccccc1)NC(=O)N(C)Cc2ccc... | CharlesAugustindeCoulomb-FrancisCrick | 6 | 3 |
416 | 185.862478 | 12.911895 | 0.477530 | 12.911895 | 0.339488 | 0.003046 | 0.339488 | -1.172572 | -0.477530 | 2.91250 | ... | 0 | 0 | 0 | 0 | 1 | 1 | CCCCNC(=O)[C@H](c1ccccc1)NC(=O)N(C)Cc2ccc3c(c2... | WolfgangErnstPauli-Lucretius | 6 | 3 |
417 | 201.824746 | 13.056169 | 0.477530 | 13.056169 | 0.339488 | 0.016616 | 0.339488 | -1.181843 | -0.477530 | 3.31260 | ... | 0 | 0 | 0 | 0 | 0 | 1 | CN(Cc1ccc2c(c1C(=O)O)OCO2)C(=O)N[C@@H](c3ccccc... | ErwinSchrodinger-EvangelistaTorricelli | 6 | 3 |
151 | 112.519202 | 12.044467 | 0.504068 | 12.044467 | 0.200850 | 0.003845 | 0.200850 | -0.738647 | -0.504068 | 2.57680 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1ccc(cc1)C2=CC(=O)c3c(cc(c(c3O)O)O)O2 | LinusPauling-IsaacNewton | 36 | 3 |
152 | 123.997689 | 12.233506 | 0.507966 | 12.233506 | 0.203372 | 0.033681 | 0.203372 | -0.472106 | -0.507966 | 2.58540 | ... | 0 | 0 | 0 | 0 | 0 | 0 | COc1c(cc2c(c1O)C(=O)C=C(O2)c3ccc(cc3)O)O | CarlFriedrichGauss-BillNye | 36 | 3 |
153 | 112.519202 | 12.350655 | 0.507822 | 12.350655 | 0.199995 | 0.009887 | 0.199995 | -0.312312 | -0.507822 | 2.57680 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1cc(c(cc1C2=COc3cc(ccc3C2=O)O)O)O | SigmundFreud-BenjaminFranklin | 36 | 3 |
220 | 130.436067 | 14.155215 | 0.378494 | 14.155215 | 0.154894 | 0.034118 | 0.154894 | -0.853547 | -0.378494 | 3.16290 | ... | 0 | 0 | 0 | 0 | 0 | 0 | C[C@]1(C[C@H](SC(=N1)N)c2cncnc2)c3ccc(cc3F)F | LeonardodaVinci-JamesWatson | 44 | 3 |
221 | 130.436067 | 14.314858 | 0.378512 | 14.314858 | 0.154285 | 0.254836 | 0.154285 | -0.794143 | -0.378512 | 3.08860 | ... | 0 | 0 | 0 | 0 | 0 | 0 | C[C@]1(CCSC(=N1)N)c2cc(c(cc2F)F)c3cncnc3 | JeanBaptisteLamarck-ThomasKuhn | 44 | 3 |
222 | 136.691989 | 14.247607 | 0.378494 | 14.247607 | 0.154895 | 0.047960 | 0.154895 | -0.867130 | -0.378494 | 3.97774 | ... | 0 | 0 | 0 | 0 | 0 | 0 | Cc1c(c(on1)C)[C@@H]2C[C@@](N=C(S2)N)(C)c3ccc(c... | CarolusLinnaeus-FranzBoas | 44 | 3 |
341 | 194.858937 | 12.370274 | 0.488253 | 12.370274 | 0.488253 | 0.250590 | 0.423170 | -1.478263 | -0.423170 | 2.46130 | ... | 0 | 0 | 0 | 0 | 0 | 0 | B(c1ccccc1CN2CCN(CC2)C3=NC(=O)/C(=C/c4ccc(c(c4... | Lucretius-Avicenna | 57 | 3 |
342 | 194.858937 | 12.356106 | 0.487918 | 12.356106 | 0.487918 | 0.236423 | 0.423177 | -1.443879 | -0.423177 | 2.46130 | ... | 0 | 0 | 0 | 0 | 0 | 0 | B(c1ccc(cc1)CN2CCN(CC2)C3=NC(=O)/C(=C/c4ccc(c(... | LouisdeBroglie-HenryMoseley | 57 | 3 |
343 | 194.858937 | 12.362383 | 0.487928 | 12.362383 | 0.487928 | 0.242705 | 0.423177 | -1.460291 | -0.423177 | 2.46130 | ... | 0 | 0 | 0 | 0 | 0 | 0 | B(c1cccc(c1)CN2CCN(CC2)C3=NC(=O)/C(=C/c4ccc(c(... | FranzBoas-HermannvonHelmholtz | 57 | 3 |
105 | 148.269255 | 12.380685 | 0.312156 | 12.380685 | 0.236417 | 0.064117 | 0.236417 | -3.479281 | -0.312156 | 3.31770 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CCCN1c2ccc(cc2CCC1=O)NS(=O)(=O)Cc3ccccc3 | MaxPlanck-JackHorner | 62 | 3 |
106 | 154.634197 | 12.465613 | 0.312156 | 12.465613 | 0.236417 | 0.064672 | 0.236417 | -3.492131 | -0.312156 | 3.62612 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CCCN1c2ccc(cc2CCC1=O)NS(=O)(=O)Cc3ccc(cc3)C | IsaacNewton-HeinrichHertz | 62 | 3 |
107 | 141.904313 | 12.363117 | 0.315211 | 12.363117 | 0.236417 | 0.066713 | 0.236417 | -3.481908 | -0.315211 | 2.84592 | ... | 0 | 0 | 0 | 0 | 0 | 0 | Cc1ccc(cc1)CS(=O)(=O)Nc2ccc3c(c2)CCC(=O)N3C | Lucretius-LouisPasteur | 62 | 3 |
693 | 217.369820 | 13.986122 | 0.443692 | 13.986122 | 0.407311 | 0.053205 | 0.407311 | -4.004868 | -0.443692 | 3.13200 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CC(C)[C@H]1Cc2cc(ccc2S(=O)(=O)N(C1)C[C@H]([C@H... | FrancisCrick-AlbertEinstein | 114 | 3 |
694 | 223.734762 | 14.081573 | 0.443692 | 14.081573 | 0.407311 | 0.047279 | 0.407311 | -4.031627 | -0.443692 | 3.52210 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CC(C)(C)[C@@H]1Cc2cc(ccc2S(=O)(=O)N(C1)C[C@H](... | WilliamHarvey-AlessandroVolta | 114 | 3 |
695 | 223.734762 | 14.081573 | 0.443692 | 14.081573 | 0.407311 | 0.047279 | 0.407311 | -4.031627 | -0.443692 | 3.52210 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CC(C)(C)[C@H]1Cc2cc(ccc2S(=O)(=O)N(C1)C[C@H]([... | MarieCurie-AlexanderVonHumboldt | 114 | 3 |
1000 | 161.901036 | 7.541884 | 0.485185 | 7.541884 | 0.150640 | 0.007064 | 0.150640 | -0.263058 | -0.485185 | 1.24734 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1cc(cc(c1)O[C@H]2CO[C@H]3[C@@H]2OC[C@H]3Oc4cc... | JamesWatt-JamesWatson | 148 | 3 |
1001 | 161.901036 | 7.544344 | 0.485160 | 7.544344 | 0.150640 | 0.009842 | 0.150640 | -0.273739 | -0.485160 | 1.24734 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1cc(cc(c1)O[C@@H]2CO[C@H]3[C@@H]2OC[C@@H]3Oc4... | FriedrichAugustKekule-MichaelFaraday | 148 | 3 |
1002 | 161.901036 | 7.439745 | 0.485185 | 7.439745 | 0.150639 | 0.019691 | 0.150639 | -0.234382 | -0.485185 | 1.24734 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1cc(ccc1C(=N)N)O[C@H]2CO[C@H]3[C@@H]2OC[C@H]3... | GalileoGalilei-HenryMoseley | 148 | 3 |
11 | 190.631761 | 12.020077 | 0.312609 | 12.020077 | 0.258254 | 0.175468 | 0.258254 | -0.175468 | -0.312609 | 3.65440 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1cc(ccc1CC2CCN(CC2)CCc3cnn(c3)c4c5c(ccn4)C(=O... | WernerHeisenberg-BenjaminFranklin | 171 | 3 |
12 | 184.266819 | 12.009169 | 0.312609 | 12.009169 | 0.258254 | 0.177748 | 0.258254 | -0.177748 | -0.312609 | 3.57930 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1cc(ccc1C2CCN(CC2)CCc3cnn(c3)c4c5c(ccn4)C(=O)... | ArthurEddington-PeterDebye | 171 | 3 |
13 | 194.570085 | 12.020277 | 0.312609 | 12.020277 | 0.258254 | 0.185669 | 0.258254 | -0.185669 | -0.312609 | 4.23270 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1cnc(c2c1C(=O)NC=N2)n3cc(cn3)CCN4CCC(CC4)c5cc... | JagadishChandraBose-AlbertEinstein | 171 | 3 |
217 | 138.697590 | 9.485220 | 0.507966 | 9.485220 | 0.115120 | 0.191250 | 0.115120 | 0.191250 | -0.507966 | 4.92030 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1ccc(c(c1)N=C(c2ccc(cc2)O)c3ccc(cc3)O)Cl | CharlesAugustindeCoulomb-Anaximander | 180 | 3 |
218 | 134.759266 | 9.506331 | 0.507966 | 9.506331 | 0.115120 | 0.217313 | 0.115120 | 0.217313 | -0.507966 | 4.57532 | ... | 0 | 0 | 0 | 0 | 0 | 0 | Cc1ccccc1N=C(c2ccc(cc2)O)c3ccc(cc3)O | HenryMoseley-AmedeoAvogadro | 180 | 3 |
219 | 128.394324 | 9.462216 | 0.507966 | 9.462216 | 0.115120 | 0.216220 | 0.115120 | 0.216220 | -0.507966 | 4.26690 | ... | 0 | 0 | 0 | 0 | 0 | 0 | c1ccc(cc1)N=C(c2ccc(cc2)O)c3ccc(cc3)O | Euclid-Avicenna | 180 | 3 |
1103 | 146.057666 | 10.279459 | 0.385467 | 10.279459 | 0.138992 | 0.341144 | 0.138992 | -0.613237 | -0.385467 | 4.00098 | ... | 0 | 0 | 0 | 0 | 0 | 0 | C[C@H](c1nc2cnc3c(c2n1C4CCC(CC4)CCC#N)cc[nH]3)O | Avicenna-JamesWatt | 199 | 3 |
1104 | 133.327782 | 10.193187 | 0.385467 | 10.193187 | 0.138992 | 0.157389 | 0.138992 | -0.634183 | -0.385467 | 3.22078 | ... | 0 | 0 | 0 | 0 | 0 | 0 | C[C@H](c1nc2cnc3c(c2n1C4CCC(CC4)C#N)cc[nH]3)O | ArthurEddington-LinusPauling | 199 | 3 |
1105 | 139.692724 | 10.240533 | 0.385467 | 10.240533 | 0.138992 | 0.314854 | 0.138992 | -0.622015 | -0.385467 | 3.61088 | ... | 0 | 0 | 0 | 0 | 0 | 0 | C[C@H](c1nc2cnc3c(c2n1C4CCC(CC4)CC#N)cc[nH]3)O | MarianoArtigas-RichardFeynman | 199 | 3 |
546 | 148.927912 | 12.326558 | 0.393567 | 12.326558 | 0.236881 | 0.205146 | 0.236881 | -1.125546 | -0.393567 | -1.72870 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CCCC(C(=O)N[C@@H]1[C@@H]([C@H](O[C@H]1n2cnc3c2... | JamesClerkMaxwell-GottfriedLeibniz | 218 | 3 |
547 | 148.927912 | 12.378086 | 0.393567 | 12.378086 | 0.237148 | 0.080400 | 0.237148 | -1.134309 | -0.393567 | -1.87280 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CC(C)[C@@H](C(=O)N[C@@H]1[C@@H]([C@H](O[C@H]1n... | AlbertEinstein-GottfriedLeibniz | 218 | 3 |
548 | 155.292854 | 12.536974 | 0.393567 | 12.536974 | 0.237158 | 0.029474 | 0.237158 | -1.134521 | -0.393567 | -1.48270 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CC[C@H](C)[C@@H](C(=O)N[C@@H]1[C@@H]([C@H](O[C... | Anaximander-HenryCavendish | 218 | 3 |
188 | 158.025073 | 11.988379 | 0.496768 | 11.988379 | 0.407501 | 0.019053 | 0.407501 | -1.083643 | -0.496768 | 2.51200 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CC(C)NC(=O)O[C@@H]1CC[C@@](c2c1nnn2Cc3ccc(cc3)... | Avicenna-JohnDalton | 219 | 3 |
189 | 158.025073 | 11.988379 | 0.496768 | 11.988379 | 0.407501 | 0.019053 | 0.407501 | -1.083643 | -0.496768 | 2.51200 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CC(C)NC(=O)O[C@@H]1CC[C@](c2c1nnn2Cc3ccc(cc3)O... | LouisdeBroglie-FrancisGalton | 219 | 3 |
190 | 158.025073 | 11.988379 | 0.496768 | 11.988379 | 0.407501 | 0.019053 | 0.407501 | -1.083643 | -0.496768 | 2.51200 | ... | 0 | 0 | 0 | 0 | 0 | 0 | CC(C)NC(=O)O[C@H]1CC[C@@](c2c1nnn2Cc3ccc(cc3)O... | AlexanderFleming-CarlSagan | 219 | 3 |
831 | 137.462123 | 13.727927 | 0.352008 | 13.727927 | 0.323307 | 0.144989 | 0.323307 | -0.291978 | -0.352008 | 2.80530 | ... | 0 | 0 | 0 | 0 | 0 | 0 | C[C@@H](CC(=O)NCc1ccc2c(c1)NC(=O)N2)c3ccccc3F | LouisPasteur-ThomasKuhn | 233 | 3 |
832 | 136.772520 | 13.606261 | 0.348248 | 13.606261 | 0.323307 | 0.268656 | 0.323307 | -0.368969 | -0.348248 | 2.71500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | C/C(=C\c1ccccc1F)/C(=O)NCc2ccc3c(c2)NC(=O)N3 | WernerHeisenberg-RobertHooke | 233 | 3 |
833 | 143.827065 | 13.804605 | 0.349565 | 13.804605 | 0.323307 | 0.147304 | 0.323307 | -0.293785 | -0.349565 | 3.36630 | ... | 0 | 0 | 0 | 0 | 0 | 0 | C[C@@H](CC(=O)N[C@H](C)c1ccc2c(c1)NC(=O)N2)c3c... | AageBohr-EmilFischer | 233 | 3 |
143 | 135.810779 | 12.106708 | 0.477639 | 12.106708 | 0.346775 | 0.073073 | 0.346775 | -3.898899 | -0.477639 | 1.66550 | ... | 0 | 0 | 0 | 1 | 0 | 0 | c1cc(ccc1CCNS(=O)(=O)c2ccsc2C(=O)O)C(=O)O | JaneGoodall-AlexanderVonHumboldt | 258 | 3 |
144 | 142.175721 | 12.116470 | 0.477639 | 12.116470 | 0.346775 | 0.150715 | 0.346775 | -3.863136 | -0.477639 | 2.05560 | ... | 0 | 0 | 0 | 1 | 1 | 0 | c1cc(ccc1CCCNS(=O)(=O)c2ccsc2C(=O)O)C(=O)O | EvangelistaTorricelli-ClaudiusPtolemy | 258 | 3 |
145 | 129.445837 | 12.105157 | 0.477639 | 12.105157 | 0.346775 | 0.072284 | 0.346775 | -3.954005 | -0.477639 | 1.62300 | ... | 0 | 0 | 0 | 1 | 0 | 0 | c1cc(ccc1CNS(=O)(=O)c2ccsc2C(=O)O)C(=O)O | AlessandroVolta-JohannesKepler | 258 | 3 |
173 | 145.727128 | 11.780729 | 0.550172 | 11.780729 | 0.311615 | 0.003531 | 0.311615 | -1.085590 | -0.550172 | 0.87400 | ... | 0 | 0 | 0 | 0 | 3 | 0 | c1c(cc(c(c1[N+](=O)[O-])O)I)CC(=O)NCCCCCC(=O)[O-] | JohnvonNeumann-ReneDescartes | 276 | 3 |
174 | 126.465290 | 11.681878 | 0.502092 | 11.681878 | 0.310480 | 0.017613 | 0.310480 | -0.835222 | -0.502092 | 1.60410 | ... | 0 | 0 | 0 | 0 | 3 | 0 | c1cc(c(cc1CC(=O)NCCCCCC(=O)O)[N+](=O)[O-])O | CarlSagan-WillardGibbs | 276 | 3 |
175 | 126.465290 | 11.671878 | 0.550172 | 11.671878 | 0.310480 | 0.005082 | 0.310480 | -1.085222 | -0.550172 | 0.26940 | ... | 0 | 0 | 0 | 0 | 3 | 0 | c1cc(c(cc1CC(=O)NCCCCCC(=O)[O-])[N+](=O)[O-])O | FlorenceNightingale-ErnstHaeckel | 276 | 3 |
210 | 169.244099 | 12.730152 | 0.496758 | 12.730152 | 0.259824 | 0.043983 | 0.259824 | -0.381021 | -0.496758 | 3.42917 | ... | 0 | 0 | 0 | 0 | 0 | 0 | [H]/N=C(\Cc1cccc(c1)OC)/NC(=O)c2ccc(cc2OC3CCNC... | FrancisGalton-FrancescoRedi | 318 | 3 |
50 rows × 111 columns
- If you've gotten this far you can explore
- plotting the distribution of features and comparing between y_pred classes
- dimensionality reduction with principal component analysis http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- feature selection: finding the features that describe the most variation between classes
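As a starting point for the dimensionality reduction idea, here is a minimal sketch of PCA with scikit-learn. It runs on synthetic data (the `X_demo` array is a made-up stand-in); in the workshop you would pass the normalized matrix X instead.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic stand-in: 200 samples, 5 features driven by 2 hidden directions plus noise.
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_demo)
print(X_2d.shape)
# Fraction of the total variance captured by each component, largest first.
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can be scatter-plotted and coloured by y_pred to see whether the clusters separate visually.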