What's the big picture?

Some problems are so difficult that no single researcher, research group, research institute, or multinational company can make meaningful progress alone. They take a worldwide effort and collaboration between industry and academia. Drug discovery is one such area. Advances in data science (artificial intelligence, machine learning) are being applied to data sets from high-throughput experimental techniques and to historical databases of biomedical literature that are publicly available to the world community.
Small-molecule drug development gradually narrows tens of thousands of small molecules down to a drug candidate that is eventually given to patients in clinical trials. The process is long (decades, often the whole career of a researcher) and costly, and it engages all corners of our interconnected economy (scientists, physicians, entrepreneurs, investors, pharmaceutical companies, government officials). These real-world constraints push research away from risky questions and leave many diseases untreated. Computational methods popularized within the past decade can support data-driven decisions earlier in the process, so that drugs can be developed better, faster, and cheaper. At this workshop you will get hands-on experience solving the types of problems that keep our researchers up at night.

The plan

Input data: a precomputed and relatively clean data set of ~1000 drug-like molecules by ~100 chemical features
Goal: Your job is to categorize the drug-like molecules into a smaller, diverse, and representative set of classes. This is a real-world unsupervised multi-class classification (clustering) problem encountered in a biotech startup. There is underlying structure in this data set; we have solved it one way and are curious to see how you solve it.
Hints: you will be given clues about the structure of the data at the event, but for now it's top secret! We have prepared Python code snippets (pandas, numpy, scikit-learn) for a solution using k-means clustering to move you toward the goal within the time constraints of the event.
This Jupyter notebook is here to help facilitate the workshop.

from IPython.display import Image
Image("Screen Shot 2016-10-27 at 3.29.58 PM.png")

Technical remarks

If you don't have pandas, numpy, scikit-learn, matplotlib, etc. installed, then install them with

pip install pandas numpy scikit-learn matplotlib

You can check which libraries you have installed with

pip freeze

Import data

import pandas as pd
inputfile = 'chemicalDataForStudents20161027-110104.csv'
df = pd.read_csv(inputfile, sep=',')

# take a peek at the data
print(df.shape)
print(df.tail(3))

(1650, 111)
       LabuteASA  MaxAbsEStateIndex  MaxAbsPartialCharge  MaxEStateIndex  \
1514  164.703793          14.880596             0.496768       14.880596   
161   148.584776           5.798308             0.493601        5.798308   
1220  138.515751          12.549085             0.347020       12.549085   

      MaxPartialCharge  MinAbsEStateIndex  MinAbsPartialCharge  \
1514          0.350866           0.002799             0.350866   
161           0.215753           0.686138             0.215753   
1220          0.244674           0.081001             0.244674   

      MinEStateIndex  MinPartialCharge  MolLogP         ...          \
1514       -0.875036         -0.496768  3.48928         ...           
161         0.686138         -0.493601  4.33182         ...           
1220       -0.668981         -0.347020  1.25950         ...           

      fr_term_acetylene  fr_tetrazole  fr_thiazole  fr_thiophene  \
1514                  0             0            0             0   
161                   0             0            0             0   
1220                  0             0            0             0   

      fr_unbrch_alkane  fr_urea  \
1514                 0        0   
161                  5        0   
1220                 0        0   

                                                 smiles  \
1514  COc1ccc(cc1)c2ccc(c(c2)F)N\3C(=O)CS/C3=C(/C#N)...   
161                Cc1cc(on1)CCCCCCCOc2ccc(cc2)C3=NCCO3   
1220  CCCC[C@H](CN(C=O)O)C(=O)N[C@H](C(=O)N(C)C)C(C)...   

                                 codeName  y_pred  clusterSize_y_pred  
1514        JamesWatt-JeanBaptisteLamarck    1390                   1  
161         DanielBernoulli-CharlesDarwin    1391                   1  
1220  Empedocles-CharlesAugustindeCoulomb    1392                   1  

[3 rows x 111 columns]

# the chemicals can be represented by a SMILES string
print(df.head().smiles)

# each compound has a codeName
# the codeNames are how we can refer to the compounds after the analysis (rather than by row number or SMILES)
print(df.head().codeName)

525           Cc1cc(nc(c1)N)COC[C@H](CN)OCc2cc(cc(n2)N)C
526         Cc1cc(nc(c1)N)COCC[C@@H](CN)OCc2cc(cc(n2)N)C
527    Cc1cc(nc(c1)N)COC[C@@H]([C@H](C)OCc2cc(cc(n2)N...
528          Cc1cc(nc(c1)N)COC[C@@H](CN)OCc2cc(cc(n2)N)C
415    CC(C)(C)NC(=O)[C@@H](c1ccccc1)NC(=O)N(C)Cc2ccc...
Name: smiles, dtype: object
525              JamesClerkMaxwell-ErnstMayr
526                      BillNye-FrankHornby
527            CharlesLyell-ErwinSchrodinger
528                Empedocles-GustavKirchoff
415    CharlesAugustindeCoulomb-FrancisCrick
Name: codeName, dtype: object

Cleaning the data

Real data is messy. Data sanitization involves
- removing features that didn't compute for all samples, or samples for which features didn't compute
- removing outliers that you suspect are artefacts or that will wildly bias the predictions that come from the data

The data provided has been filtered a bit, but be warned that this is an important part of the process and can take a long time.
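
A minimal sketch of these cleaning steps with pandas, on a made-up toy table. The column names, the values, and the outlier threshold of 10 median absolute deviations are illustrative assumptions, not the filtering that was applied to the workshop data.

```python
import numpy as np
import pandas as pd

# toy feature table (hypothetical values, not the workshop data)
toy = pd.DataFrame({
    'featA': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'featB': [0.5, np.nan, 0.7, 0.6, 0.4, 0.5],     # failed to compute for one sample
    'featC': [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],        # constant, carries no information
    'featD': [10.0, 11.0, 9.0, 10.0, 12.0, 500.0],  # last value looks like an artefact
})

# 1. drop features that did not compute for every sample
cleaned = toy.dropna(axis=1)

# 2. drop constant features (they would also cause a division by zero when range-scaling)
cleaned = cleaned.loc[:, cleaned.nunique() > 1]

# 3. drop samples with an extreme value in any remaining feature
deviation = (cleaned - cleaned.median()).abs()
mad = deviation.median()  # median absolute deviation per feature
is_outlier = (deviation > 10 * mad).any(axis=1)
cleaned = cleaned[~is_outlier]

print(cleaned)
```

Whether a flagged sample is really an artefact is a judgment call; it is worth looking at the flagged rows before dropping them.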

Normalizing the data

The features need to be treated equally. Just because the units change from grams to kilograms does not mean there is a real 1000x difference.
There are various ways to standardize data. You may have read about standard scores (z-scores). In the end each feature should be centred around the same value and span the same range.
The way this is done should preserve the variation within each feature. So remember your numerical methods computer science class and beware of subtractive cancellation errors and the like.

# just get features from data, remove labels
df_un = df.drop(columns=['codeName', 'smiles'])

# mean normalization: centre each feature on zero and divide by its range
import numpy as np
df_norm = (df_un - df_un.mean()) / (df_un.max() - df_un.min())
X = np.array(df_norm)

# each column of X now has mean zero and a range (max - min) of one, so every value lies between -1 and 1
# you can uncomment this to check
# print('mean', np.mean(X, 0))
# print('max', np.max(X, 0))
# print('min', np.min(X, 0))
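
To see what the normalization above does, here is a small check on made-up numbers (the toy columns are hypothetical). It also shows the standard-score alternative using scikit-learn's StandardScaler. That is a different scaling from the one used in this notebook and is included only for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# the same quantity in two units, plus an unrelated feature (hypothetical values)
toy = pd.DataFrame({
    'mass_g': [100.0, 250.0, 400.0, 1000.0],
    'mass_kg': [0.1, 0.25, 0.4, 1.0],
    'logp': [1.2, 3.4, -0.5, 2.0],
})

# mean normalization, as in the notebook: zero mean, range (max - min) of one
mean_norm = (toy - toy.mean()) / (toy.max() - toy.min())
print(mean_norm.mean().round(12))
print(mean_norm.max() - mean_norm.min())

# after scaling, the choice of units no longer matters
print(np.allclose(mean_norm['mass_g'], mean_norm['mass_kg']))

# standard scores: zero mean and unit standard deviation per feature
z = StandardScaler().fit_transform(toy)
print(z.mean(axis=0).round(12), z.std(axis=0))
```

Note that either scaling divides by zero for a feature that is constant across all samples, so such features need to be removed first.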

K-means clustering

Read these links
- https://en.wikipedia.org/wiki/K-means_clustering
- http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
K-means clustering comes up with labels for unlabelled data. It takes the data and a parameter (we call it k here) that fixes the number of clusters
Try out different values of k using the code below
The key line of code below is

y_pred = KMeans(n_clusters=k, random_state=random_state).fit_predict(X)

It takes the normalized data and assigns a cluster label to each sample, such that there are k clusters.
Properties of k
- k is an integer, since clusters are countable
- k is at least 1. This would be one big cluster
- k is at most the number of samples (the rows of X). This would treat every sample as its own cluster (a singleton)

# cluster by kmeans
from sklearn.cluster import KMeans
import random
random.seed(0)
k = int(random.uniform(1, len(X))) # set k without any prior knowledge... any number between 1 and the number of samples
print('k', k)
random_state = 0
y_pred = KMeans(n_clusters=k, random_state=random_state).fit_predict(X)
df['y_pred'] = y_pred # plot and analyze unnormalized data with labels

k 1393
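
Picking k at random, as above, is only a starting point. One common way to compare values of k is the silhouette score, where values closer to 1 mean tighter, better separated clusters. The sketch below runs on synthetic blobs from make_blobs rather than the workshop data, and both the blob centres and the candidate values of k are arbitrary assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for X: 300 points drawn around 4 well separated centres
centres = [[0, 0], [10, 0], [0, 10], [10, 10]]
X_toy, _ = make_blobs(n_samples=300, centers=centres, cluster_std=0.5, random_state=0)

# cluster with several candidate values of k and score each result
scores = {}
for k_try in [2, 3, 4, 5, 6, 8]:
    labels = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k_try] = silhouette_score(X_toy, labels)
    print(k_try, round(scores[k_try], 3))

# the candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
print('best k by silhouette:', best_k)
```

On real data the peak is rarely this clean, so treat the score as one clue among several rather than a final answer.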

Now that the clustering is done we can look at the sizes of the clusters. The function

np.histogram

returns two arrays: the counts (here, the number of clusters of each size) and the bin edges (here, the cluster sizes, plus one extra value that closes the last bin).
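
A small worked example on made-up cluster labels may help in reading the np.histogram output (the labels are hypothetical). The pandas method value_counts is shown alongside as another way to get the same table.

```python
import numpy as np
import pandas as pd

# hypothetical cluster labels for 7 samples
labels = pd.Series([0, 0, 0, 1, 1, 2, 3], name='y_pred')

# size of each cluster: cluster 0 has 3 members, cluster 1 has 2, clusters 2 and 3 have 1 each
sizes = labels.groupby(labels).size()

# bin edges at each observed size, plus one extra edge to close the last bin
bins = np.append(np.unique(sizes), sizes.max() + 1)
counts, edges = np.histogram(sizes, bins=bins)
print(counts)  # number of clusters of size 1, 2 and 3
print(edges)   # the bin edges

# the same information as a table: cluster size -> number of clusters of that size
print(sizes.value_counts().sort_index())
```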

# look at cluster size
sizes = df.groupby('y_pred').size()
# one bin per observed cluster size; the extra edge closes the last bin
bins = np.append(np.unique(sizes), sizes.max() + 1)
print(np.histogram(sizes, bins=bins))
# add in cluster size to df
df = pd.merge(df, pd.DataFrame({'clusterSize_y_pred': sizes}).reset_index(), on='y_pred')
print(df.tail())

(array([1161,  208,   23,    1]), array([1, 2, 3, 4, 5]))
       LabuteASA  MaxAbsEStateIndex  MaxAbsPartialCharge  MaxEStateIndex  \
1645  147.806545          14.525346             0.378511       14.525346   
1646  149.812648          12.574862             0.477880       12.574862   
1647  181.229439          14.001653             0.460949       14.001653   
1648  203.683877          11.743598             0.438042       11.743598   
1649  124.973421          11.101663             0.507823       11.101663   

      MaxPartialCharge  MinAbsEStateIndex  MinAbsPartialCharge  \
1645          0.154401           0.031220             0.154401   
1646          0.330899           0.050841             0.330899   
1647          0.258894           0.164579             0.258894   
1648          0.233112           0.053390             0.233112   
1649          0.230804           0.042424             0.230804   

      MinEStateIndex  MinPartialCharge  MolLogP         ...          \
1645       -0.910887         -0.378511  2.73290         ...           
1646       -1.397395         -0.477880  1.01617         ...           
1647       -0.569145         -0.460949  1.87380         ...           
1648       -0.053390         -0.438042  4.59410         ...           
1649       -1.291383         -0.507823  1.83460         ...           

      fr_term_acetylene  fr_tetrazole  fr_thiazole  fr_thiophene  \
1645                  0             0            0             0   
1646                  0             0            0             0   
1647                  0             0            0             0   
1648                  0             0            0             0   
1649                  0             0            0             0   

      fr_unbrch_alkane  fr_urea  \
1645                 0        0   
1646                 0        0   
1647                 0        0   
1648                 0        0   
1649                 0        0   

                                                 smiles  \
1645  Cn1cc(cn1)[C@H]2C[C@H]3CSC(=N[C@]3(CO2)c4ccc(c...   
1646  [H]/N=C/1\NC(=O)[C@]2(S1)C=C(C[C@H]([C@@H]2NC(...   
1647  c1cc(oc1)c2nc3nc(nc(n3n2)N)NCCN4CCN(CC4)c5ccc(...   
1648  CCC(=O)Nc1cccc(c1)Oc2c3cc[nH]c3nc(n2)Nc4ccc(cc...   
1649     c1cc2c(cc1O)OC[C@]3([C@@H]2Oc4c3cc5c(c4)OCO5)O   

                           codeName  y_pred  clusterSize_y_pred  
1645  MichaelFaraday-GalileoGalilei    1317                   1  
1646          CarlBosch-RobertHooke     404                   1  
1647      FrancisGalton-Anaximander     402                   1  
1648  BenjaminThompson-KonradLorenz    1332                   1  
1649    RobertKoch-AndreMarieAmpere     268                   1  

[5 rows x 111 columns]

# get top clusters
topClusters = df[['y_pred', 'clusterSize_y_pred']].drop_duplicates().sort_values(by='clusterSize_y_pred', ascending=False).head()
print(topClusters)

      y_pred  clusterSize_y_pred
525      142                   4
1103     199                   3
341       57                   3
36      1233                   3
546      218                   3

Sanity check... 2d chemical structures

Do the compounds in the same cluster look the same?
Use this web tool to check: https://cactus.nci.nih.gov/gifcreator/
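
To collect the structures to paste into the web tool, it helps to list the SMILES strings cluster by cluster. The sketch below uses a toy table with the same column names as df. The SMILES strings and code names are copied from the output shown earlier; the cluster labels are made up for illustration.

```python
import pandas as pd

# toy table with the same column names as df; the cluster labels are illustrative
toy = pd.DataFrame({
    'codeName': ['JamesClerkMaxwell-ErnstMayr', 'BillNye-FrankHornby',
                 'Empedocles-GustavKirchoff', 'DanielBernoulli-CharlesDarwin'],
    'smiles': ['Cc1cc(nc(c1)N)COC[C@H](CN)OCc2cc(cc(n2)N)C',
               'Cc1cc(nc(c1)N)COCC[C@@H](CN)OCc2cc(cc(n2)N)C',
               'Cc1cc(nc(c1)N)COC[C@@H](CN)OCc2cc(cc(n2)N)C',
               'Cc1cc(on1)CCCCCCCOc2ccc(cc2)C3=NCCO3'],
    'y_pred': [0, 0, 0, 1],
})

# visit the clusters from largest to smallest and print their members
cluster_order = toy['y_pred'].value_counts().index
for cluster in cluster_order:
    members = toy[toy['y_pred'] == cluster]
    print('cluster', cluster, 'size', len(members))
    for code_name, smiles in zip(members['codeName'], members['smiles']):
        print('   ', code_name, smiles)
```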

Submit your classes to Cyclica

Since we know the real clusters by another method, we can compare yours to ours.
Output your final list of y_pred classes with the codeNames and smiles, and we can go back and check whether they are the same as our classes.
The code below outputs a CSV file. Details of how to submit will be given at the workshop.

# sort data
df = df.sort_values(by=['clusterSize_y_pred', 'y_pred'], ascending=[False, True])

# output data
import time
timestr = time.strftime("%Y%m%d-%H%M%S")
initials='gw'
output = 'predictedClasses' + initials + timestr +'.csv'
df.to_csv(output, sep=',', index=False)
df.head(50)

	LabuteASA	MaxAbsEStateIndex	MaxAbsPartialCharge	MaxEStateIndex	MaxPartialCharge	MinAbsEStateIndex	MinAbsPartialCharge	MinEStateIndex	MinPartialCharge	MolLogP	...	fr_thiophene	fr_unbrch_alkane	fr_urea	smiles	codeName	y_pred	clusterSize_y_pred
525	141.729151	5.759574	0.383683	5.759574	0.123477	0.226210	0.123477	-0.226210	-0.383683	1.31854	...	0	0	0	Cc1cc(nc(c1)N)COC[C@H](CN)OCc2cc(cc(n2)N)C	JamesClerkMaxwell-ErnstMayr	142	4
526	148.094093	5.819164	0.383683	5.819164	0.123477	0.098079	0.123477	-0.098079	-0.383683	1.70864	...	0	1	0	Cc1cc(nc(c1)N)COCC[C@@H](CN)OCc2cc(cc(n2)N)C	BillNye-FrankHornby	142	4
527	148.094093	6.124918	0.383683	6.124918	0.123477	0.181412	0.123477	-0.259896	-0.383683	1.70704	...	0	0	0	Cc1cc(nc(c1)N)COC[C@@H]([C@H](C)OCc2cc(cc(n2)N...	CharlesLyell-ErwinSchrodinger	142	4
528	141.729151	5.759574	0.383683	5.759574	0.123477	0.226210	0.123477	-0.226210	-0.383683	1.31854	...	0	0	0	Cc1cc(nc(c1)N)COC[C@@H](CN)OCc2cc(cc(n2)N)C	Empedocles-GustavKirchoff	142	4
415	185.862478	12.947534	0.477530	12.947534	0.339488	0.007434	0.339488	-1.177893	-0.477530	2.91090	...	0	0	1	CC(C)(C)NC(=O)[C@@H](c1ccccc1)NC(=O)N(C)Cc2ccc...	CharlesAugustindeCoulomb-FrancisCrick	6	3
416	185.862478	12.911895	0.477530	12.911895	0.339488	0.003046	0.339488	-1.172572	-0.477530	2.91250	...	0	1	1	CCCCNC(=O)[C@H](c1ccccc1)NC(=O)N(C)Cc2ccc3c(c2...	WolfgangErnstPauli-Lucretius	6	3
417	201.824746	13.056169	0.477530	13.056169	0.339488	0.016616	0.339488	-1.181843	-0.477530	3.31260	...	0	0	1	CN(Cc1ccc2c(c1C(=O)O)OCO2)C(=O)N[C@@H](c3ccccc...	ErwinSchrodinger-EvangelistaTorricelli	6	3
151	112.519202	12.044467	0.504068	12.044467	0.200850	0.003845	0.200850	-0.738647	-0.504068	2.57680	...	0	0	0	c1ccc(cc1)C2=CC(=O)c3c(cc(c(c3O)O)O)O2	LinusPauling-IsaacNewton	36	3
152	123.997689	12.233506	0.507966	12.233506	0.203372	0.033681	0.203372	-0.472106	-0.507966	2.58540	...	0	0	0	COc1c(cc2c(c1O)C(=O)C=C(O2)c3ccc(cc3)O)O	CarlFriedrichGauss-BillNye	36	3
153	112.519202	12.350655	0.507822	12.350655	0.199995	0.009887	0.199995	-0.312312	-0.507822	2.57680	...	0	0	0	c1cc(c(cc1C2=COc3cc(ccc3C2=O)O)O)O	SigmundFreud-BenjaminFranklin	36	3
220	130.436067	14.155215	0.378494	14.155215	0.154894	0.034118	0.154894	-0.853547	-0.378494	3.16290	...	0	0	0	C[C@]1(C[C@H](SC(=N1)N)c2cncnc2)c3ccc(cc3F)F	LeonardodaVinci-JamesWatson	44	3
221	130.436067	14.314858	0.378512	14.314858	0.154285	0.254836	0.154285	-0.794143	-0.378512	3.08860	...	0	0	0	C[C@]1(CCSC(=N1)N)c2cc(c(cc2F)F)c3cncnc3	JeanBaptisteLamarck-ThomasKuhn	44	3
222	136.691989	14.247607	0.378494	14.247607	0.154895	0.047960	0.154895	-0.867130	-0.378494	3.97774	...	0	0	0	Cc1c(c(on1)C)[C@@H]2C[C@@](N=C(S2)N)(C)c3ccc(c...	CarolusLinnaeus-FranzBoas	44	3
341	194.858937	12.370274	0.488253	12.370274	0.488253	0.250590	0.423170	-1.478263	-0.423170	2.46130	...	0	0	0	B(c1ccccc1CN2CCN(CC2)C3=NC(=O)/C(=C/c4ccc(c(c4...	Lucretius-Avicenna	57	3
342	194.858937	12.356106	0.487918	12.356106	0.487918	0.236423	0.423177	-1.443879	-0.423177	2.46130	...	0	0	0	B(c1ccc(cc1)CN2CCN(CC2)C3=NC(=O)/C(=C/c4ccc(c(...	LouisdeBroglie-HenryMoseley	57	3
343	194.858937	12.362383	0.487928	12.362383	0.487928	0.242705	0.423177	-1.460291	-0.423177	2.46130	...	0	0	0	B(c1cccc(c1)CN2CCN(CC2)C3=NC(=O)/C(=C/c4ccc(c(...	FranzBoas-HermannvonHelmholtz	57	3
105	148.269255	12.380685	0.312156	12.380685	0.236417	0.064117	0.236417	-3.479281	-0.312156	3.31770	...	0	0	0	CCCN1c2ccc(cc2CCC1=O)NS(=O)(=O)Cc3ccccc3	MaxPlanck-JackHorner	62	3
106	154.634197	12.465613	0.312156	12.465613	0.236417	0.064672	0.236417	-3.492131	-0.312156	3.62612	...	0	0	0	CCCN1c2ccc(cc2CCC1=O)NS(=O)(=O)Cc3ccc(cc3)C	IsaacNewton-HeinrichHertz	62	3
107	141.904313	12.363117	0.315211	12.363117	0.236417	0.066713	0.236417	-3.481908	-0.315211	2.84592	...	0	0	0	Cc1ccc(cc1)CS(=O)(=O)Nc2ccc3c(c2)CCC(=O)N3C	Lucretius-LouisPasteur	62	3
693	217.369820	13.986122	0.443692	13.986122	0.407311	0.053205	0.407311	-4.004868	-0.443692	3.13200	...	0	0	0	CC(C)[C@H]1Cc2cc(ccc2S(=O)(=O)N(C1)C[C@H]([C@H...	FrancisCrick-AlbertEinstein	114	3
694	223.734762	14.081573	0.443692	14.081573	0.407311	0.047279	0.407311	-4.031627	-0.443692	3.52210	...	0	0	0	CC(C)(C)[C@@H]1Cc2cc(ccc2S(=O)(=O)N(C1)C[C@H](...	WilliamHarvey-AlessandroVolta	114	3
695	223.734762	14.081573	0.443692	14.081573	0.407311	0.047279	0.407311	-4.031627	-0.443692	3.52210	...	0	0	0	CC(C)(C)[C@H]1Cc2cc(ccc2S(=O)(=O)N(C1)C[C@H]([...	MarieCurie-AlexanderVonHumboldt	114	3
1000	161.901036	7.541884	0.485185	7.541884	0.150640	0.007064	0.150640	-0.263058	-0.485185	1.24734	...	0	0	0	c1cc(cc(c1)O[C@H]2CO[C@H]3[C@@H]2OC[C@H]3Oc4cc...	JamesWatt-JamesWatson	148	3
1001	161.901036	7.544344	0.485160	7.544344	0.150640	0.009842	0.150640	-0.273739	-0.485160	1.24734	...	0	0	0	c1cc(cc(c1)O[C@@H]2CO[C@H]3[C@@H]2OC[C@@H]3Oc4...	FriedrichAugustKekule-MichaelFaraday	148	3
1002	161.901036	7.439745	0.485185	7.439745	0.150639	0.019691	0.150639	-0.234382	-0.485185	1.24734	...	0	0	0	c1cc(ccc1C(=N)N)O[C@H]2CO[C@H]3[C@@H]2OC[C@H]3...	GalileoGalilei-HenryMoseley	148	3
11	190.631761	12.020077	0.312609	12.020077	0.258254	0.175468	0.258254	-0.175468	-0.312609	3.65440	...	0	0	0	c1cc(ccc1CC2CCN(CC2)CCc3cnn(c3)c4c5c(ccn4)C(=O...	WernerHeisenberg-BenjaminFranklin	171	3
12	184.266819	12.009169	0.312609	12.009169	0.258254	0.177748	0.258254	-0.177748	-0.312609	3.57930	...	0	0	0	c1cc(ccc1C2CCN(CC2)CCc3cnn(c3)c4c5c(ccn4)C(=O)...	ArthurEddington-PeterDebye	171	3
13	194.570085	12.020277	0.312609	12.020277	0.258254	0.185669	0.258254	-0.185669	-0.312609	4.23270	...	0	0	0	c1cnc(c2c1C(=O)NC=N2)n3cc(cn3)CCN4CCC(CC4)c5cc...	JagadishChandraBose-AlbertEinstein	171	3
217	138.697590	9.485220	0.507966	9.485220	0.115120	0.191250	0.115120	0.191250	-0.507966	4.92030	...	0	0	0	c1ccc(c(c1)N=C(c2ccc(cc2)O)c3ccc(cc3)O)Cl	CharlesAugustindeCoulomb-Anaximander	180	3
218	134.759266	9.506331	0.507966	9.506331	0.115120	0.217313	0.115120	0.217313	-0.507966	4.57532	...	0	0	0	Cc1ccccc1N=C(c2ccc(cc2)O)c3ccc(cc3)O	HenryMoseley-AmedeoAvogadro	180	3
219	128.394324	9.462216	0.507966	9.462216	0.115120	0.216220	0.115120	0.216220	-0.507966	4.26690	...	0	0	0	c1ccc(cc1)N=C(c2ccc(cc2)O)c3ccc(cc3)O	Euclid-Avicenna	180	3
1103	146.057666	10.279459	0.385467	10.279459	0.138992	0.341144	0.138992	-0.613237	-0.385467	4.00098	...	0	0	0	C[C@H](c1nc2cnc3c(c2n1C4CCC(CC4)CCC#N)cc[nH]3)O	Avicenna-JamesWatt	199	3
1104	133.327782	10.193187	0.385467	10.193187	0.138992	0.157389	0.138992	-0.634183	-0.385467	3.22078	...	0	0	0	C[C@H](c1nc2cnc3c(c2n1C4CCC(CC4)C#N)cc[nH]3)O	ArthurEddington-LinusPauling	199	3
1105	139.692724	10.240533	0.385467	10.240533	0.138992	0.314854	0.138992	-0.622015	-0.385467	3.61088	...	0	0	0	C[C@H](c1nc2cnc3c(c2n1C4CCC(CC4)CC#N)cc[nH]3)O	MarianoArtigas-RichardFeynman	199	3
546	148.927912	12.326558	0.393567	12.326558	0.236881	0.205146	0.236881	-1.125546	-0.393567	-1.72870	...	0	0	0	CCCC(C(=O)N[C@@H]1[C@@H]([C@H](O[C@H]1n2cnc3c2...	JamesClerkMaxwell-GottfriedLeibniz	218	3
547	148.927912	12.378086	0.393567	12.378086	0.237148	0.080400	0.237148	-1.134309	-0.393567	-1.87280	...	0	0	0	CC(C)[C@@H](C(=O)N[C@@H]1[C@@H]([C@H](O[C@H]1n...	AlbertEinstein-GottfriedLeibniz	218	3
548	155.292854	12.536974	0.393567	12.536974	0.237158	0.029474	0.237158	-1.134521	-0.393567	-1.48270	...	0	0	0	CC[C@H](C)[C@@H](C(=O)N[C@@H]1[C@@H]([C@H](O[C...	Anaximander-HenryCavendish	218	3
188	158.025073	11.988379	0.496768	11.988379	0.407501	0.019053	0.407501	-1.083643	-0.496768	2.51200	...	0	0	0	CC(C)NC(=O)O[C@@H]1CC[C@@](c2c1nnn2Cc3ccc(cc3)...	Avicenna-JohnDalton	219	3
189	158.025073	11.988379	0.496768	11.988379	0.407501	0.019053	0.407501	-1.083643	-0.496768	2.51200	...	0	0	0	CC(C)NC(=O)O[C@@H]1CC[C@](c2c1nnn2Cc3ccc(cc3)O...	LouisdeBroglie-FrancisGalton	219	3
190	158.025073	11.988379	0.496768	11.988379	0.407501	0.019053	0.407501	-1.083643	-0.496768	2.51200	...	0	0	0	CC(C)NC(=O)O[C@H]1CC[C@@](c2c1nnn2Cc3ccc(cc3)O...	AlexanderFleming-CarlSagan	219	3
831	137.462123	13.727927	0.352008	13.727927	0.323307	0.144989	0.323307	-0.291978	-0.352008	2.80530	...	0	0	0	C[C@@H](CC(=O)NCc1ccc2c(c1)NC(=O)N2)c3ccccc3F	LouisPasteur-ThomasKuhn	233	3
832	136.772520	13.606261	0.348248	13.606261	0.323307	0.268656	0.323307	-0.368969	-0.348248	2.71500	...	0	0	0	C/C(=C\c1ccccc1F)/C(=O)NCc2ccc3c(c2)NC(=O)N3	WernerHeisenberg-RobertHooke	233	3
833	143.827065	13.804605	0.349565	13.804605	0.323307	0.147304	0.323307	-0.293785	-0.349565	3.36630	...	0	0	0	C[C@@H](CC(=O)N[C@H](C)c1ccc2c(c1)NC(=O)N2)c3c...	AageBohr-EmilFischer	233	3
143	135.810779	12.106708	0.477639	12.106708	0.346775	0.073073	0.346775	-3.898899	-0.477639	1.66550	...	1	0	0	c1cc(ccc1CCNS(=O)(=O)c2ccsc2C(=O)O)C(=O)O	JaneGoodall-AlexanderVonHumboldt	258	3
144	142.175721	12.116470	0.477639	12.116470	0.346775	0.150715	0.346775	-3.863136	-0.477639	2.05560	...	1	1	0	c1cc(ccc1CCCNS(=O)(=O)c2ccsc2C(=O)O)C(=O)O	EvangelistaTorricelli-ClaudiusPtolemy	258	3
145	129.445837	12.105157	0.477639	12.105157	0.346775	0.072284	0.346775	-3.954005	-0.477639	1.62300	...	1	0	0	c1cc(ccc1CNS(=O)(=O)c2ccsc2C(=O)O)C(=O)O	AlessandroVolta-JohannesKepler	258	3
173	145.727128	11.780729	0.550172	11.780729	0.311615	0.003531	0.311615	-1.085590	-0.550172	0.87400	...	0	3	0	c1c(cc(c(c1[N+](=O)[O-])O)I)CC(=O)NCCCCCC(=O)[O-]	JohnvonNeumann-ReneDescartes	276	3
174	126.465290	11.681878	0.502092	11.681878	0.310480	0.017613	0.310480	-0.835222	-0.502092	1.60410	...	0	3	0	c1cc(c(cc1CC(=O)NCCCCCC(=O)O)[N+](=O)[O-])O	CarlSagan-WillardGibbs	276	3
175	126.465290	11.671878	0.550172	11.671878	0.310480	0.005082	0.310480	-1.085222	-0.550172	0.26940	...	0	3	0	c1cc(c(cc1CC(=O)NCCCCCC(=O)[O-])[N+](=O)[O-])O	FlorenceNightingale-ErnstHaeckel	276	3
210	169.244099	12.730152	0.496758	12.730152	0.259824	0.043983	0.259824	-0.381021	-0.496758	3.42917	...	0	0	0	[H]/N=C(\Cc1cccc(c1)OC)/NC(=O)c2ccc(cc2OC3CCNC...	FrancisGalton-FrancescoRedi	318	3

50 rows × 111 columns
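
How the submitted classes will be compared with the reference classes is not described here. As an illustration only, one common way to compare two sets of cluster labels is the adjusted Rand index from scikit-learn, which does not care how the clusters are numbered. The labels below are made up.

```python
from sklearn.metrics import adjusted_rand_score

# hypothetical reference classes and two hypothetical submissions for 8 compounds
reference = [0, 0, 0, 1, 1, 2, 2, 2]
same_grouping = [5, 5, 5, 9, 9, 3, 3, 3]   # identical grouping, different label numbers
other_grouping = [0, 1, 0, 1, 2, 2, 0, 1]  # mostly different grouping

# 1.0 means the two groupings are identical; values near 0 mean chance-level agreement
score_same = adjusted_rand_score(reference, same_grouping)
score_other = adjusted_rand_score(reference, other_grouping)
print(score_same)
print(score_other)
```

This is why the actual numbers in y_pred do not matter for a comparison like this, only which compounds end up grouped together.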

More ideas

In case you've gotten this far, you can explore
- plotting the distribution of features and comparing between y_pred classes
- dimensionality reduction with principal component analysis http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- feature selection: finding the features that describe the most variation between classes
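
As a starting point for the dimensionality reduction idea, here is a sketch of PCA with scikit-learn on synthetic data. The random matrix below is a stand-in; with the workshop data you would pass the normalized X instead.

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in for X: 200 samples, 10 features driven by 2 hidden factors plus small noise
rng = np.random.RandomState(0)
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_toy = hidden @ mixing + 0.05 * rng.normal(size=(200, 10))

# project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_toy)

print(X_2d.shape)
print(pca.explained_variance_ratio_)        # fraction of variance captured by each component
print(pca.explained_variance_ratio_.sum())  # close to 1 here because only 2 factors drive the data
```

The two columns of X_2d can then be drawn as a scatter plot coloured by y_pred to see whether the clusters separate.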
