GitHub - alinarekena/Data-Science-project-2019

.ipynb_checkpoints
Acc_M1.png
Acc_M2.png
Acc_M3.png
ExternalID.csv
Features selection.ipynb
GROUP_M5_report.pdf
IDS2019_HW06.ipynb
M5.pptx
Model1-new.png
Model2(2)-new.png
Model2-new.png
Model3-new.png
New_Train_SVM_Naive.ipynb
New_Validation_LightGBM.ipynb
New_Validation_SVM_Naive.ipynb
Old_Train_LighGBM.ipynb
Old_Train_SVM_Naive.ipynb
Old_Validation_LighGBM.ipynb
Old_Validation_SVM_Naive.ipynb
PROJECT_Bioconcentration_factor_06_RF_models_FINALoob_coments_added.ipynb
Plan
README
Results_NEW_v1 (1).ipynb
Results_OLD_v1 (1).ipynb
Results_OLD_v3_xgboost.ipynb
SVM - Gaussian Kernel.ipynb
SVM - Linear.ipynb
SVM - Polynomial Kernel.ipynb
SVM - Sigmoid Kernel.ipynb
Train-Split_SVM_-_Sigmoid_Kernel_Bioconcentration.ipynb
Train-split_SVM_-_Gaussian_Kernel_Bioconcentration.ipynb
Train-split_SVM_-_Linear_Bioconcentration.ipynb
Train-split_SVM_-_Polynomial_Kernel_Bioconcentration.ipynb
TrainID.csv
XGBoost_Train.ipynb
XGBoost_Validation.ipynb
merged_Descs_sorted.csv

The original models were built in R, and we estimated how model performance changes when a different programming language and different libraries are used. The notebook "PROJECT_Bioconcentration_factor_06_RF_models_FINALoob_coments_added" contains the models replicated from the original publication.

To reproduce the models presented in the Results section, run the notebooks "XGBoost_Validation", "Old_Validation_SVM_Naive" (which contains both the Linear SVM and the Gaussian Naive Bayes models) and "Old_Validation_LighGBM". The graphical representation was produced by transferring the resulting numbers to the "Results_OLD_v3_xgboost" notebook.

The final parameter sets were chosen after running "XGBoost_Train", "Old_Train_SVM_Naive" and "Old_Train_LighGBM" with different model parameters, set either manually or in loops. Linear SVM was chosen after trying Gaussian, Linear, Polynomial and Sigmoid kernels (notebooks "SVM - Gaussian Kernel", "SVM - Linear", "SVM - Polynomial Kernel" and "SVM - Sigmoid Kernel" respectively). The other types of SVM were either not effective or took too much time to run.
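
As an illustration of the kernel comparison described above, here is a minimal sketch using scikit-learn. It is not the code from the notebooks: the synthetic data, the assumption of a 3-class classification task with 3 features, the helper name compare_kernels and the use of 5-fold cross-validation are all assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption: 3 features, 3 classes).
X, y = make_classification(
    n_samples=300,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)


def compare_kernels(X, y, kernels=("linear", "rbf", "poly", "sigmoid"), cv=5):
    """Return the mean cross-validated accuracy for each SVM kernel.

    Hypothetical helper: "rbf" is scikit-learn's name for the Gaussian kernel.
    """
    scores = {}
    for kernel in kernels:
        # Scale features first; SVMs are sensitive to feature magnitudes.
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
        scores[kernel] = cross_val_score(
            model, X, y, cv=cv, scoring="accuracy"
        ).mean()
    return scores


kernel_scores = compare_kernels(X, y)
for kernel, score in sorted(
    kernel_scores.items(), key=lambda item: item[1], reverse=True
):
    print(f"{kernel:8s} mean CV accuracy = {score:.3f}")
```

On real data the ranking of kernels and the run time can differ considerably from this toy example, so the numbers printed here say nothing about the project's results.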

All of these models were run using the 3 features selected in the paper. We also performed our own feature selection, which can be found in the "Features selection" notebook. The algorithm is based on random shuffling and splitting of the dataset, so it can give a slightly different result every time.
Notebooks marked with "New" used the features selected by our algorithm as input, but their performance was insufficient, so they were not included in the final results.
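
To show why a selection procedure based on random shuffling and splitting can vary between runs, here is a minimal sketch. It is not the algorithm from the "Features selection" notebook: the random-forest importance ranking, the voting scheme, the synthetic data and the helper name select_features are assumptions chosen only for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit

# Synthetic stand-in for the descriptor table (assumption: 10 candidate features).
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)


def select_features(X, y, n_keep=3, n_splits=10, seed=None):
    """Pick the n_keep features most often ranked on top across random splits.

    Hypothetical helper: with seed=None every call shuffles differently,
    so the selected set may change from run to run.
    """
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=seed)
    votes = Counter()
    for train_idx, _ in splitter.split(X):
        forest = RandomForestClassifier(n_estimators=50, random_state=seed)
        forest.fit(X[train_idx], y[train_idx])
        # Indices of the n_keep most important features on this split.
        top = np.argsort(forest.feature_importances_)[::-1][:n_keep]
        votes.update(int(index) for index in top)
    return sorted(index for index, _ in votes.most_common(n_keep))


# Fixing the seed makes the selection reproducible.
selected_a = select_features(X, y, seed=42)
selected_b = select_features(X, y, seed=42)
print("selected feature indices:", selected_a)
```

Passing a fixed seed, as in the last lines, is one way to make such a procedure repeatable; leaving it unset reproduces the run-to-run variation mentioned above.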