I worked on an Early Diabetes Detection project, a machine learning project that predicts the likelihood that a person has diabetes from their symptoms and other factors. I used a dataset collected from direct questionnaires given to patients at the Sylhet Diabetes Hospital in Sylhet, Bangladesh, to train several machine learning algorithms to predict the presence of diabetes.
The goal of this project is to help detect diabetes early so that preventative care can reach those who need it. I prepared the data and engineered relevant features to enable effective modeling. I then applied different machine learning algorithms to the data, including logistic regression, random forest, SVM, KNN, and Gaussian naive Bayes. I evaluated the performance of these models using accuracy, precision, recall, and F1-score.
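As a rough illustration of the evaluation metrics mentioned above, the sketch below computes accuracy, precision, recall, and F1-score from true and predicted labels in plain Python. The function name `classification_metrics` and the sample labels are hypothetical and not taken from the project; a library such as scikit-learn offers equivalent functions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = diabetic, 0 = non-diabetic.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

For a screening task like this one, recall on the positive class is usually worth watching closely, since a false negative means a diabetic patient is missed.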
The steps include:
- EDA (Manually)
  - Dealing with missing data
  - Distribution of different attributes
- Automated EDA using sweetviz and autoviz
- Dataset Preprocessing
  - Changing target values into numerical values
  - Label encoding
  - Calculating Correlation between features
  - Feature Selection
  - Splitting into Train & Test
  - Data Normalization
- k-Fold cross-validation
- Model Building
  - Logistic Regression
  - Random Forest
  - SVM
  - KNN
  - Gaussian NB
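Two of the steps above, data normalization and k-fold cross-validation, interact: the scaling parameters should be learned from the training folds only, so that no information from the held-out fold leaks into training. The sketch below shows one way to do this in plain Python. The helper names `k_fold_indices` and `fit_min_max`, the choice of min-max scaling, and the sample ages are all assumptions for illustration, not the project's actual code.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_min_max(values):
    """Learn min-max scaling parameters from the given (training) values."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return lambda v: (v - lo) / span


# Hypothetical ages; a real run would use the Age column of the dataset.
ages = [25, 31, 38, 42, 47, 50, 53, 58, 60, 65]
splits = list(k_fold_indices(len(ages), k=5))

for train_idx, test_idx in splits:
    # Fit the scaler on the training fold only, then apply it to both parts.
    scale = fit_min_max([ages[i] for i in train_idx])
    train_scaled = [scale(ages[i]) for i in train_idx]  # values lie in [0, 1]
    test_scaled = [scale(ages[i]) for i in test_idx]    # may fall outside [0, 1]
```

In practice a library such as scikit-learn provides `KFold` and `MinMaxScaler` for the same purpose; the hand-written version is only meant to make the order of operations explicit.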
Dataset link: https://www.kaggle.com/datasets/ishandutta/early-stage-diabetes-risk-prediction-dataset?datasetId=886508&sortBy=dateRun&tab=profile
The dataset contains information collected from direct questionnaires given to patients at the Sylhet Diabetes Hospital in Sylhet, Bangladesh, and approved by a doctor. The attributes are:
- Age (between 20 and 65)
- Sex (1 = Male, 2 = Female)
- Polyuria (1 = Yes, 2 = No)
- Polydipsia (1 = Yes, 2 = No)
- Sudden weight loss (1 = Yes, 2 = No)
- Weakness (1 = Yes, 2 = No)
- Polyphagia (1 = Yes, 2 = No)
- Genital thrush (1 = Yes, 2 = No)
- Visual blurring (1 = Yes, 2 = No)
- Itching (1 = Yes, 2 = No)
- Irritability (1 = Yes, 2 = No)
- Delayed healing (1 = Yes, 2 = No)
- Partial paresis (1 = Yes, 2 = No)
- Muscle stiffness (1 = Yes, 2 = No)
- Alopecia (1 = Yes, 2 = No)
- Obesity (1 = Yes, 2 = No)
- Class (1 = Positive, 2 = Negative)
The dataset is useful for predicting whether a patient has diabetes based on their symptoms and other factors.
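Most classifiers are easier to work with when binary attributes are stored as 0/1 indicators rather than the 1/2 codes listed above. The sketch below shows one possible recoding, assuming each record is a dictionary keyed by the attribute names from the list and holding those numeric codes. The function `recode_record` and the sample patient are hypothetical; if the downloaded file stores text labels instead of numeric codes, the mapping would need to be adjusted.

```python
# Binary attributes from the list above, in the same order (Age is numeric).
BINARY_ATTRIBUTES = [
    "Sex", "Polyuria", "Polydipsia", "Sudden weight loss", "Weakness",
    "Polyphagia", "Genital thrush", "Visual blurring", "Itching",
    "Irritability", "Delayed healing", "Partial paresis",
    "Muscle stiffness", "Alopecia", "Obesity",
]


def recode_record(record):
    """Convert the documented 1/2 codes to 1/0 indicators; Age is kept as-is."""
    features = {"Age": record["Age"]}
    for name in BINARY_ATTRIBUTES + ["Class"]:
        if record[name] not in (1, 2):
            raise ValueError(f"{name}: expected code 1 or 2, got {record[name]!r}")
    for name in BINARY_ATTRIBUTES:
        # Code 1 (Yes, or Male for Sex) becomes 1; code 2 becomes 0.
        features[name] = 1 if record[name] == 1 else 0
    # Class: 1 = Positive becomes 1, 2 = Negative becomes 0.
    label = 1 if record["Class"] == 1 else 0
    return features, label


# Hypothetical patient: every symptom coded 2 (No) except those overridden below.
patient = {name: 2 for name in BINARY_ATTRIBUTES}
patient.update({"Age": 45, "Sex": 1, "Polyuria": 1, "Polydipsia": 1, "Class": 1})
features, label = recode_record(patient)
```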