This GitHub repository contains the R code, written analysis, and datasets for these analytics projects. I uploaded the project reports as PDF files because I think they best show the important aspects of each project and what I learned from it. The R code and dataset alone would lack much of that context.
This project examines a dataset in which twins report each other's hourly wage and level of education. I built a linear regression model to predict hourly wage from level of education and checked the model's validity.
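A minimal sketch of this kind of analysis is shown here, using simulated data rather than the twins dataset. The column names `education` and `wage`, the simulated coefficients, and the choice of a Shapiro-Wilk test as the validity check are all assumptions made for illustration; the report may use different checks.

```r
# Simulated stand-in data; names and coefficients are assumptions,
# not values taken from the twins dataset.
set.seed(1)
n <- 200
education <- round(runif(n, min = 8, max = 20))   # years of schooling
wage <- 2 + 1.1 * education + rnorm(n, sd = 3)    # hourly wage
twins <- data.frame(education, wage)

# Fit hourly wage as a linear function of education.
fit <- lm(wage ~ education, data = twins)
print(summary(fit)$coefficients)

# Two basic validity checks: residual normality and R-squared.
res <- residuals(fit)
print(shapiro.test(res))
print(summary(fit)$r.squared)

# Predicted wage for a hypothetical person with 16 years of education.
pred <- predict(fit, newdata = data.frame(education = 16))
print(pred)
```

Calling `plot(fit)` on the fitted model would additionally draw the standard residual diagnostic plots.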
I used a linear regression model with multiple predictor variables to predict a patient's expected medical expenses from their information (e.g., age, BMI, smoking status, and number of children). I also ran a correlation study to see how the variables correlate with one another.
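The two steps described above can be sketched roughly as follows. The data are simulated, and the column names (`age`, `bmi`, `children`, `smoker`, `expenses`) and effect sizes are hypothetical stand-ins, not the project's actual dataset.

```r
# Simulated stand-in data; all names and effect sizes are assumptions.
set.seed(2)
n <- 300
patients <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 5),
  children = sample(0:4, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE,
                           prob = c(0.8, 0.2)))
)
patients$expenses <- 2000 + 250 * patients$age + 300 * patients$bmi +
  500 * patients$children + 20000 * (patients$smoker == "yes") +
  rnorm(n, sd = 4000)

# Multiple linear regression: expenses explained by all four predictors.
fit <- lm(expenses ~ age + bmi + children + smoker, data = patients)
print(summary(fit)$coefficients)

# Pairwise correlations among the numeric columns.
num_cols <- sapply(patients, is.numeric)
cor_matrix <- cor(patients[, num_cols])
print(round(cor_matrix, 2))
```

The factor `smoker` is excluded from the correlation matrix because `cor()` only accepts numeric input; in the regression, R encodes it automatically as a `smokeryes` indicator.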
I used a logistic regression model to classify whether a tumor is benign or malignant. The dataset has many tumor features, so I used hypothesis testing and ANOVA to select the significant variables for the model. This process improved the model's accuracy, which I measured with a confusion matrix comparing the predicted values to the actual values.
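One way this workflow could look is sketched here on simulated data. The feature names (`radius`, `texture`, `noise`) are hypothetical, and comparing nested models with `anova(..., test = "Chisq")` is only one possible form of the variable selection described above; the report may have used a different procedure.

```r
# Simulated stand-in data; feature names and effects are assumptions.
set.seed(3)
n <- 400
tumors <- data.frame(
  radius  = rnorm(n, mean = 14, sd = 3),
  texture = rnorm(n, mean = 19, sd = 4),
  noise   = rnorm(n)   # deliberately unrelated to the outcome
)
p <- plogis(-13 + 0.7 * tumors$radius + 0.15 * tumors$texture)
tumors$malignant <- rbinom(n, size = 1, prob = p)

# Full model versus a reduced model without the suspect variable.
full    <- glm(malignant ~ radius + texture + noise,
               data = tumors, family = binomial)
reduced <- glm(malignant ~ radius + texture,
               data = tumors, family = binomial)

# Analysis-of-deviance test: does dropping `noise` hurt the fit?
print(anova(reduced, full, test = "Chisq"))

# Confusion matrix and accuracy (on the training data, for brevity).
pred_class <- as.integer(predict(reduced, type = "response") > 0.5)
conf_mat <- table(
  Predicted = factor(pred_class, levels = 0:1),
  Actual    = factor(tumors$malignant, levels = 0:1)
)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Accuracy here is the share of cases on the diagonal of the confusion matrix. Evaluating on held-out data instead of the training data would give a less optimistic estimate.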
I performed a time series analysis of the monthly atmospheric CO2 concentrations measured at an observatory from 1974 to 1987. I examined the trend and seasonal patterns in the data, and compared different forecasting methods to see which produced the most accurate forecasts.
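A rough sketch of such an analysis follows. It uses R's built-in `co2` series (monthly Mauna Loa measurements) cut to 1974 through 1987 as a stand-in; whether this matches the project's data is an assumption. The two forecasting methods shown, Holt-Winters and a seasonal naive forecast, are illustrative picks and not necessarily the ones compared in the report.

```r
# Stand-in data: built-in monthly CO2 series, restricted to 1974-1987.
co2_sub <- window(co2, start = c(1974, 1), end = c(1987, 12))

# Split the series into trend, seasonal, and remainder components.
parts <- decompose(co2_sub)

# Hold out the final two years to evaluate the forecasts.
train <- window(co2_sub, end = c(1985, 12))
test  <- window(co2_sub, start = c(1986, 1))
h <- length(test)

# Method 1: Holt-Winters exponential smoothing.
hw_fc <- predict(HoltWinters(train), n.ahead = h)

# Method 2: seasonal naive forecast, repeating the last observed year.
last_year <- window(train, start = c(1985, 1))
snaive_fc <- rep(as.numeric(last_year), length.out = h)

# Root mean squared error on the held-out months.
rmse <- function(actual, forecast) {
  sqrt(mean((as.numeric(actual) - as.numeric(forecast))^2))
}
errors <- c(
  holt_winters   = rmse(test, hw_fc),
  seasonal_naive = rmse(test, snaive_fc)
)
print(errors)
```

The method with the lower RMSE on the held-out period is the better forecaster by this measure. Calling `plot(parts)` shows the decomposed trend and seasonal components.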
I applied several machine learning models to a bank telemarketing dataset. The goal was to determine whether someone is a potential client for the bank. The classification models used were Logistic Regression, K-NN, Decision Trees, and Naive Bayes. I compared their effectiveness using confusion matrices and their percentage accuracy. Along the way, I also had to preprocess the data so that it would work with these models.
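The preprocessing and comparison steps might look something like the sketch below. It uses simulated data with hypothetical columns (`age`, `duration`, `marital`, `subscribed`) and covers only two of the four model types, logistic regression and a hand-written k-NN, so that it runs in base R without extra packages. None of it is taken from the project's code.

```r
# Simulated stand-in data; all column names and effects are assumptions.
set.seed(4)
n <- 600
bank <- data.frame(
  age      = sample(20:70, n, replace = TRUE),
  duration = rexp(n, rate = 1 / 250),   # call length in seconds
  marital  = factor(sample(c("single", "married", "divorced"), n,
                           replace = TRUE))
)
p <- plogis(-2.5 + 0.008 * bank$duration + 0.5 * (bank$marital == "single"))
bank$subscribed <- rbinom(n, size = 1, prob = p)

# Preprocessing: one-hot encode the factor and scale every feature.
# (For brevity the scaling uses all rows; a stricter workflow would
# compute it on the training split only.)
X <- scale(model.matrix(~ age + duration + marital, data = bank)[, -1])
y <- bank$subscribed

# 70/30 train/test split.
train_idx <- sample(n, size = round(0.7 * n))
X_train <- X[train_idx, ];  y_train <- y[train_idx]
X_test  <- X[-train_idx, ]; y_test  <- y[-train_idx]

# Model 1: logistic regression.
logit <- glm(y ~ ., data = data.frame(X_train, y = y_train),
             family = binomial)
logit_prob <- predict(logit, newdata = data.frame(X_test), type = "response")
logit_pred <- as.integer(logit_prob > 0.5)

# Model 2: k-nearest neighbours with Euclidean distance, majority vote.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  apply(new_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))
    nearest <- order(d)[seq_len(k)]
    as.integer(mean(train_y[nearest]) > 0.5)
  })
}
knn_pred <- knn_predict(X_train, y_train, X_test, k = 5)

# Compare the models by test-set accuracy and confusion matrix.
accuracy <- function(pred, actual) mean(pred == actual)
results <- c(
  logistic = accuracy(logit_pred, y_test),
  knn      = accuracy(knn_pred, y_test)
)
print(round(100 * results, 1))
print(table(Predicted = knn_pred, Actual = y_test))
```

Scaling matters mostly for k-NN, since its distance calculation would otherwise be dominated by the feature with the largest numeric range.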