Welcome to our SC1015 Data Science Mini-Project! This project uses the Global Energy Consumption & Renewable Generation dataset from Kaggle. Our presentation slides and video accompany the project. For a complete walkthrough, refer to the Jupyter notebooks in this repository.
Each team member's contributions are listed below:
- @Pratham117 - Machine learning models (linear and polynomial regression), Isolation Forest for anomaly detection, data visualization in plotly and scikit-learn for polynomial regression and Isolation Forest.
- @aarushi-nema - Exploratory data analysis, data cleaning, extraction.
- @Bappe304 - Data Visualization in scikit-learn.
The question we are trying to answer with this project is: how long will it take before at least 50% of the world's total energy consumption can be met by renewable energy sources?
To answer this, we initially built a naive linear regression model that we later improved using polynomial regression. We also performed anomaly detection on the dataset to find the countries that consume the most energy overall (renewable + non-renewable).
- Linear Regression - To predict the year when 50% of global energy consumption can be met by renewables.
- Polynomial Regression - To improve on the linear regression model, because the relationship between the year and the production of renewable energy is non-linear.
- Isolation Forest - To find which countries consume the most energy in total, and to use this information to make recommendations and gain data-driven insights into global energy trends.
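
The first two models can be sketched roughly as follows. This is a minimal illustration, not the code from our notebooks: the data is synthetic (a made-up quadratic trend in renewable share), and the helper `first_year_reaching` and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (NOT the Kaggle dataset): renewable share in percent
# follows a gentle quadratic curve plus a little noise.
rng = np.random.default_rng(0)
years = np.arange(2000, 2021)
x = (years - 2000).reshape(-1, 1)  # years since 2000 keeps the features small
share = 10 + 0.2 * x.ravel() + 0.03 * x.ravel() ** 2
share = share + rng.normal(0, 0.3, size=len(years))

# Fit a straight line and a degree-2 polynomial to the same points.
linear_model = LinearRegression().fit(x, share)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, share)

linear_mse = mean_squared_error(share, linear_model.predict(x))
poly_mse = mean_squared_error(share, poly_model.predict(x))


def first_year_reaching(model, target=50.0, horizon=200):
    """Return the first year whose predicted share meets the target, or None."""
    future = np.arange(0, horizon).reshape(-1, 1)
    predicted = model.predict(future)
    hits = np.flatnonzero(predicted >= target)
    return int(2000 + hits[0]) if hits.size else None


print(f"linear MSE: {linear_mse:.3f}, polynomial MSE: {poly_mse:.3f}")
print("linear crossing year:", first_year_reaching(linear_model))
print("polynomial crossing year:", first_year_reaching(poly_model))
```

On curved data like this, the polynomial fit has the lower MSE and the two models disagree on the crossing year, which is why the choice of model matters for the 50% question.
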
- Global energy trends vary by continent, country, and year.
- Among renewable energy sources, there are large variations in the contribution of each source to the total amount of renewable energy generated worldwide.
- In general, global renewable energy generation has increased over the last two decades or so.
- Simple regression predicts that renewables will become the largest source of energy within the next 40 years.
- Linear regression does not yield a good model for non-linear relationships: our linear model had an MSE above 150,000.
- Polynomial regression substantially reduced the MSE and therefore provided a better model for the prediction.
- Variance measures how sensitive a model is to the particular training set it was fitted on. A high-variance model performs well on the train set but fails to replicate that performance on the test set.
- Bias is the systematic error a model makes because its assumptions are too simple, i.e., the difference between its average prediction and the actual value.
- In general, as bias increases, variance decreases, and vice versa. This is called the bias-variance tradeoff and is key to improving ML models. Ideally, a model should have both low bias and low variance.
- Isolation Forest is an anomaly detection model that builds an ensemble of random trees. Each tree recursively picks a feature at random and splits it at a random point between the minimum and maximum values of that feature, until the points are isolated. Points are classified as outliers or inliers based on where they end up in the trees: outliers tend to be isolated after only a few splits, so they sit close to the root.
- Isolation Forest flags seven countries as consuming an abnormally high amount of energy; these countries contribute the most to global energy consumption. Our recommendation is to ramp up renewable energy generation in these countries so that more of their demand is met by renewables, which would raise global renewable energy consumption.
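
The Isolation Forest step described above could look something like the sketch below. It uses synthetic numbers rather than our dataset: the 50 "countries", their consumption values, and the `contamination` setting are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (NOT the project's dataset): total energy consumption
# for 50 hypothetical countries, three of which consume far more than the rest.
rng = np.random.default_rng(42)
typical = rng.normal(loc=100.0, scale=15.0, size=47)
heavy = np.array([900.0, 1200.0, 1500.0])
consumption = np.concatenate([typical, heavy]).reshape(-1, 1)

# contamination is the assumed fraction of outliers; 3 of 50 here by construction.
forest = IsolationForest(n_estimators=200, contamination=3 / 50, random_state=0)
labels = forest.fit_predict(consumption)    # -1 marks outliers, 1 marks inliers
scores = forest.score_samples(consumption)  # lower score = more anomalous

outlier_idx = np.flatnonzero(labels == -1)
print("flagged rows:", outlier_idx)
print("flagged values:", consumption[outlier_idx].ravel())
```

In practice the `contamination` value is a judgment call: it fixes how many points get flagged, so it should reflect how many anomalies one expects rather than be tuned until a preferred answer appears.
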
- Linear and Polynomial regression in scikit-learn.
- The concept of bias-variance tradeoff.
- Data visualization techniques in plotly and scikit-learn.
- The difference between polynomial interpolation and regression.
- Anomaly detection using the Isolation Forest algorithm.
- NumPy, pandas, and seaborn.
- Using GitHub as a tool to collaborate.
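
Two of the ideas above, the bias-variance tradeoff and the difference between polynomial interpolation and regression, can be illustrated together. The sketch below is an assumption-laden toy example on synthetic data (a made-up noisy cubic), not taken from our notebooks: a degree-1 fit underfits (high bias), a degree-3 fit matches the trend, and a degree-11 polynomial through 12 points interpolates the training data exactly (high variance).

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic data (an assumption for illustration): a noisy cubic trend.
rng = np.random.default_rng(1)


def true_curve(x):
    return 2 * x**3 - x


x_train = np.linspace(-1.0, 1.0, 12)
y_train = true_curve(x_train) + rng.normal(0, 0.1, size=x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = true_curve(x_test) + rng.normal(0, 0.1, size=x_test.size)


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


# With 12 points, a degree-11 polynomial passes through every training point:
# that is interpolation. Lower degrees only approximate the points: regression.
interp_degree = len(x_train) - 1

results = {}
for degree in (1, 3, interp_degree):
    fit = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(y_train, fit(x_train)), mse(y_test, fit(x_test)))
    train_mse, test_mse = results[degree]
    print(f"degree {degree:2d}: train MSE = {train_mse:.2e}, test MSE = {test_mse:.2e}")
```

The interpolating polynomial drives the training error to essentially zero by fitting the noise, so a low training MSE alone does not indicate a good model; the test MSE is the number to compare.
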
https://betterprogramming.pub/anomaly-detection-with-isolation-forest-e41f1f55cc6
https://towardsdatascience.com/machine-learning-polynomial-regression-with-python-5328e4e8a386
https://towardsdatascience.com/polynomial-regression-with-scikit-learn-what-you-should-know-bed9d296f2
https://matplotlib.org/stable/plot_types/index