SC1015 Mini-Project

Welcome to our SC1015 Data Science Mini-Project! This project uses the Global Energy Consumption & Renewable Generation dataset from Kaggle. Our presentation slides and video are also available. For a complete walkthrough of the project, refer to the following Jupyter notebooks:

EDA.ipynb
Machine Learning.ipynb
Data Driven Insights and Anomaly Detection.ipynb

Contributors

@Pratham117 - Machine learning models (linear and polynomial regression), Isolation Forest for anomaly detection, and data visualization in plotly and scikit-learn for polynomial regression and Isolation Forest.
@aarushi-nema - Exploratory data analysis, data cleaning, extraction.
@Bappe304 - Data Visualization in scikit-learn.

Introduction to the Problem

The question we are trying to answer with this project is: how long will it take before at least 50% of the world's total energy consumption can be met by renewable energy sources?

To answer this, we initially built a naive linear regression model that we later improved using polynomial regression. We also performed anomaly detection on the dataset to find the countries that consume the most energy overall (renewable + non-renewable).
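The naive linear approach can be sketched as follows. This is a minimal illustration, not the notebook's code: the `years` and `share` arrays are made-up numbers, and `year_reaching` is a hypothetical helper. It fits a straight line to the renewable share over time and solves for the year at which the line reaches 50%.

```python
import numpy as np

# Hypothetical data: renewable share of total consumption (%) by year.
# These values are invented for illustration, not taken from the dataset.
years = np.array([2000, 2005, 2010, 2015, 2020], dtype=float)
share = np.array([10.0, 14.0, 18.0, 22.0, 26.0])

# Ordinary least squares fit: share = slope * year + intercept.
slope, intercept = np.polyfit(years, share, deg=1)


def year_reaching(target_share):
    """Year at which the fitted line reaches target_share (in percent)."""
    if slope <= 0:
        raise ValueError("share is not increasing; target is never reached")
    return (target_share - intercept) / slope


crossing_year = year_reaching(50.0)
print(f"Fitted line reaches 50% around {crossing_year:.0f}")
```

A straight-line extrapolation like this is only as good as the assumption that the trend stays linear, which is why the project moves on to polynomial regression.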

Models Used

Linear Regression - To predict the year when 50% of global energy consumption can be met by renewables.
Polynomial Regression - To improve on the linear regression model above, because the relationship between the year and the production of renewable energy is non-linear.
Isolation Forest - To find which countries consume the most energy in total, and to use this information to make recommendations and gain data-driven insights into global energy trends.
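As a rough illustration of why a polynomial fit can beat a linear one on a curved trend, the sketch below compares the two in scikit-learn. The data is synthetic (a quadratic trend plus noise), and the degree of 2 is an assumption made for this example; the notebooks may use different features and a different degree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved trend standing in for renewable generation over time.
rng = np.random.default_rng(0)
t = np.arange(0, 20, dtype=float).reshape(-1, 1)  # years since start
y = 100 + 2.0 * t.ravel() ** 2 + rng.normal(0, 5, size=20)

# Straight-line model.
linear = LinearRegression().fit(t, y)

# Degree-2 polynomial model: expand t into [1, t, t^2], then fit linearly.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(t, y)

mse_linear = mean_squared_error(y, linear.predict(t))
mse_poly = mean_squared_error(y, poly.predict(t))
print(f"linear MSE: {mse_linear:.1f}, degree-2 MSE: {mse_poly:.1f}")
```

Both errors here are measured on the training data. Comparing them on a held-out test set instead is the more honest check, since a higher-degree polynomial can always match the training points at least as well as a line.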

Summary and Conclusion

Global energy trends vary by continent, country, and year.
Among renewable energy sources, there are large variations in the contribution of each source to the total amount of renewable energy generated worldwide.
In general, global renewable energy generation has increased over the last two decades or so.
Simple regression predicts that renewables will become the largest source of energy within the next 40 years.
Linear regression does not model non-linear relationships well (MSE > 150,000).
Polynomial regression reduced the MSE significantly and therefore provided a better model for the prediction.
Variance measures how sensitive a model is to the particular training set it was fit on. A high-variance model performs well on the train set, but that performance does not carry over to the test set.
Bias is the systematic error introduced by a model's simplifying assumptions: the difference between the actual value and the model's average prediction.
In general, as bias increases, variance decreases, and vice versa. This is called the bias-variance tradeoff and is key to improving ML models. Ideally, a model should have both low bias and low variance.
Isolation Forest is an anomaly detection model that builds an ensemble of random trees. Each tree recursively picks a feature at random and splits it at a random point between that feature's minimum and maximum values, until the points are isolated. Points that are isolated after only a few splits (a short average path length across the trees) are classified as outliers; the rest are inliers.
Isolation Forest flags seven countries as consuming an abnormally high amount of energy; these countries contribute the most to global energy consumption. Our recommendation is to ramp up renewable energy generation in these countries to meet their needs, which would in turn increase global renewable energy consumption.
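The anomaly detection step can be sketched with scikit-learn's `IsolationForest`. This is a minimal example under stated assumptions: the consumption values are randomly generated rather than taken from the dataset, and the `contamination` value is chosen to match the three extreme points planted in the toy data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical total consumption: 50 typical values plus 3 extreme ones.
typical = rng.normal(loc=100.0, scale=10.0, size=50)
extreme = np.array([900.0, 1000.0, 1100.0])
consumption = np.concatenate([typical, extreme]).reshape(-1, 1)

# contamination is the expected fraction of outliers (3 of 53 here).
model = IsolationForest(n_estimators=100, contamination=3 / 53, random_state=0)
labels = model.fit_predict(consumption)    # -1 = outlier, 1 = inlier
scores = model.score_samples(consumption)  # lower = more anomalous

outlier_idx = np.where(labels == -1)[0]
print("flagged rows:", outlier_idx)
```

On real data the fraction of outliers is not known in advance, so `contamination` is a modelling choice; the raw `scores` can be inspected directly to see how sharply the flagged rows stand out from the rest.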

Learning Outcomes

Linear and Polynomial regression in scikit-learn.
The concept of bias-variance tradeoff.
Data visualization techniques in plotly and scikit-learn.
The difference between polynomial interpolation and regression.
Anomaly detection using the Isolation Forest algorithm.
NumPy, pandas, and seaborn.
Using GitHub as a tool to collaborate.

References

https://betterprogramming.pub/anomaly-detection-with-isolation-forest-e41f1f55cc6
https://towardsdatascience.com/machine-learning-polynomial-regression-with-python-5328e4e8a386
https://towardsdatascience.com/polynomial-regression-with-scikit-learn-what-you-should-know-bed9d296f2
https://matplotlib.org/stable/plot_types/index
