This project applied data wrangling, data engineering, and reporting to a dataset from World Bank Demography and Census to identify the key factors influencing each country's GDP growth. The project was scoped to 13 countries in Asia:
- Bangladesh
- Bhutan
- China
- India
- Kazakhstan
- Kyrgyzstan
- Maldives
- Mongolia
- Myanmar
- Nepal
- Sri Lanka
- Tajikistan
- Thailand

The analysis used the following Python libraries and imports:
- pandas
- matplotlib.pyplot
- seaborn
- numpy
- math
- statsmodels.api
- pylab
- from statsmodels.stats import diagnostic
- from statsmodels.stats.outliers_influence import variance_inflation_factor
- from sklearn.linear_model import LinearRegression
- from sklearn.model_selection import train_test_split
- from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
- from scipy import stats

The analysis followed three stages:
- Data Wrangling
  - Find nulls by summing null counts per column, and spot missing values with correlation plots.
  - Replace nulls or missing values with suitable values, such as the mean, median, or mode of each column, or drop the rows that contain them.
  - Consult online sources for key information on how to handle nulls or missing values.
- Data Engineering
  - Run a correlation matrix to identify relationships among columns.
  - Plot the distribution of each attribute to find central tendency, skewness, and outliers in the dataset.
  - Run linear regression on the highly correlated variables influencing GDP growth.
- Reporting
  - Bar plots of highly correlated attributes influencing a country's GDP growth
  - Pie charts of industry shares of a country's GDP
  - Scatter plots of correlation and residuals between middle-class income and tax revenue gain
  - Bar charts showing the percentage of the population participating in each industry
  - Residual plots to reveal outliers
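
The data wrangling steps listed above could look roughly like the following minimal sketch. The DataFrame, its column names (`gdp_growth`, `tax_revenue`), and its values are hypothetical placeholders, not the project's actual World Bank data, and median imputation is only one of the options the list mentions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "country": ["Bhutan", "India", "Nepal", "Mongolia"],
    "gdp_growth": [3.1, np.nan, 4.5, 5.2],
    "tax_revenue": [12.0, 11.5, np.nan, 14.0],
})

# Count nulls per column (the "sum of nulls" check).
null_counts = df.isnull().sum()
print(null_counts)

# Option 1: fill numeric gaps with each column's median.
numeric_cols = df.select_dtypes(include="number").columns
filled = df.copy()
filled[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Option 2: drop every row that contains a missing value.
dropped = df.dropna()
```

Whether to impute or drop depends on how much data is missing; with only 13 countries, dropping rows may discard a large share of the sample.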
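
The correlation and regression steps listed above can be sketched in a similar way. This example uses synthetic data generated on the spot, so the feature names (`industry_share`, `tax_revenue`), the coefficients, and the 75/25 split are all assumptions for illustration rather than the project's actual variables or results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (hypothetical features).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "industry_share": rng.normal(30, 5, n),
    "tax_revenue": rng.normal(12, 2, n),
})
# Synthetic target: a linear mix of the features plus noise.
y = 0.2 * X["industry_share"] - 0.1 * X["tax_revenue"] + rng.normal(0, 0.5, n)
y = y.rename("gdp_growth")

# Correlation matrix across the features and the target.
corr = pd.concat([X, y], axis=1).corr()

# Fit a linear regression on a training split, evaluate on the held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mse = mean_squared_error(y_test, pred)

# Residuals; plotting these against predictions helps reveal outliers.
residuals = y_test - pred
```

The `residuals` series is what a residual plot would display, for example with matplotlib's `scatter(pred, residuals)`.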