This repository has been archived by the owner on Jul 15, 2022. It is now read-only.

这里啥都有 - competition submission #24

Open · wants to merge 1 commit into base: master
@@ -0,0 +1 @@
{"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"import os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"* This dataset is a combined (across years and states) and largely cleaned version of the Historical Daily Ambient Air Quality Data released by the Ministry of Environment and Forests and Central Pollution Control Board of India under the National Data Sharing and Accessibility Policy (NDSAP).\n* Goal: detect pollution trends"},{"metadata":{"_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","trusted":true},"cell_type":"code","source":"import pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport matplotlib.pyplot as plt\nimport geopandas as gpd\nimport geoplot","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df = pd.read_csv(\"/kaggle/input/india-air-quality-data/data.csv\", encoding = \"ISO-8859-1\")\ndf.head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"* **stn_code** : Station code. 
A code given to each station that recorded the data.\n* **sampling_date** : The date when the data was recorded.\n* **state** : The state whose air quality is measured.\n* **location** : The city whose air quality is measured.\n* **agency** : Name of the agency that measured the data.\n* **type** : The type of area where the measurement was made.\n* **so2** : The amount of Sulphur Dioxide measured.\n* **no2** : The amount of Nitrogen Dioxide measured.\n* **rspm** : Respirable Suspended Particulate Matter measured.\n* **spm** : Suspended Particulate Matter measured.\n* **location_monitoring_station** : The location of the monitoring station.\n* **pm2_5** : The value of fine particulate matter (PM2.5) measured.\n* **date** : The date of recording (a cleaner version of the ‘sampling_date’ feature)\n"},{"metadata":{"trusted":true},"cell_type":"code","source":"df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d') # date parse\ndf['year'] = df['date'].dt.year # year\ndf['year'] = df['year'].fillna(df[\"year\"].min())\ndf['year'] = df['year'].values.astype(int)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print(df.dtypes.value_counts()) # get_dtype_counts() was removed in pandas 1.0","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# EDA\n\n## Null Values"},{"metadata":{"trusted":true},"cell_type":"code","source":"def printNullValues(df):\n total = df.isnull().sum().sort_values(ascending = False)\n total = total[df.isnull().sum().sort_values(ascending = False) != 0]\n percent = total / len(df) * 100\n percent = percent[df.isnull().sum().sort_values(ascending = False) != 0]\n concat = pd.concat([total, percent], axis=1, keys=['Total','Percent'])\n print (concat)\n print ( 
\"-------------\")","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"printNullValues(df)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"* The agency name has nothing to do with how polluted a state is. \n* stn_code is also unnecessary.\n* date and sampling_date are redundant; keep date.\n* location_monitoring_station can be dropped as well."},{"metadata":{},"cell_type":"markdown","source":"## Type"},{"metadata":{"trusted":true},"cell_type":"code","source":"df[\"type\"].value_counts()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"\nsns.catplot(x = \"type\", kind = \"count\", data = df, height=5, aspect = 4)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Analysis by type and pollution means"},{"metadata":{"trusted":true},"cell_type":"code","source":"grp = df.groupby([\"type\"]).mean()[\"so2\"].to_frame()\ngrp.plot.bar(figsize = (20,10))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"grp = df.groupby([\"type\"]).mean()[\"no2\"].to_frame()\ngrp.plot.bar(figsize = (20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## SO2\n\nSulfur dioxide"},{"metadata":{"trusted":true},"cell_type":"code","source":"\ndf[['so2', 'state']].groupby(['state']).median().sort_values(\"so2\", ascending = False).plot.bar(figsize=(20,10))\n","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df[['so2','year','state']].groupby([\"year\"]).median().sort_values(by='year',ascending=False).plot(figsize=(20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## NO2\n\nNitrogen dioxide"},{"metadata":{"trusted":true},"cell_type":"code","source":"\ndf[['no2', 'state']].groupby(['state']).median().sort_values(\"no2\", ascending = 
False).plot.bar(figsize=(20,10))\n","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df[['no2','year','state']].groupby([\"year\"]).median().sort_values(by='year',ascending=False).plot(figsize=(20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## SPM\n\n Suspended Particulate Matter"},{"metadata":{"trusted":true},"cell_type":"code","source":"\ndf[['spm', 'state']].groupby(['state']).median().sort_values(\"spm\", ascending = False).plot.bar(figsize=(20,10))\n","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df[['spm','year','state']].groupby([\"year\"]).median().sort_values(by='year',ascending=False).plot(figsize=(20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## PIVOT tables"},{"metadata":{"trusted":true},"cell_type":"code","source":"fig, ax = plt.subplots(figsize=(20,10)) \nsns.heatmap(df.pivot_table('so2', index='state',columns=['year'],aggfunc='median',margins=True),ax = ax,annot=True, linewidths=.5)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"fig, ax = plt.subplots(figsize=(20,10)) \nsns.heatmap(df.pivot_table('no2', index='state',columns=['year'],aggfunc='median',margins=True),ax = ax,annot=True, linewidths=.5)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"fig, ax = plt.subplots(figsize=(20,10)) \nsns.heatmap(df.pivot_table('spm', index='state',columns=['year'],aggfunc='median',margins=True),ax = ax,annot=False, linewidths=.5)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Trends by regions"},{"metadata":{"trusted":true},"cell_type":"code","source":"temp = df.pivot_table('so2', index='year',columns=['state'],aggfunc='median',margins=True).reset_index()\ntemp = temp.drop(\"All\", axis = 1)\ntemp = 
temp.set_index(\"year\")\ntemp.plot(figsize=(20,10))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"temp = df.pivot_table('no2', index='year',columns=['state'],aggfunc='median',margins=True).reset_index()\ntemp = temp.drop(\"All\", axis = 1)\ntemp = temp.set_index(\"year\")\ntemp.plot(figsize=(20,10))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"temp = df.pivot_table('spm', index='year',columns=['state'],aggfunc='median',margins=True).reset_index()\ntemp = temp.drop(\"All\", axis = 1)\ntemp = temp.set_index(\"year\")\ntemp.plot(figsize=(20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"* It might make sense to fill the missing values by interpolating between the previous and next available values."},{"metadata":{},"cell_type":"markdown","source":"# Geoplotting"},{"metadata":{"trusted":true},"cell_type":"code","source":"india = gpd.read_file('/kaggle/input/maps-of-india/India_SHP/INDIA.shp')\nindia.info()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"india.plot()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"markdown","source":"* Match the names of states between the two datasets"},{"metadata":{"trusted":true},"cell_type":"code","source":"india[\"ST_NAME\"] = india[\"ST_NAME\"].apply(lambda x: x.lower())\n\nindia = india.set_index(\"ST_NAME\")\n\ndf[\"state\"] = df[\"state\"].apply(lambda x: x.lower())","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df_before_2000 = df[df[\"year\"] < 2000]\ndf_before_2000 = df_before_2000.groupby(\"state\").mean()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df_after_2000 = df[df[\"year\"] > 2000]\ndf_after_2000 = 
df_after_2000.groupby(\"state\").mean()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"result = pd.concat([df_before_2000, india], axis=1, sort=False)\nresult = result[result[\"geometry\"].notna()] # avoid comparing to None\nresult = result[result[\"year\"] > 0]\nfrom geopandas import GeoDataFrame\ncrs = \"EPSG:4326\" # the {'init': ...} dict form is deprecated\ngdf = GeoDataFrame(result, crs=crs, geometry=result[\"geometry\"])\ngdf['centroid'] = gdf.geometry.centroid\nfig,ax = plt.subplots(figsize=(20,10))\ngdf.plot(column='so2',ax=ax,alpha=0.4,edgecolor='black',cmap='cool', legend=True)\nplt.title(\"Mean SO2 before 2000\")\nplt.axis('off')\n\nfor x, y, label in zip(gdf.centroid.x, gdf.centroid.y, gdf.index):\n ax.annotate(label, xy=(x, y), xytext=(3,3), textcoords=\"offset points\",color='gray')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"result = pd.concat([df_after_2000, india], axis=1, sort=False)\nresult = result[result[\"geometry\"].notna()] # avoid comparing to None\nresult = result[result[\"year\"] > 0]\nfrom geopandas import GeoDataFrame\ncrs = \"EPSG:4326\" # the {'init': ...} dict form is deprecated\ngdf = GeoDataFrame(result, crs=crs, geometry=result[\"geometry\"])\ngdf['centroid'] = gdf.geometry.centroid\nfig,ax = plt.subplots(figsize=(20,10))\ngdf.plot(column='so2',ax=ax,alpha=0.4,edgecolor='black',cmap='cool', legend=True)\nplt.title(\"Mean SO2 after 2000\")\nplt.axis('off')\n\nfor x, y, label in zip(gdf.centroid.x, gdf.centroid.y, gdf.index):\n ax.annotate(label, xy=(x, y), xytext=(3,3), textcoords=\"offset points\",color='gray')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"* The two maps show the change in mean SO2 between the two periods."},{"metadata":{"trusted":true},"cell_type":"markdown","source":"# Time Series Analysis\n\n* so2 Sulfur dioxide"},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2 = df[[\"date\", \"so2\"]]\ndf_so2 = df_so2.set_index(\"date\")\ndf_so2 = 
df_so2.dropna()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample = df_so2.resample(rule = \"M\").mean().ffill()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample.plot(figsize = (20,10))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample[\"so2\"].resample(\"A\").mean().plot.bar(figsize = (20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### ETS Decomposition (Error, Trend, Seasonality)"},{"metadata":{},"cell_type":"markdown","source":"* Simple Moving Average"},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample.plot(figsize = (20,10))\ndf_so2_resample.rolling(window = 7).mean()[\"so2\"].plot(figsize = (20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"* Exponentially weighted moving average (EWMA):\napplies more weight to more recent values"},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample[\"EWMA-7\"] = df_so2_resample[\"so2\"].ewm(span=7).mean()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample.plot(figsize = (20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### ETS\n\n"},{"metadata":{"trusted":true},"cell_type":"code","source":"from statsmodels.tsa.seasonal import seasonal_decompose\nresult = seasonal_decompose(df_so2_resample[\"so2\"], model = \"multiplicative\") ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"fig = result.plot()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"markdown","source":"## ARIMA and Seasonal ARIMA\n\n#### Autoregressive Integrated Moving Averages\n\n* https://people.duke.edu/~rnau/411arim3.htm ARIMA explained\n\n* Make the time series data 
stationary\n* Plot the Correlation and AutoCorrelation Charts\n* Construct the ARIMA Model\n* Use the model to make predictions\n\n#### Testing the Stationarity\n\nBasically, we are trying to decide whether to accept the Null Hypothesis **H0** (that the time series has a unit root, indicating it is non-stationary) or reject **H0** and go with the Alternative Hypothesis (that the time series has no unit root and is stationary).\n\nWe decide this based on the returned p-value.\n\n* A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.\n\n* A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.\n\n"},{"metadata":{"trusted":true},"cell_type":"code","source":"from statsmodels.tsa.stattools import adfuller\nresult = adfuller(df_so2_resample[\"so2\"])\nprint('Augmented Dickey-Fuller Test:')\nlabels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations Used']\n\nfor value,label in zip(result,labels):\n print(label+' : '+str(value) )\n \nif result[1] <= 0.05:\n print(\"strong evidence against the null hypothesis, reject the null hypothesis. 
Data has no unit root and is stationary\")\nelse:\n print(\"weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary \")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"* the data is seasonal ---> use Seasonal ARIMA"},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample[\"so2_first_diff\"] = df_so2_resample[\"so2\"] - df_so2_resample[\"so2\"].shift(7)\n# CHECK\nresult = adfuller(df_so2_resample[\"so2_first_diff\"].dropna() )\nprint('Augmented Dickey-Fuller Test:')\nlabels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations Used']\n\nfor value,label in zip(result,labels):\n print(label+' : '+str(value) )\n \nif result[1] <= 0.05:\n print(\"strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary\")\nelse:\n print(\"weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary \")","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample[\"so2_first_diff\"].plot(figsize = (20,10))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample[\"so2_second_diff\"] = df_so2_resample[\"so2_first_diff\"] - df_so2_resample[\"so2_first_diff\"].shift(7)\ndf_so2_resample[\"so2_second_diff\"].plot(figsize = (20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Using the Seasonal ARIMA model"},{"metadata":{"trusted":true},"cell_type":"code","source":"import statsmodels.api as sm\n\nmodel = sm.tsa.statespace.SARIMAX(df_so2_resample[\"so2\"],order=(0,1,0), seasonal_order=(1,1,1,48))\nresults = 
model.fit()\nprint(results.summary())\nresults.resid.plot()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"results.resid.plot(kind='kde')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Check with known data "},{"metadata":{"trusted":true},"cell_type":"code","source":"df_so2_resample['forecast'] = results.predict(start = 250, end= 400, dynamic= True) \ndf_so2_resample[['so2','forecast']].plot(figsize=(20,10))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Forecast\n\n* so2"},{"metadata":{"trusted":true},"cell_type":"code","source":"from pandas.tseries.offsets import DateOffset\nfuture_dates = [df_so2_resample.index[-1] + DateOffset(months=x) for x in range(0,24) ]\nfuture_dates_df = pd.DataFrame(index=future_dates[1:],columns=df_so2_resample.columns)\nfuture_df = pd.concat([df_so2_resample,future_dates_df])\nfuture_df['forecast2'] = results.predict(start = 348, end = 540, dynamic= True) \nfuture_df[['so2', 'forecast2']].plot(figsize=(20, 10)) ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":1}
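The "Trends by regions" cells in the notebook pivot with `margins=True` (which appends an aggregate labeled `All`) and then drop the `All` column before plotting. A minimal sketch of that pattern on a hypothetical toy frame (the states and values below are made up):

```python
import pandas as pd

# Hypothetical mini version of the air-quality frame.
df = pd.DataFrame({
    "state": ["a", "a", "b", "b", "a", "b"],
    "year":  [1999, 2000, 1999, 2000, 2000, 1999],
    "so2":   [5.0, 6.0, 9.0, 10.0, 7.0, 11.0],
})

temp = df.pivot_table("so2", index="year", columns="state",
                      aggfunc="median", margins=True).reset_index()
print("All" in temp.columns)   # True: margins adds an aggregate column

# Same as the notebook: drop the aggregate column, re-index by year.
temp = temp.drop("All", axis=1).set_index("year")
print(list(temp.columns))      # ['a', 'b']
```

Note that dropping only the column leaves the aggregate `All` *row* in the index, which is also what the notebook's plots end up showing.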
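The smoothing comparison in the notebook (7-period rolling mean vs. `ewm(span=7)`) can be illustrated on a toy series; the numbers below are made up and stand in for the resampled so2 values:

```python
import pandas as pd

# Made-up stand-in for df_so2_resample["so2"].
s = pd.Series([4.0, 5.0, 6.0, 5.5, 7.0, 8.0, 9.0, 8.5, 7.5, 7.0, 6.5, 6.0])

sma = s.rolling(window=7).mean()   # undefined until a full 7-value window exists
ewma = s.ewm(span=7).mean()        # defined from the very first observation

print(sma.isna().sum())   # 6: the first six positions have no full window
print(ewma.iloc[0])       # 4.0: with the default adjust=True, starts at s.iloc[0]
```

This is why the EWMA curve in the notebook tracks the series from the start while the rolling-mean curve begins seven periods in.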
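The manual seasonal difference the notebook builds before the second ADF test, `so2 - so2.shift(7)`, is the same operation as pandas' `Series.diff(7)`; a minimal check on made-up numbers:

```python
import pandas as pd

# Made-up series standing in for the resampled so2 values.
s = pd.Series([3.0, 4.0, 5.0, 4.5, 6.0, 7.0, 6.5, 8.0, 9.0, 8.5, 7.0, 6.0])

manual = s - s.shift(7)   # the notebook's so2_first_diff construction
built_in = s.diff(7)      # pandas' equivalent

print(manual.equals(built_in))  # True
print(manual.isna().sum())      # 7: the first 7 lagged values do not exist
```

Either form works; the leading NaNs are why the notebook calls `.dropna()` before re-running `adfuller` on the differenced series.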
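The final forecast cell extends the index with `DateOffset(months=x)` and concatenates an empty frame so `results.predict` has out-of-sample dates to land on. The index-extension step in isolation, on a hypothetical three-month index:

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Hypothetical short monthly index standing in for df_so2_resample.
idx = pd.to_datetime(["2015-10-31", "2015-11-30", "2015-12-31"])
df = pd.DataFrame({"so2": [5.0, 6.0, 7.0]}, index=idx)

# Same construction as the forecast cell: 23 new month-offset timestamps
# (range starts at 0, and element 0 duplicates the last date, so it is skipped).
future_dates = [df.index[-1] + DateOffset(months=x) for x in range(0, 24)]
future_df = pd.DataFrame(index=future_dates[1:], columns=df.columns)
extended = pd.concat([df, future_df])

print(len(extended))        # 26: 3 observed rows + 23 empty future rows
print(extended.index[-1])   # 2017-11-30 00:00:00
```

`DateOffset(months=...)` clamps day-of-month where needed (Dec 31 + 23 months gives Nov 30), so the future stamps stay at month ends like the resampled data.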
4,147 changes: 4,147 additions & 0 deletions 5 环境污染的预测/5 环境污染的预测--这里啥都有/explore.ipynb

Large diffs are not rendered by default.
