diff --git a/docs/constituency_methodology.ipynb b/docs/constituency_methodology.ipynb index 4beac59..d9f68d5 100644 --- a/docs/constituency_methodology.ipynb +++ b/docs/constituency_methodology.ipynb @@ -11,6 +11,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "## Introduction\n", + "\n", + "Changes to UK policy on taxes, benefits, or public spending affect places and people differently. PolicyEngine UK builds tools to analyze incomes, jobs, and population patterns in each constituency. This documentation explains how we create a microsimulation model that works at the constituency level. The system combines workplace surveys of jobs and earnings, HMRC tax records, and population statistics. We map data between 2010 and 2024 constituency boundaries, estimate income distributions, and optimize geographic weights.\n", + "\n", + "This guide shows how to use PolicyEngine UK for constituency analysis, with examples and code to implement these methods. Users can measure changes in household budgets, track employment, and understand economic patterns across constituencies. The document starts with data collection from workplace surveys, tax records, and population counts, then explains how we convert this data into usable forms through income brackets and boundary mapping. It concludes with technical details about accuracy measurement and calibration, plus example code for analysis and visualization." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data\n", + "\n", "### Earning and jobs data\n", "\n", "Data is extracted from NOMIS Annual Survey of Hours and Earnings (ASHE) - workplace analysis dataset, containing number of jobs and earnings percentiles for all UK parliamentary constituencies from [this website](https://www.nomisweb.co.uk/datasets/ashe). 
This dataset is stored as `nomis_earning_jobs_data.xlsx`. To download the data, follow the variable selection process shown in the image below:\n", @@ -813,10 +826,228 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Convert earning percentiles to brackets\n", + "## Preprocessing\n", "\n", - "The next step is to convert earning percentiles to earning brackets. To do this, the distribution of earnings needs to be estimated. Based on the ratio of different percentile incomes from [this government statistics report](https://www.gov.uk/government/statistics/percentile-points-from-1-to-99-for-total-income-before-and-after-tax#:~:text=Details,in%20the%20Background%20Quality%20Report), earnings for percentiles from 90 to 99 for each constituency are estimated. Also, the earnings distribution starts from 0. The following image shows the earnings distribution for an example constituency.\n", + "### Convert earning percentiles to brackets\n", "\n", + "The next step is to convert earning percentiles to earning brackets. To do this, the distribution of earnings needs to be estimated. Based on the ratio of different percentile incomes from [this government statistics report](https://www.gov.uk/government/statistics/percentile-points-from-1-to-99-for-total-income-before-and-after-tax#:~:text=Details,in%20the%20Background%20Quality%20Report), earnings for percentiles from 90 to 99 for each constituency are estimated. The following code and image show the earnings distribution for an example constituency." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{code-block} python\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Sample data for Darlington\n", + "income_data = {\n", + " 'parliamentary constituency 2010': ['Darlington'],\n", + " 'constituency_code': ['E14000658'],\n", + " 'Number of jobs': ['31000'],\n", + " '10 percentile': [13298.0],\n", + " '20 percentile': [16723.0],\n", + " '30 percentile': [20778.0],\n", + " '40 percentile': [23407.0],\n", + " '50 percentile': [27158.0],\n", + " '60 percentile': [30471.0],\n", + " '70 percentile': [33812.0],\n", + " '80 percentile': [40717.0],\n", + " '90 percentile': [55762.0],\n", + " '91 percentile': [58878.0],\n", + " '92 percentile': [62394.4],\n", + " '93 percentile': [66722.3],\n", + " '94 percentile': [71952.0],\n", + " '95 percentile': [78804.5],\n", + " '96 percentile': [87640.7],\n", + " '97 percentile': [100083.5],\n", + " '98 percentile': [123526.5],\n", + " '100 percentile': [179429.0]\n", + "}\n", + "\n", + "income_sample = pd.DataFrame(income_data)\n", + "\n", + "# Excel Data Method\n", + "def load_real_data():\n", + " # Read Excel data\n", + " income_real = pd.read_excel(\"nomis_earning_jobs_data.xlsx\", skiprows=7)\n", + " income_real.columns = income_real.iloc[0]\n", + " income_real = income_real.drop(index=0).reset_index(drop=True)\n", + " \n", + " # Select and rename columns\n", + " columns_to_keep = [\n", + " 'parliamentary constituency 2010',\n", + " 'constituency_code',\n", + " 'Number of jobs',\n", + " 'Median',\n", + " '10 percentile',\n", + " '20 percentile',\n", + " '30 percentile',\n", + " '40 percentile',\n", + " '60 percentile',\n", + " '70 percentile',\n", + " '80 percentile',\n", + " '90 percentile'\n", + " ]\n", + " income_real = income_real[columns_to_keep]\n", + " income_real = income_real.rename(columns={'Median': '50 percentile'})\n", + " return income_real\n", + "\n", + "# Plotting function\n", 
+ "def plot_constituency_distribution(income_df, constituency_name, detailed=True):\n", + " constituency_data = income_df[income_df['parliamentary constituency 2010'] == constituency_name].iloc[0]\n", + " \n", + " percentiles = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 91, 92, 93, 94, 95, 96, 97, 98, 100]\n", + " income_values = [\n", + " 0,\n", + " constituency_data['10 percentile'],\n", + " constituency_data['20 percentile'],\n", + " constituency_data['30 percentile'],\n", + " constituency_data['40 percentile'],\n", + " constituency_data['50 percentile'],\n", + " constituency_data['60 percentile'],\n", + " constituency_data['70 percentile'],\n", + " constituency_data['80 percentile'],\n", + " constituency_data['90 percentile'],\n", + " constituency_data['91 percentile'],\n", + " constituency_data['92 percentile'],\n", + " constituency_data['93 percentile'],\n", + " constituency_data['94 percentile'],\n", + " constituency_data['95 percentile'],\n", + " constituency_data['96 percentile'],\n", + " constituency_data['97 percentile'],\n", + " constituency_data['98 percentile'],\n", + " constituency_data['100 percentile']\n", + " ]\n", + " \n", + " valid_data = [(p, v) for p, v in zip(percentiles, income_values) if pd.notna(v)]\n", + " filtered_percentiles, filtered_income = zip(*valid_data)\n", + " \n", + " plt.figure(figsize=(8, 6))\n", + " plt.plot(filtered_percentiles, filtered_income, marker='o')\n", + " plt.xlabel('Percentiles')\n", + " plt.ylabel('Income')\n", + " plt.title(f'Income Distribution for {constituency_name}')\n", + " plt.grid(True)\n", + " plt.show()\n", + "\n", + "# Plot sample data (Darlington with detailed percentiles)\n", + "plot_constituency_distribution(income_sample, 'Darlington', detailed=True) \n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Sample data for 
Darlington\n", + "income_data = {\n", + " 'parliamentary constituency 2010': ['Darlington'],\n", + " 'constituency_code': ['E14000658'],\n", + " 'Number of jobs': ['31000'],\n", + " '10 percentile': [13298.0],\n", + " '20 percentile': [16723.0],\n", + " '30 percentile': [20778.0],\n", + " '40 percentile': [23407.0],\n", + " '50 percentile': [27158.0],\n", + " '60 percentile': [30471.0],\n", + " '70 percentile': [33812.0],\n", + " '80 percentile': [40717.0],\n", + " '90 percentile': [55762.0],\n", + " '91 percentile': [58878.0],\n", + " '92 percentile': [62394.4],\n", + " '93 percentile': [66722.3],\n", + " '94 percentile': [71952.0],\n", + " '95 percentile': [78804.5],\n", + " '96 percentile': [87640.7],\n", + " '97 percentile': [100083.5],\n", + " '98 percentile': [123526.5],\n", + " '100 percentile': [179429.0]\n", + "}\n", + "\n", + "income_sample = pd.DataFrame(income_data)\n", + "\n", + "# Excel Data Method\n", + "def load_real_data():\n", + " # Read Excel data\n", + " income_real = pd.read_excel(\"nomis_earning_jobs_data.xlsx\", skiprows=7)\n", + " income_real.columns = income_real.iloc[0]\n", + " income_real = income_real.drop(index=0).reset_index(drop=True)\n", + " \n", + " # Select and rename columns\n", + " columns_to_keep = [\n", + " 'parliamentary constituency 2010',\n", + " 'constituency_code',\n", + " 'Number of jobs',\n", + " 'Median',\n", + " '10 percentile',\n", + " '20 percentile',\n", + " '30 percentile',\n", + " '40 percentile',\n", + " '60 percentile',\n", + " '70 percentile',\n", + " '80 percentile',\n", + " '90 percentile'\n", + " ]\n", + " income_real = income_real[columns_to_keep]\n", + " income_real = income_real.rename(columns={'Median': '50 percentile'})\n", + " return income_real\n", + "\n", + "# Plotting function\n", + "def plot_constituency_distribution(income_df, constituency_name, detailed=True):\n", + " constituency_data = income_df[income_df['parliamentary constituency 2010'] == constituency_name].iloc[0]\n", + " \n", + " 
percentiles = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 91, 92, 93, 94, 95, 96, 97, 98, 100]\n", + " income_values = [\n", + " 0,\n", + " constituency_data['10 percentile'],\n", + " constituency_data['20 percentile'],\n", + " constituency_data['30 percentile'],\n", + " constituency_data['40 percentile'],\n", + " constituency_data['50 percentile'],\n", + " constituency_data['60 percentile'],\n", + " constituency_data['70 percentile'],\n", + " constituency_data['80 percentile'],\n", + " constituency_data['90 percentile'],\n", + " constituency_data['91 percentile'],\n", + " constituency_data['92 percentile'],\n", + " constituency_data['93 percentile'],\n", + " constituency_data['94 percentile'],\n", + " constituency_data['95 percentile'],\n", + " constituency_data['96 percentile'],\n", + " constituency_data['97 percentile'],\n", + " constituency_data['98 percentile'],\n", + " constituency_data['100 percentile']\n", + " ]\n", + " \n", + " valid_data = [(p, v) for p, v in zip(percentiles, income_values) if pd.notna(v)]\n", + " filtered_percentiles, filtered_income = zip(*valid_data)\n", + " \n", + " plt.figure(figsize=(8, 6))\n", + " plt.plot(filtered_percentiles, filtered_income, marker='o')\n", + " plt.xlabel('Percentiles')\n", + " plt.ylabel('Income')\n", + " plt.title(f'Income distribution for {constituency_name}')\n", + " plt.grid(True)\n", + " plt.savefig(\"pictures/earning_dist.png\", dpi=300, bbox_inches='tight')\n", + " plt.close()\n", + "\n", + "# Plot sample data (Darlington with detailed percentiles)\n", + "plot_constituency_distribution(income_sample, 'Darlington', detailed=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "![](pictures/earning_dist.png)\n", "\n", "After estimating the full earnings distribution, the data is converted into income brackets. For each constituency and income bracket, the number of jobs and total earnings are calculated based on the estimated earnings distribution. 
For constituencies with missing data, the earnings distribution pattern is estimated using data from constituencies with similar total number of taxpayers and total income levels. The Python script `create_employment_incomes.py` generates `employment_income.csv` containing number of jobs (`employment_income_count`) and total earnings (`employment_income_amount`) for each constituency and income bracket. The following table shows employment and income across different brackets for constituencies:" @@ -1744,27 +1975,29 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "## Methodology\n", + "\n", "### Loss function\n", "\n", "The file `loss.py` defines a function `create_constituency_target_matrix` that creates target matrices for comparing simulated data against actual constituency-level data. \n", "\n", - "- The function takes three main input parameters: dataset (defaults to `enhanced_frs_2022_23`), time_period (defaults to 2025), and an optional reform parameter for policy changes.\n", + "1. The function takes three main input parameters: dataset (defaults to `enhanced_frs_2022_23`), time_period (defaults to 2025), and an optional reform parameter for policy changes.\n", "\n", - "- It reads three files containing real data: `age.csv`, `total_income.csv`, and `employment_income.csv`.\n", + "2. It reads three files containing real data: `age.csv`, `total_income.csv`, and `employment_income.csv`.\n", "\n", - "- It creates a PolicyEngine Microsimulation object using the specified dataset and reform parameters.\n", + "3. It creates a PolicyEngine Microsimulation object using the specified dataset and reform parameters.\n", "\n", - "- The function creates two main matrices: `matrix` for simulated values from PolicyEngine, and `y` for actual target values from HMRC data.\n", + "4. 
The function creates two main matrices: `matrix` for simulated values from PolicyEngine, and `y` for actual target values from HMRC data.\n", "\n", - "- It calculates total income metrics by computing both the total amounts and counts of people with income in each constituency.\n", + "5. It calculates total income metrics by computing both the total amounts and counts of people with income in each constituency.\n", "\n", - "- It processes age distributions by creating 10-year age bands from 0 to 80, calculating how many people fall into each band.\n", + "6. It processes age distributions by creating 10-year age bands from 0 to 80, calculating how many people fall into each band.\n", "\n", - "- For employment income, it processes both counts and amounts for different income bands between £12,570 and £70,000, excluding people under 16.\n", + "7. For employment income, it processes both counts and amounts for different income bands between £12,570 and £70,000, excluding people under 16.\n", "\n", - "- The `sim.map_result()` function is used throughout to map individual-level results to household level.\n", + "8. The `sim.map_result()` function is used throughout to map individual-level results to household level.\n", "\n", - "- The function returns both the simulated matrix and the target matrix `(matrix, y)` which can be used for comparing the simulation results against actual data." + "9. The function returns both the simulated matrix and the target matrix `(matrix, y)` which can be used for comparing the simulation results against actual data." ] }, { @@ -1775,30 +2008,30 @@ "\n", "The file `calibrate.py` defines a main `calibrate()` function that performs weight calibration for constituency-level analysis.\n", "\n", - "* It imports necessary functions and matrices from other files including `create_constituency_target_matrix`, `create_national_target_matrix` from `loss.py`, and `transform_2010_to_2024` for constituency boundary transformations.\n", + "1. 
It imports necessary functions and matrices from other files including `create_constituency_target_matrix`, `create_national_target_matrix` from `loss.py`, and `transform_2010_to_2024` for constituency boundary transformations.\n", "\n", - "* Sets up initial matrices using the `create_constituency_target_matrix` and `create_national_target_matrix` functions for both constituency and national level data.\n", + "2. Sets up initial matrices using the `create_constituency_target_matrix` and `create_national_target_matrix` functions for both constituency and national level data.\n", "\n", - "* Creates a Microsimulation object using the `enhanced_frs_2022_23` dataset.\n", + "3. Creates a Microsimulation object using the `enhanced_frs_2022_23` dataset.\n", "\n", - "* Initializes weights for 650 constituencies x 100180 households, starting with the log of household weights divided by constituency count.\n", + "4. Initializes weights for 650 constituencies x 100180 households, starting with the log of household weights divided by constituency count.\n", "\n", - "* Converts all the matrices and weights into PyTorch tensors to enable optimization.\n", + "5. Converts all the matrices and weights into PyTorch tensors to enable optimization.\n", "\n", - "* Defines a loss function that calculates and combines both constituency-level and national-level mean squared errors into a single loss value.\n", + "6. Defines a loss function that calculates and combines both constituency-level and national-level mean squared errors into a single loss value.\n", "\n", - "* Uses Adam optimizer with a learning rate of 0.1 to minimize the loss over 512 epochs.\n", + "7. Uses Adam optimizer with a learning rate of 0.1 to minimize the loss over 512 epochs.\n", "\n", - "* Every 100 epochs during optimization, it updates the weights using the mapping matrix from 2010 to 2024 constituencies and saves the current weights to a `weights.h5` file.\n", + "8. 
Every 100 epochs during optimization, it updates the weights using the mapping matrix from 2010 to 2024 constituencies and saves the current weights to a `weights.h5` file.\n", "\n", - "* Includes an `update_weights()` function that applies the constituency mapping matrix to transform the weights between different boundary definitions." + "9. Includes an `update_weights()` function that applies the constituency mapping matrix to transform the weights between different boundary definitions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Example code for constituency analysis\n", + "## Example\n", "\n", "The following code demonstrates how to analyze and visualize median earnings across UK parliamentary constituencies using PolicyEngine:" ] @@ -1809,43 +2042,8 @@ "source": [ "```{code-block} python\n", "# Import required libraries\n", - "from policyengine_uk import Microsimulation\n", "from policyengine.utils.charts import *\n", "from policyengine import Simulation\n", - "import plotly.graph_objects as go\n", - "import pandas as pd\n", - "import numpy as np\n", - "import h5py\n", - "import os\n", - "\n", - "# Initialize baseline microsimulation for UK\n", - "baseline = Microsimulation()\n", - "# Calculate household net income for 2024\n", - "baseline_income = baseline.calculate(\"real_household_net_income\", period=2024)\n", - "# Calculate number of people per household for 2024\n", - "baseline_people = baseline.calculate(\"people\", map_to = \"household\", period=2024)\n", - "\n", - "# Load constituency-level age distribution data\n", - "age_df = pd.read_csv('../policyengine_uk_data/datasets/frs/local_areas/constituencies/targets/age.csv')\n", - "# Load pre-calculated weights for geographic distribution\n", - "with h5py.File(\"../policyengine_uk_data/datasets/frs/local_areas/constituencies/targets/weights.h5\", \"r\") as f:\n", - " weights = f[\"weight\"][:]\n", - "\n", - "# Calculate weighted income for 2024 using matrix multiplication\n", - 
"income_2024 = np.dot(weights, baseline_income.values)\n", - "# Calculate weighted population for 2024\n", - "population_2024 = np.dot(weights, baseline_people.values) \n", - "# Calculate per capita income\n", - "per_capita_2024 = income_2024 / population_2024\n", - "\n", - "# Create DataFrame with constituency data\n", - "df = pd.DataFrame({\n", - " 'code': age_df['code'],\n", - " 'name': age_df['name'],\n", - " 'income': per_capita_2024\n", - "})\n", - "# Rename code column to match expected format\n", - "df = df.rename(columns={'code': 'LAD24CD'})\n", "\n", "# Initialize simulation for visualization\n", "sim = Simulation(\n", @@ -1903,7 +2101,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 1, "metadata": {}, "outputs": [ { @@ -1921,40 +2119,8 @@ } ], "source": [ - "from policyengine_uk import Microsimulation\n", "from policyengine.utils.charts import *\n", "from policyengine import Simulation\n", - "import plotly.graph_objects as go\n", - "import pandas as pd\n", - "import numpy as np\n", - "import h5py\n", - "import os\n", - "\n", - "os.environ[\"HUGGING_FACE_TOKEN\"] = \"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\"\n", - "\n", - "baseline = Microsimulation()\n", - "# reformed = Microsimulation(reform=reform)\n", - "baseline_income = baseline.calculate(\"real_household_net_income\", period=2024)\n", - "# reformed_income = reformed.calculate(\"real_household_net_income\", period=2029)\n", - "baseline_people = baseline.calculate(\"people\", map_to = \"household\", period=2024)\n", - "# reformed_people = baseline.calculate(\"people\", map_to = \"household\", period=2029)\n", - "\n", - "# hex_locations = pd.read_csv('hex_map_LA.csv')\n", - "age_df = pd.read_csv('../policyengine_uk_data/datasets/frs/local_areas/constituencies/targets/age.csv')\n", - "with h5py.File(\"../policyengine_uk_data/datasets/frs/local_areas/constituencies/targets/weights.h5\", \"r\") as f:\n", - " weights = f[\"weight\"][:]\n", - "\n", - "\n", - "income_2024 = 
np.dot(weights, baseline_income.values)\n", - "population_2024 = np.dot(weights, baseline_people.values) \n", - "per_capita_2024 = income_2024 / population_2024\n", - "\n", - "df = pd.DataFrame({\n", - " 'code': age_df['code'],\n", - " 'name': age_df['name'],\n", - " 'income': per_capita_2024\n", - "})\n", - "df = df.rename(columns={'code': 'LAD24CD'})\n", "\n", "\n", "sim = Simulation(\n", diff --git a/docs/pictures/earning_dist.png b/docs/pictures/earning_dist.png index 10eab49..cde5b6b 100644 Binary files a/docs/pictures/earning_dist.png and b/docs/pictures/earning_dist.png differ diff --git a/docs/pictures/nomis_screenshot1.png b/docs/pictures/nomis_screenshot1.png index c661e30..29e014c 100644 Binary files a/docs/pictures/nomis_screenshot1.png and b/docs/pictures/nomis_screenshot1.png differ