feat: update gapminder.json and add source information #580

dsmedia · 2024-07-11T11:13:12Z

Summary

Updates the gapminder.json dataset from the source
Adds detailed source information for gapminder.json to SOURCES.md
Creates a script file demonstrating the dataset update process

Related Issue

Resolves #577

dsmedia · 2024-07-11T11:19:47Z

For reference, here is the code used to generate the dataset. The code pulls the data from the source spreadsheets (linked from gapminder.org) and then retains only the countries included in the then-current vega-datasets version of gapminder.

import pandas as pd
import json
import re
from vega_datasets import data

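def google_sheet_to_pandas(sheet_url):
    # Extract the spreadsheet key and the sheet (tab) id from the edit URL
    key_match = re.search(r'/d/([a-zA-Z0-9_-]+)', sheet_url)
    gid_match = re.search(r'gid=(\d+)', sheet_url)
    if key_match is None or gid_match is None:
        raise ValueError(f"Not a recognized Google Sheets URL: {sheet_url}")
    sheet_key, gid = key_match.group(1), gid_match.group(1)
    # Build the CSV export URL for that tab and load it
    csv_export_url = f"https://docs.google.com/spreadsheets/d/{sheet_key}/export?format=csv&gid={gid}"
    return pd.read_csv(csv_export_url)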

urls = [
    "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676", #life expectancy v14
    "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676", #population v7
    "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676", #fertility v14
    "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158" #data geographies v2
]

# Load dataframes
df_life, df_pop, df_fert, df_region = [google_sheet_to_pandas(url) for url in urls]

# Prepare main dataframe
df_main = df_pop[['name', 'time', 'Population']].rename(columns={'Population': 'pop'})

# Merge other dataframes
df_main = df_main.merge(df_life[['name', 'time', 'Life expectancy ']], on=['name', 'time'])
df_main = df_main.merge(df_fert[['name', 'time', 'Babies per woman']], on=['name', 'time'])
df_main = df_main.merge(df_region[['name', 'six_regions']], on='name')

# Rename columns
df_main = df_main.rename(columns={
    'name': 'country',
    'time': 'year',
    'Life expectancy ': 'life_expect',
    'Babies per woman': 'fertility',
    'six_regions': 'region'
})

# Reorder columns
df_main = df_main[['year', 'country', 'region', 'pop', 'life_expect', 'fertility']]

# Convert year to int and filter years from 1955 to 2005 in increments of 5
df_main['year'] = df_main['year'].astype(int)
df_main = df_main[df_main['year'].between(1955, 2005) & (df_main['year'] % 5 == 0)]

# Sort the dataframe
df_main = df_main.sort_values(['country', 'year'])

# Create the cluster mapping
cluster_map = {
    'south_asia': 0,
    'europe_central_asia': 1,
    'sub_saharan_africa': 2,
    'america': 3,
    'east_asia_pacific': 4,
    'middle_east_north_africa': 5
}

# Add cluster column and drop region column
df_main['cluster'] = df_main['region'].map(cluster_map)
df_main = df_main.drop('region', axis=1)

# Reorder columns
column_order = ['year', 'country', 'cluster', 'pop', 'life_expect', 'fertility']
df_main = df_main[column_order]

# Load gapminder dataset
df_gapminder = data.gapminder()

# Rename Hong Kong to Hong Kong, China in df_gapminder
df_gapminder.loc[df_gapminder['country'] == 'Hong Kong', 'country'] = 'Hong Kong, China'

# Get the list of countries in df_gapminder
gapminder_countries = set(df_gapminder['country'])

# Keep only rows in df_main that have a country in gapminder_countries
# (.copy() so the column assignment below does not write to a view)
df_main = df_main[df_main['country'].isin(gapminder_countries)].copy()

# Convert population to integer to match data type of original version of the dataset (and handle potential errors)
df_main['pop'] = df_main['pop'].astype(int, errors='ignore')

# Convert DataFrame to list of dictionaries
data_list = df_main.to_dict(orient='records')

# Convert the list of dictionaries to JSON
json_data = json.dumps(data_list)

print(json_data)
with open('gapminder.json', 'w') as f:
    json.dump(data_list, f)
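
The URL rewriting done in `google_sheet_to_pandas` can be checked without network access. The following is a minimal sketch, assuming the spreadsheet links always have the `/d/<key>/...gid=<n>` shape seen in the list above. The helper name `csv_export_url` and the example URL are hypothetical and not part of the script.

```python
import re


def csv_export_url(sheet_url: str) -> str:
    """Turn a Google Sheets edit URL into a CSV export URL (hypothetical helper)."""
    key_match = re.search(r"/d/([A-Za-z0-9_-]+)", sheet_url)
    gid_match = re.search(r"gid=(\d+)", sheet_url)
    if key_match is None or gid_match is None:
        raise ValueError(f"Not a recognized Google Sheets URL: {sheet_url}")
    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{key_match.group(1)}/export?format=csv&gid={gid_match.group(1)}"
    )


# Placeholder URL for illustration only, not one of the Gapminder sheets
example = "https://docs.google.com/spreadsheets/d/abc-123_XYZ/edit?gid=42#gid=42"
print(csv_export_url(example))
```

Raising `ValueError` on an unrecognized link makes a changed or mistyped URL fail early, instead of failing later with an `AttributeError` on a missing regex match.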

domoritz · 2024-07-11T11:58:30Z

Thank you. I think the dataset update and the addition to sources can be one pull request, since the sources update is for the updated dataset, no?

domoritz · 2024-07-11T14:18:12Z

@dsmedia let's add the code to the repo. We have the https://github.com/vega/vega-datasets/tree/main/scripts folder.

…e-sources-md

This commit introduces a Python script that updates the gapminder.json file in the vega-datasets repository. The script:
- Fetches current data from Gapminder's Google Sheets
- Processes and combines data for life expectancy, population, fertility, and regions
- Filters results to match countries in the existing dataset
- Updates consistent with a minor release

The script maintains data consistency by referencing a specific version of the existing dataset. This update allows for refreshed Gapminder data while preserving the column structure and scope of countries/years expected by dependent visualizations.

Related: #580
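
One way to check that a regenerated file keeps the column structure and country/year scope described above is a small validation pass over the records. This is only a sketch and is not part of the committed script. The function name `check_records` and the toy rows are hypothetical, and the expected columns and year range are taken from the code posted earlier in this conversation.

```python
# Column order and year range as used in the generation code above
EXPECTED_COLUMNS = ["year", "country", "cluster", "pop", "life_expect", "fertility"]


def check_records(new_records, old_countries):
    """Return a list of problems found in the regenerated records (empty if none)."""
    problems = []
    for i, row in enumerate(new_records):
        if list(row.keys()) != EXPECTED_COLUMNS:
            problems.append(f"row {i}: unexpected columns {list(row.keys())}")
            continue  # skip further checks, required keys may be missing
        if row["country"] not in old_countries:
            problems.append(f"row {i}: country {row['country']!r} not in previous dataset")
        if not (1955 <= row["year"] <= 2005 and row["year"] % 5 == 0):
            problems.append(f"row {i}: year {row['year']} outside 1955-2005 in steps of 5")
    return problems


# Toy rows with made-up values, for illustration only
good = [{"year": 1955, "country": "Country A", "cluster": 0,
         "pop": 1000, "life_expect": 50.0, "fertility": 5.0}]
bad = [{"year": 1957, "country": "Country B", "cluster": 1,
        "pop": 2000, "life_expect": 60.0, "fertility": 3.0}]

for problem in check_records(good + bad, {"Country A"}):
    print(problem)
```

In practice the records would come from `json.load` on the new gapminder.json, and `old_countries` from the country column of the previous release.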

dsmedia · 2024-07-12T10:59:38Z

Merged all related changes into this PR and added the updater to the script folder.

domoritz · 2024-07-16T19:56:03Z

thank you

dsmedia added 3 commits July 11, 2024 06:54

docs: add gapminder source details to SOURCES.md

36e8120

feat: update gapminder.json dataset from source

2fec055

feat: update gapminder.json dataset from source

9af98ab

dsmedia added 2 commits July 12, 2024 05:29

Merge remote-tracking branch 'origin/fix-update-gapminder' into updat…

66bdccb

…e-sources-md

dsmedia changed the title ~~feat: update gapminder.json dataset from source~~ feat: update gapminder.json and add source information Jul 12, 2024

dsmedia mentioned this pull request Jul 12, 2024

docs: add gapminder source details to SOURCES.md #579

Closed

fix: typo

3def5c3

domoritz merged commit 76feaab into vega:main Jul 16, 2024
2 checks passed
