feat: update gapminder.json and add source information #580
For reference, here is the code used to generate the dataset. The code pulls the data from source spreadsheets (links via gapminder.org) and then retains only the countries included in the then-current vega-datasets version of gapminder.

```python
import json
import re

import pandas as pd
from vega_datasets import data


def google_sheet_to_pandas(sheet_url):
    """Read one tab of a Google Sheet into a DataFrame via its CSV export URL."""
    key_match = re.search(r'/d/([a-zA-Z0-9_-]+)', sheet_url)
    gid_match = re.search(r'gid=(\d+)', sheet_url)
    if key_match is None or gid_match is None:
        raise ValueError(f"Could not find sheet key and gid in: {sheet_url}")
    sheet_key, gid = key_match.group(1), gid_match.group(1)
    csv_export_url = f"https://docs.google.com/spreadsheets/d/{sheet_key}/export?format=csv&gid={gid}"
    return pd.read_csv(csv_export_url)


urls = [
    "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676",  # life expectancy v14
    "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676",  # population v7
    "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676",  # fertility v14
    "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158",  # data geographies v2
]

# Load dataframes
df_life, df_pop, df_fert, df_region = [google_sheet_to_pandas(url) for url in urls]

# Prepare main dataframe
df_main = df_pop[['name', 'time', 'Population']].rename(columns={'Population': 'pop'})

# Merge other dataframes (default inner joins: unmatched rows are dropped)
df_main = df_main.merge(df_life[['name', 'time', 'Life expectancy ']], on=['name', 'time'])
df_main = df_main.merge(df_fert[['name', 'time', 'Babies per woman']], on=['name', 'time'])
df_main = df_main.merge(df_region[['name', 'six_regions']], on='name')

# Rename columns
df_main = df_main.rename(columns={
    'name': 'country',
    'time': 'year',
    'Life expectancy ': 'life_expect',
    'Babies per woman': 'fertility',
    'six_regions': 'region'
})

# Reorder columns
df_main = df_main[['year', 'country', 'region', 'pop', 'life_expect', 'fertility']]

# Convert year to int and filter years from 1955 to 2005 in increments of 5
df_main['year'] = df_main['year'].astype(int)
df_main = df_main[df_main['year'].between(1955, 2005) & (df_main['year'] % 5 == 0)].copy()

# Sort the dataframe
df_main = df_main.sort_values(['country', 'year'])

# Create the cluster mapping
cluster_map = {
    'south_asia': 0,
    'europe_central_asia': 1,
    'sub_saharan_africa': 2,
    'america': 3,
    'east_asia_pacific': 4,
    'middle_east_north_africa': 5
}

# Add cluster column and drop region column
df_main['cluster'] = df_main['region'].map(cluster_map)
df_main = df_main.drop('region', axis=1)

# Reorder columns
column_order = ['year', 'country', 'cluster', 'pop', 'life_expect', 'fertility']
df_main = df_main[column_order]

# Load gapminder dataset
df_gapminder = data.gapminder()

# Rename Hong Kong to Hong Kong, China in df_gapminder
df_gapminder.loc[df_gapminder['country'] == 'Hong Kong', 'country'] = 'Hong Kong, China'

# Get the list of countries in df_gapminder
gapminder_countries = set(df_gapminder['country'])

# Keep only rows in df_main that have a country in gapminder_countries
# (copy so the column assignment below acts on an independent frame)
df_main = df_main[df_main['country'].isin(gapminder_countries)].copy()

# Convert population to integer to match data type of original version of the dataset (and handle potential errors)
df_main['pop'] = df_main['pop'].astype(int, errors='ignore')

# Convert DataFrame to list of dictionaries
data_list = df_main.to_dict(orient='records')

# Convert the list of dictionaries to JSON
json_data = json.dumps(data_list)
print(json_data)

with open('gapminder.json', 'w') as f:
    json.dump(data_list, f)
```
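The URL rewriting inside `google_sheet_to_pandas` can be checked without any network access. The following is a minimal sketch, assuming the same two regular expressions as the script; the helper name `sheet_url_to_csv_export` is hypothetical and not part of the script or of any library.

```python
import re


def sheet_url_to_csv_export(sheet_url: str) -> str:
    """Build a CSV export URL from a Google Sheets edit URL.

    Hypothetical helper: it isolates the string handling done in
    google_sheet_to_pandas so it can be tested offline.
    """
    key_match = re.search(r"/d/([a-zA-Z0-9_-]+)", sheet_url)
    gid_match = re.search(r"gid=(\d+)", sheet_url)
    if key_match is None or gid_match is None:
        raise ValueError(f"Could not find sheet key and gid in: {sheet_url}")
    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{key_match.group(1)}/export?format=csv&gid={gid_match.group(1)}"
    )


# The life expectancy sheet URL from the script above
example = (
    "https://docs.google.com/spreadsheets/d/"
    "1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
)
csv_url = sheet_url_to_csv_export(example)
print(csv_url)
```

Raising a `ValueError` on a malformed URL is a design choice of this sketch; it makes a bad link fail with a readable message instead of an `AttributeError` on `None`.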
Thank you. I think the dataset update and the additions to the sources can be one pull request, since the source updates are for the updated dataset, no?
@dsmedia let's add the code to the repo. We have the https://github.com/vega/vega-datasets/tree/main/scripts folder.
This commit introduces a Python script that updates the gapminder.json file in the vega-datasets repository. The script:

- Fetches current data from Gapminder's Google Sheets
- Processes and combines data for life expectancy, population, fertility, and regions
- Filters results to match countries in the existing dataset
- Keeps the update consistent with a minor release

The script maintains data consistency by referencing a specific version of the existing dataset. This update allows for refreshed Gapminder data while preserving the column structure and scope of countries/years expected by dependent visualizations.

Related: #580
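The claim that the refreshed file preserves the column structure and the scope of countries and years could be checked with a small script. What follows is only a sketch under stated assumptions: the function `check_records` and the constants are hypothetical, the column order and year range are taken from the generation code in this thread, and the sample records are made-up placeholders, not Gapminder values.

```python
# Column order and year range as produced by the generation code in this thread
EXPECTED_COLUMNS = ["year", "country", "cluster", "pop", "life_expect", "fertility"]
EXPECTED_YEARS = set(range(1955, 2010, 5))  # 1955, 1960, ..., 2005


def check_records(old_records, new_records):
    """Return a list of problems; an empty list means the check passed.

    Hypothetical helper: both arguments are lists of dicts, as loaded
    from the old and new gapminder.json files with json.load.
    """
    problems = []
    for i, rec in enumerate(new_records):
        if list(rec) != EXPECTED_COLUMNS:
            problems.append(f"record {i}: columns {list(rec)} differ from expected")
        if rec.get("year") not in EXPECTED_YEARS:
            problems.append(f"record {i}: unexpected year {rec.get('year')}")
    old_countries = {r["country"] for r in old_records if "country" in r}
    new_countries = {r["country"] for r in new_records if "country" in r}
    for name in sorted(old_countries - new_countries):
        problems.append(f"missing country: {name}")
    for name in sorted(new_countries - old_countries):
        problems.append(f"unexpected country: {name}")
    return problems


# Made-up placeholder records, for illustration only
old = [{"country": "Placeholder A"}, {"country": "Placeholder B"}]
new_ok = [
    {"year": 1955, "country": "Placeholder A", "cluster": 0, "pop": 100, "life_expect": 50.0, "fertility": 5.0},
    {"year": 1960, "country": "Placeholder B", "cluster": 1, "pop": 200, "life_expect": 60.0, "fertility": 4.0},
]
new_bad = [
    new_ok[0],
    {"year": 1961, "country": "Placeholder C", "cluster": 1, "pop": 200, "life_expect": 60.0, "fertility": 4.0},
]

print(check_records(old, new_ok))
print(check_records(old, new_bad))
```

Because the generation code renames `Hong Kong` to `Hong Kong, China` before matching, the same rename would need to be applied to the old records before a comparison like this one.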
Merged all related changes into this PR and added the updater to the scripts folder.
Thank you.
Summary
- Update `gapminder.json` dataset from the source
- Add `gapminder.json` to `SOURCES.md`
Related Issue
Resolves #577