Creation of SOURCES.toml #634

Closed
dsmedia opened this issue Nov 29, 2024 · 15 comments · Fixed by #643

Comments

@dsmedia
Collaborator

dsmedia commented Nov 29, 2024

Before migrating the extrinsic metadata of each dataset from markdown to a machine-readable format, should we agree on a yaml template that would work well with the new Frictionless tooling? Should we make any fields required (such as sources) to ensure future datasets are properly documented before release? Not all datasets have sources now, but we can add them. What should the yaml file be named?

@dangotbanned
Member

@dsmedia just to echo #631 (comment)

Is there a big benefit to including the yaml in addition to json? json is much more common (and the only one of the two natively supported in python/js), and the readability difference is small enough that I would say let's only have json.

@domoritz having yaml doesn't benefit me personally, just thought I'd provide the options @dsmedia mentioned in #629 (comment):

Just thinking out loud, but instead of directly maintaining the sources.md file, we could keep the dataset metadata in a json or yaml file, and generate the sources.md file from this machine-readable format.

I'm happy with just json

If we wanted a non-json format, I'd suggest .toml since it is natively supported in python.
For the extrinsic fields you mentioned in (#631 (comment)), I imagine the toml-array-of-tables syntax would be handy.

class ResourceMeta(TypedDict, total=False):
    description: str
    sources: Sequence[Source]
    licenses: Sequence[License]

I'm not sure how familiar you are with TypedDict(s), but you can enforce any required-and-notrequired constraints you like on the hierarchy I started in build_datapackage.py
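
As a rough illustration of the required/not-required idea, here is a minimal sketch. It assumes a hypothetical policy in which `sources` is mandatory and everything else is optional; the `_RequiredMeta` base class and the `missing_keys` helper are invented for this example and are not taken from build_datapackage.py. The `Source` and `License` fields follow the Data Package property names.

```python
from collections.abc import Sequence
from typing import TypedDict


class Source(TypedDict, total=False):
    title: str
    path: str
    email: str
    version: str


class License(TypedDict, total=False):
    name: str
    path: str
    title: str


class _RequiredMeta(TypedDict):
    # Hypothetical policy: every resource must document its sources.
    sources: Sequence[Source]


class ResourceMeta(_RequiredMeta, total=False):
    # Keys declared here are optional because of total=False.
    description: str
    licenses: Sequence[License]


def missing_keys(meta: dict) -> list[str]:
    """Return the required keys that are absent from ``meta`` (runtime check)."""
    return sorted(ResourceMeta.__required_keys__ - set(meta))


print(missing_keys({"description": "no sources yet"}))
```

A static type checker would flag a `ResourceMeta` literal without `sources`; the `missing_keys` helper gives the same signal at runtime, which could feed an automated check.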

@domoritz
Member

Sounds good to me. I don't mind either format and having automated checks sounds great.

@dsmedia
Collaborator Author

dsmedia commented Dec 1, 2024

Might something like this work for a TOML format, containing resource-level (i.e. dataset level) description, source and license information? This is just a proof-of-concept that includes three of the datasets: budget.json, countries.json, and gapminder.json. (I've also pulled into this file the package-level license information now hard-coded into the generation script file, to separate configuration from code.) I assume that the generation script will be able to match these to the resources with their resource name (i.e. the filename without the extension). At a later stage, resource-level column descriptions (for tabular data) could be incorporated into the TOML file (to supplement the column names and types identified by the script).

SOURCES.toml
# Package-level license information
[package.license]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"

# Resource metadata

[budget]
description = "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"

[budget.sources]
title = "Office of Management and Budget (U.S.)" 
path = "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"

[countries]
description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[[countries.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[countries.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[countries.licenses]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

[gapminder]
description = """
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[[gapminder.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[gapminder.sources]]
title = "Gapminder Foundation - Population"
path = "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
version = "v7"

[[gapminder.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[gapminder.sources]]
title = "Gapminder Foundation - Data Geographies"
path = "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
version = "v2"

[gapminder.licenses]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

Here are the relevant excerpts from the data package definition and data resource definition

sources

sources (resource-level)
List of data sources as for Data Package. If not specified the resource inherits from the data package.

sources (package-level)
The raw sources for this data package. It MUST be an array of Source objects. A Source object MUST have at least one property. A Source object is RECOMMENDED to have title property and MAY have path, email, and version properties:

  • title: A string containing a title of the source (e.g. document or organization name).
  • path: A URL or Path, that is a fully qualified HTTP address, or a relative POSIX path.
  • email: A string containing an email address.
  • version: A string containing a version of the source.
An example of the object structure is as follows:

"sources": [{
  "title": "World Bank and OECD",
  "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}]
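
The rules quoted above for `sources` could be checked along these lines. This is a sketch only: the function name `validate_sources` and the message wording are made up for illustration, and it covers just the MUST and RECOMMENDED statements in the excerpt, not the whole specification.

```python
def validate_sources(sources: object) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a ``sources`` value.

    Errors cover the MUST rules; warnings cover the RECOMMENDED title.
    """
    if not isinstance(sources, list):
        return ["sources MUST be an array of Source objects"], []
    errors: list[str] = []
    warnings: list[str] = []
    for index, source in enumerate(sources):
        if not isinstance(source, dict) or not source:
            errors.append(
                f"sources[{index}] MUST be an object with at least one property"
            )
        elif "title" not in source:
            warnings.append(f"sources[{index}] has no title (RECOMMENDED)")
    return errors, warnings


example = [
    {
        "title": "World Bank and OECD",
        "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD",
    }
]
print(validate_sources(example))
```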

licenses

licenses (resource-level)

List of licenses as for Data Package. If not specified the resource inherits from the data package.

licenses (package-level)

The license(s) under which the package is provided.

Caution

This property is not legally binding and does not guarantee the package is licensed under the terms defined in this property.

licenses MUST be an array. Each item in the array is a License. Each MUST be an object. The object MUST contain a name property and/or a path property, and it MAY contain a title property:

  • name: A string containing an Open Definition license ID
  • path: A URL or Path, that is a fully qualified HTTP address, or a relative POSIX path.
  • title: A string containing human-readable title.
An example of using the licenses property:

"licenses": [{
  "name": "ODC-PDDL-1.0",
  "path": "http://opendatacommons.org/licenses/pddl/",
  "title": "Open Data Commons Public Domain Dedication and License v1.0"
}]
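
A comparable sketch for `licenses`, including the inheritance rule (a resource without its own licenses falls back to the package-level list). The helper names `validate_license` and `effective_licenses` are hypothetical, and only the rules quoted above are checked.

```python
def validate_license(license_: object) -> list[str]:
    """Check one License item against the MUST rules quoted above."""
    if not isinstance(license_, dict):
        return ["each license MUST be an object"]
    if "name" not in license_ and "path" not in license_:
        return ["a license MUST contain a name and/or a path"]
    return []


def effective_licenses(resource: dict, package: dict) -> list:
    """Resource-level licenses, falling back to the package-level list."""
    return resource.get("licenses", package.get("licenses", []))


package = {
    "licenses": [
        {
            "name": "BSD-3-Clause",
            "path": "https://opensource.org/license/bsd-3-clause",
            "title": "The 3-Clause BSD License",
        }
    ]
}
gapminder = {"name": "gapminder", "licenses": [{"name": "CC-BY-4.0"}]}
budget = {"name": "budget"}  # no resource-level licenses

print(effective_licenses(gapminder, package))
print(effective_licenses(budget, package))
```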

@domoritz
Member

domoritz commented Dec 1, 2024

Looks good. Why are some entries in single [ and some in double [[?

@dangotbanned
Member

dangotbanned commented Dec 1, 2024

@dsmedia I like the look of the proposed SOURCES.toml.

I wanted to provide a comparison to what datapackage.json would look like in toml.

The main difference is that the bulk of the content is within [[resources]] tables (array of tables)

datapackage.toml

name = "vega-datasets"
description = "Common repository for example datasets used by Vega related projects."
homepage = "http://github.com/vega/vega-datasets.git"
sources = [
    { path = "https://github.com/vega/vega-datasets/blob/next/SOURCES.md" },
]
contributors = [
    { title = "UW Interactive Data Lab", path = "http://idl.cs.washington.edu" },
]
version = "2.11.0"
created = "2024-12-01T15:50:47.863271+00:00"

[[licenses]]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"

[[resources]]
name = "7zip"
type = "file"
path = "7zip.png"
scheme = "file"
format = "png"
mediatype = "image/png"
encoding = "utf-8"

[[resources]]
name = "airports"
type = "table"
path = "airports.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "iata", type = "string" },
    { name = "name", type = "string" },
    { name = "city", type = "string" },
    { name = "state", type = "string" },
    { name = "country", type = "string" },
    { name = "latitude", type = "number" },
    { name = "longitude", type = "number" },
]

[[resources]]
name = "annual-precip"
type = "json"
path = "annual-precip.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[[resources]]
name = "anscombe"
type = "table"
path = "anscombe.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Series", type = "string" },
    { name = "X", type = "integer" },
    { name = "Y", type = "number" },
]

[[resources]]
name = "barley"
type = "table"
path = "barley.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "yield", type = "number" },
    { name = "variety", type = "string" },
    { name = "year", type = "integer" },
    { name = "site", type = "string" },
]

[[resources]]
name = "birdstrikes"
type = "table"
path = "birdstrikes.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "Airport Name", type = "string" },
    { name = "Aircraft Make Model", type = "string" },
    { name = "Effect Amount of damage", type = "string" },
    { name = "Flight Date", type = "date" },
    { name = "Aircraft Airline Operator", type = "string" },
    { name = "Origin State", type = "string" },
    { name = "Phase of flight", type = "string" },
    { name = "Wildlife Size", type = "string" },
    { name = "Wildlife Species", type = "string" },
    { name = "Time of day", type = "string" },
    { name = "Cost Other", type = "integer" },
    { name = "Cost Repair", type = "integer" },
    { name = "Cost Total $", type = "integer" },
    { name = "Speed IAS in knots", type = "integer" },
]

[[resources]]
name = "budget"
type = "table"
path = "budget.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Source Category Code", type = "integer" },
    { name = "Source category name", type = "string" },
    { name = "Source subcategory", type = "integer" },
    { name = "Source subcategory name", type = "string" },
    { name = "Agency code", type = "integer" },
    { name = "Agency name", type = "string" },
    { name = "Bureau code", type = "integer" },
    { name = "Bureau name", type = "string" },
    { name = "Account code", type = "integer" },
    { name = "Account name", type = "string" },
    { name = "Treasury Agency code", type = "integer" },
    { name = "On- or off-budget", type = "string" },
    { name = "1962", type = "string" },
    { name = "1963", type = "string" },
    { name = "1964", type = "string" },
    { name = "1965", type = "string" },
    { name = "1966", type = "string" },
    { name = "1967", type = "string" },
    { name = "1968", type = "string" },
    { name = "1969", type = "string" },
    { name = "1970", type = "string" },
    { name = "1971", type = "string" },
    { name = "1972", type = "string" },
    { name = "1973", type = "string" },
    { name = "1974", type = "string" },
    { name = "1975", type = "string" },
    { name = "1976", type = "string" },
    { name = "TQ", type = "string" },
    { name = "1977", type = "string" },
    { name = "1978", type = "string" },
    { name = "1979", type = "string" },
    { name = "1980", type = "string" },
    { name = "1981", type = "string" },
    { name = "1982", type = "string" },
    { name = "1983", type = "string" },
    { name = "1984", type = "string" },
    { name = "1985", type = "string" },
    { name = "1986", type = "string" },
    { name = "1987", type = "string" },
    { name = "1988", type = "string" },
    { name = "1989", type = "string" },
    { name = "1990", type = "string" },
    { name = "1991", type = "string" },
    { name = "1992", type = "string" },
    { name = "1993", type = "string" },
    { name = "1994", type = "string" },
    { name = "1995", type = "string" },
    { name = "1996", type = "string" },
    { name = "1997", type = "string" },
    { name = "1998", type = "string" },
    { name = "1999", type = "string" },
    { name = "2000", type = "string" },
    { name = "2001", type = "string" },
    { name = "2002", type = "string" },
    { name = "2003", type = "string" },
    { name = "2004", type = "string" },
    { name = "2005", type = "string" },
    { name = "2006", type = "string" },
    { name = "2007", type = "string" },
    { name = "2008", type = "string" },
    { name = "2009", type = "string" },
    { name = "2010", type = "string" },
    { name = "2011", type = "string" },
    { name = "2012", type = "string" },
    { name = "2013", type = "string" },
    { name = "2014", type = "string" },
    { name = "2015", type = "string" },
    { name = "2016", type = "string" },
    { name = "2017", type = "string" },
    { name = "2018", type = "string" },
    { name = "2019", type = "string" },
    { name = "2020", type = "string" },
]

[[resources]]
name = "budgets"
type = "table"
path = "budgets.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "budgetYear", type = "integer" },
    { name = "forecastYear", type = "integer" },
    { name = "value", type = "number" },
]

[[resources]]
name = "burtin"
type = "table"
path = "burtin.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Bacteria", type = "string" },
    { name = "Penicillin", type = "number" },
    { name = "Streptomycin", type = "number" },
    { name = "Neomycin", type = "number" },
    { name = "Gram_Staining", type = "string" },
    { name = "Genus", type = "string" },
]

[[resources]]
name = "cars"
type = "table"
path = "cars.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Name", type = "string" },
    { name = "Miles_per_Gallon", type = "integer" },
    { name = "Cylinders", type = "integer" },
    { name = "Displacement", type = "number" },
    { name = "Horsepower", type = "integer" },
    { name = "Weight_in_lbs", type = "integer" },
    { name = "Acceleration", type = "number" },
    { name = "Year", type = "date" },
    { name = "Origin", type = "string" },
]

[[resources]]
name = "co2-concentration"
type = "table"
path = "co2-concentration.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "Date", type = "date" },
    { name = "CO2", type = "number" },
    { name = "adjusted CO2", type = "number" },
]

[[resources]]
name = "countries"
type = "table"
path = "countries.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "_comment", type = "string" },
    { name = "year", type = "integer" },
    { name = "fertility", type = "number" },
    { name = "life_expect", type = "number" },
    { name = "n_fertility", type = "number" },
    { name = "n_life_expect", type = "number" },
    { name = "country", type = "string" },
]

[[resources]]
name = "crimea"
type = "table"
path = "crimea.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "wounds", type = "integer" },
    { name = "other", type = "integer" },
    { name = "disease", type = "integer" },
]

[[resources]]
name = "disasters"
type = "table"
path = "disasters.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "Entity", type = "string" },
    { name = "Year", type = "integer" },
    { name = "Deaths", type = "integer" },
]

[[resources]]
name = "driving"
type = "table"
path = "driving.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "side", type = "string" },
    { name = "year", type = "integer" },
    { name = "miles", type = "integer" },
    { name = "gas", type = "number" },
]

[[resources]]
name = "earthquakes"
type = "json"
path = "earthquakes.json"
scheme = "file"
format = "geojson"
mediatype = "text/geojson"
encoding = "utf-8"

[[resources]]
name = "ffox"
type = "file"
path = "ffox.png"
scheme = "file"
format = "png"
mediatype = "image/png"
encoding = "utf-8"

[[resources]]
name = "flare-dependencies"
type = "table"
path = "flare-dependencies.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "source", type = "integer" },
    { name = "target", type = "integer" },
]

[[resources]]
name = "flare"
type = "table"
path = "flare.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "id", type = "integer" },
    { name = "name", type = "string" },
]

[[resources]]
name = "flights-10k"
type = "table"
path = "flights-10k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-200k"
type = "table"
path = "flights-200k.arrow"
scheme = "file"
format = "arrow"
mediatype = "application/vnd.apache.arrow.file"

[resources.schema]
fields = [
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "time", type = "number" },
]

[[resources]]
name = "flights-200k"
type = "table"
path = "flights-200k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "time", type = "number" },
]

[[resources]]
name = "flights-20k"
type = "table"
path = "flights-20k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-2k"
type = "table"
path = "flights-2k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-3m"
type = "table"
path = "flights-3m.parquet"
scheme = "file"
format = "parquet"
mediatype = "application/parquet"

[resources.schema]
fields = [
    { name = "date", type = "integer" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-5k"
type = "table"
path = "flights-5k.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "delay", type = "integer" },
    { name = "distance", type = "integer" },
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
]

[[resources]]
name = "flights-airport"
type = "table"
path = "flights-airport.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "origin", type = "string" },
    { name = "destination", type = "string" },
    { name = "count", type = "integer" },
]

[[resources]]
name = "football"
type = "table"
path = "football.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "division", type = "string" },
    { name = "home_team", type = "string" },
    { name = "away_team", type = "string" },
    { name = "home_score", type = "integer" },
    { name = "away_score", type = "integer" },
]

[[resources]]
name = "gapminder-health-income"
type = "table"
path = "gapminder-health-income.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "country", type = "string" },
    { name = "income", type = "integer" },
    { name = "health", type = "number" },
    { name = "population", type = "integer" },
    { name = "region", type = "string" },
]

[[resources]]
name = "gapminder"
type = "table"
path = "gapminder.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "year", type = "integer" },
    { name = "country", type = "string" },
    { name = "cluster", type = "integer" },
    { name = "pop", type = "integer" },
    { name = "life_expect", type = "number" },
    { name = "fertility", type = "number" },
]

[[resources]]
name = "gimp"
type = "file"
path = "gimp.png"
scheme = "file"
format = "png"
mediatype = "image/png"
encoding = "utf-8"

[[resources]]
name = "github"
type = "table"
path = "github.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "time", type = "string" },
    { name = "count", type = "integer" },
]

[[resources]]
name = "global-temp"
type = "table"
path = "global-temp.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "year", type = "integer" },
    { name = "temp", type = "number" },
]

[[resources]]
name = "income"
type = "table"
path = "income.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "name", type = "string" },
    { name = "region", type = "string" },
    { name = "id", type = "integer" },
    { name = "pct", type = "number" },
    { name = "total", type = "integer" },
    { name = "group", type = "string" },
]

[[resources]]
name = "iowa-electricity"
type = "table"
path = "iowa-electricity.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "year", type = "date" },
    { name = "source", type = "string" },
    { name = "net_generation", type = "integer" },
]

[[resources]]
name = "jobs"
type = "table"
path = "jobs.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "job", type = "string" },
    { name = "sex", type = "string" },
    { name = "year", type = "integer" },
    { name = "count", type = "integer" },
    { name = "perc", type = "number" },
]

[[resources]]
name = "la-riots"
type = "table"
path = "la-riots.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "first_name", type = "string" },
    { name = "last_name", type = "string" },
    { name = "age", type = "integer" },
    { name = "gender", type = "string" },
    { name = "race", type = "string" },
    { name = "death_date", type = "date" },
    { name = "address", type = "string" },
    { name = "neighborhood", type = "string" },
    { name = "type", type = "string" },
    { name = "longitude", type = "number" },
    { name = "latitude", type = "number" },
]

[[resources]]
name = "londonboroughs"
type = "json"
path = "londonBoroughs.json"
scheme = "file"
format = "topojson"
mediatype = "text/topojson"
encoding = "utf-8"

[[resources]]
name = "londoncentroids"
type = "table"
path = "londonCentroids.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "name", type = "string" },
    { name = "cx", type = "number" },
    { name = "cy", type = "number" },
]

[[resources]]
name = "londontubelines"
type = "json"
path = "londonTubeLines.json"
scheme = "file"
format = "topojson"
mediatype = "text/topojson"
encoding = "utf-8"

[[resources]]
name = "lookup_groups"
type = "table"
path = "lookup_groups.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "group", type = "integer" },
    { name = "person", type = "string" },
]

[[resources]]
name = "lookup_people"
type = "table"
path = "lookup_people.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "name", type = "string" },
    { name = "age", type = "integer" },
    { name = "height", type = "integer" },
]

[[resources]]
name = "miserables"
type = "json"
path = "miserables.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[[resources]]
name = "monarchs"
type = "table"
path = "monarchs.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "name", type = "string" },
    { name = "start", type = "integer" },
    { name = "end", type = "integer" },
    { name = "index", type = "integer" },
]

[[resources]]
name = "movies"
type = "table"
path = "movies.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Title", type = "string" },
    { name = "US Gross", type = "integer" },
    { name = "Worldwide Gross", type = "integer" },
    { name = "US DVD Sales", type = "integer" },
    { name = "Production Budget", type = "integer" },
    { name = "Release Date", type = "string" },
    { name = "MPAA Rating", type = "string" },
    { name = "Running Time min", type = "integer" },
    { name = "Distributor", type = "string" },
    { name = "Source", type = "string" },
    { name = "Major Genre", type = "string" },
    { name = "Creative Type", type = "string" },
    { name = "Director", type = "string" },
    { name = "Rotten Tomatoes Rating", type = "integer" },
    { name = "IMDB Rating", type = "number" },
    { name = "IMDB Votes", type = "integer" },
]

[[resources]]
name = "normal-2d"
type = "table"
path = "normal-2d.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "u", type = "number" },
    { name = "v", type = "number" },
]

[[resources]]
name = "obesity"
type = "table"
path = "obesity.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "id", type = "integer" },
    { name = "rate", type = "number" },
    { name = "state", type = "string" },
]

[[resources]]
name = "ohlc"
type = "table"
path = "ohlc.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "open", type = "number" },
    { name = "high", type = "number" },
    { name = "low", type = "number" },
    { name = "close", type = "number" },
    { name = "signal", type = "string" },
    { name = "ret", type = "number" },
]

[[resources]]
name = "penguins"
type = "table"
path = "penguins.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Species", type = "string" },
    { name = "Island", type = "string" },
    { name = "Beak Length (mm)", type = "number" },
    { name = "Beak Depth (mm)", type = "number" },
    { name = "Flipper Length (mm)", type = "integer" },
    { name = "Body Mass (g)", type = "integer" },
    { name = "Sex", type = "string" },
]

[[resources]]
name = "platformer-terrain"
type = "table"
path = "platformer-terrain.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "x", type = "integer" },
    { name = "y", type = "integer" },
    { name = "lumosity", type = "number" },
    { name = "saturation", type = "integer" },
    { name = "name", type = "string" },
    { name = "id", type = "string" },
    { name = "color", type = "string" },
    { name = "key", type = "string" },
]

[[resources]]
name = "points"
type = "table"
path = "points.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "x", type = "number" },
    { name = "y", type = "number" },
]

[[resources]]
name = "political-contributions"
type = "table"
path = "political-contributions.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "Candidate_Identification", type = "string" },
    { name = "Candidate_Name", type = "string" },
    { name = "Incumbent_Challenger_Status", type = "string" },
    { name = "Party_Code", type = "integer" },
    { name = "Party_Affiliation", type = "string" },
    { name = "Total_Receipts", type = "number" },
    { name = "Transfers_from_Authorized_Committees", type = "integer" },
    { name = "Total_Disbursements", type = "number" },
    { name = "Transfers_to_Authorized_Committees", type = "number" },
    { name = "Beginning_Cash", type = "number" },
    { name = "Ending_Cash", type = "number" },
    { name = "Contributions_from_Candidate", type = "number" },
    { name = "Loans_from_Candidate", type = "integer" },
    { name = "Other_Loans", type = "integer" },
    { name = "Candidate_Loan_Repayments", type = "number" },
    { name = "Other_Loan_Repayments", type = "integer" },
    { name = "Debts_Owed_By", type = "number" },
    { name = "Total_Individual_Contributions", type = "integer" },
    { name = "Candidate_State", type = "string" },
    { name = "Candidate_District", type = "integer" },
    { name = "Contributions_from_Other_Political_Committees", type = "integer" },
    { name = "Contributions_from_Party_Committees", type = "integer" },
    { name = "Coverage_End_Date", type = "string" },
    { name = "Refunds_to_Individuals", type = "integer" },
    { name = "Refunds_to_Committees", type = "integer" },
]

[[resources]]
name = "population"
type = "table"
path = "population.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "year", type = "integer" },
    { name = "age", type = "integer" },
    { name = "sex", type = "integer" },
    { name = "people", type = "integer" },
]

[[resources]]
name = "population_engineers_hurricanes"
type = "table"
path = "population_engineers_hurricanes.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "state", type = "string" },
    { name = "id", type = "integer" },
    { name = "population", type = "integer" },
    { name = "engineers", type = "number" },
    { name = "hurricanes", type = "integer" },
]

[[resources]]
name = "seattle-weather-hourly-normals"
type = "table"
path = "seattle-weather-hourly-normals.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "date", type = "datetime" },
    { name = "pressure", type = "number" },
    { name = "temperature", type = "number" },
    { name = "wind", type = "number" },
]

[[resources]]
name = "seattle-weather"
type = "table"
path = "seattle-weather.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "precipitation", type = "number" },
    { name = "temp_max", type = "number" },
    { name = "temp_min", type = "number" },
    { name = "wind", type = "number" },
    { name = "weather", type = "string" },
]

[[resources]]
name = "sp500-2000"
type = "table"
path = "sp500-2000.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "date", type = "date" },
    { name = "open", type = "number" },
    { name = "high", type = "number" },
    { name = "low", type = "number" },
    { name = "close", type = "number" },
    { name = "adjclose", type = "number" },
    { name = "volume", type = "integer" },
]

[[resources]]
name = "sp500"
type = "table"
path = "sp500.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "date", type = "string" },
    { name = "price", type = "number" },
]

[[resources]]
name = "stocks"
type = "table"
path = "stocks.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "symbol", type = "string" },
    { name = "date", type = "string" },
    { name = "price", type = "number" },
]

[[resources]]
name = "udistrict"
type = "table"
path = "udistrict.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "key", type = "string" },
    { name = "lat", type = "number" },
]

[[resources]]
name = "unemployment-across-industries"
type = "table"
path = "unemployment-across-industries.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "series", type = "string" },
    { name = "year", type = "integer" },
    { name = "month", type = "integer" },
    { name = "count", type = "integer" },
    { name = "rate", type = "number" },
    { name = "date", type = "datetime" },
]

[[resources]]
name = "unemployment"
type = "table"
path = "unemployment.tsv"
scheme = "file"
format = "tsv"
mediatype = "text/tsv"
encoding = "utf-8"

[resources.dialect.csv]
delimiter = "	"

[resources.schema]
fields = [
    { name = "id", type = "integer" },
    { name = "rate", type = "number" },
]

[[resources]]
name = "uniform-2d"
type = "table"
path = "uniform-2d.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "u", type = "number" },
    { name = "v", type = "number" },
]

[[resources]]
name = "us-10m"
type = "json"
path = "us-10m.json"
scheme = "file"
format = "topojson"
mediatype = "text/topojson"
encoding = "utf-8"

[[resources]]
name = "us-employment"
type = "table"
path = "us-employment.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "month", type = "date" },
    { name = "nonfarm", type = "integer" },
    { name = "private", type = "integer" },
    { name = "goods_producing", type = "integer" },
    { name = "service_providing", type = "integer" },
    { name = "private_service_providing", type = "integer" },
    { name = "mining_and_logging", type = "integer" },
    { name = "construction", type = "integer" },
    { name = "manufacturing", type = "integer" },
    { name = "durable_goods", type = "integer" },
    { name = "nondurable_goods", type = "integer" },
    { name = "trade_transportation_utilties", type = "integer" },
    { name = "wholesale_trade", type = "number" },
    { name = "retail_trade", type = "number" },
    { name = "transportation_and_warehousing", type = "number" },
    { name = "utilities", type = "number" },
    { name = "information", type = "integer" },
    { name = "financial_activities", type = "integer" },
    { name = "professional_and_business_services", type = "integer" },
    { name = "education_and_health_services", type = "integer" },
    { name = "leisure_and_hospitality", type = "integer" },
    { name = "other_services", type = "integer" },
    { name = "government", type = "integer" },
    { name = "nonfarm_change", type = "integer" },
]

[[resources]]
name = "us-state-capitals"
type = "table"
path = "us-state-capitals.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "lon", type = "number" },
    { name = "lat", type = "number" },
    { name = "state", type = "string" },
    { name = "city", type = "string" },
]

[[resources]]
name = "volcano"
type = "json"
path = "volcano.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[[resources]]
name = "weather"
type = "table"
path = "weather.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "location", type = "string" },
    { name = "date", type = "date" },
    { name = "precipitation", type = "number" },
    { name = "temp_max", type = "number" },
    { name = "temp_min", type = "number" },
    { name = "wind", type = "number" },
    { name = "weather", type = "string" },
]

[[resources]]
name = "weather"
type = "json"
path = "weather.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[[resources]]
name = "wheat"
type = "table"
path = "wheat.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "year", type = "integer" },
    { name = "wheat", type = "number" },
    { name = "wages", type = "number" },
]

[[resources]]
name = "windvectors"
type = "table"
path = "windvectors.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "longitude", type = "number" },
    { name = "latitude", type = "number" },
    { name = "dir", type = "integer" },
    { name = "dirCat", type = "integer" },
    { name = "speed", type = "number" },
]

[[resources]]
name = "world-110m"
type = "json"
path = "world-110m.json"
scheme = "file"
format = "topojson"
mediatype = "text/topojson"
encoding = "utf-8"

[[resources]]
name = "zipcodes"
type = "table"
path = "zipcodes.csv"
scheme = "file"
format = "csv"
mediatype = "text/csv"
encoding = "utf-8"

[resources.schema]
fields = [
    { name = "zip_code", type = "integer" },
    { name = "latitude", type = "number" },
    { name = "longitude", type = "number" },
    { name = "city", type = "string" },
    { name = "state", type = "string" },
    { name = "county", type = "string" },
]

I generated the above with the following diff:

build_datapackage.py changes

diff --git a/scripts/build_datapackage.py b/scripts/build_datapackage.py
index 30834a6..88566b2 100755
--- a/scripts/build_datapackage.py
+++ b/scripts/build_datapackage.py
@@ -5,6 +5,7 @@
 # dependencies = [
 #     "frictionless[json,parquet]",
 #     "polars",
+#     "tomli-w",
 # ]
 # ///
 """
@@ -306,6 +307,19 @@ def iter_resources(data_root: Path, /) -> Iterator[Resource]:
             continue
 
 
+def to_toml(pkg: Package, fp: Path | None = None):
+    import tomli_w
+
+    mapping = pkg.to_dict()
+
+    if fp:
+        fp.touch()
+        with fp.open("wb") as f:
+            tomli_w.dump(mapping, f)
+    else:
+        return tomli_w.dumps(mapping)
+
+
 def main(
     *,
     stem: str = "datapackage",
@@ -333,6 +347,7 @@ def main(
         p = (repo_dir / f"{stem}.yaml").as_posix()
         logger.info(f"Writing {p!r}")
         pkg.to_yaml(p)
+    to_toml(pkg, repo_dir / f"{stem}.toml")
 
 
 if __name__ == "__main__":

countries.json

Below is what the addition of the sources + license in your reply would change in toml

toml

[[resources]]
name = "countries"
type = "table"
path = "countries.json"
scheme = "file"
format = "json"
mediatype = "text/json"
encoding = "utf-8"
description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[resources.dialect.json]
keyed = true

[resources.schema]
fields = [
    { name = "_comment", type = "string" },
    { name = "year", type = "integer" },
    { name = "fertility", type = "number" },
    { name = "life_expect", type = "number" },
    { name = "n_fertility", type = "number" },
    { name = "n_life_expect", type = "number" },
    { name = "country", type = "string" },
]

[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.licenses]]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

And then converted back to json

json

{
  "resources": [
    {
      "name": "countries",
      "type": "table",
      "path": "countries.json",
      "scheme": "file",
      "format": "json",
      "mediatype": "text/json",
      "encoding": "utf-8",
      "description": "This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to \"show people the big picture\" rather than support detailed numeric analysis.\r\n",
      "dialect": {
        "json": {
          "keyed": true
        }
      },
      "schema": {
        "fields": [
          {
            "name": "_comment",
            "type": "string"
          },
          {
            "name": "year",
            "type": "integer"
          },
          {
            "name": "fertility",
            "type": "number"
          },
          {
            "name": "life_expect",
            "type": "number"
          },
          {
            "name": "n_fertility",
            "type": "number"
          },
          {
            "name": "n_life_expect",
            "type": "number"
          },
          {
            "name": "country",
            "type": "string"
          }
        ]
      },
      "sources": [
        {
          "title": "Gapminder Foundation - Life Expectancy",
          "path": "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676",
          "version": "v14"
        },
        {
          "title": "Gapminder Foundation - Fertility",
          "path": "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676",
          "version": "v14"
        }
      ],
      "licenses": [
        {
          "name": "CC-BY-4.0",
          "path": "https://www.gapminder.org/free-material/",
          "title": "Creative Commons Attribution 4.0 International"
        }
      ]
    }
  ]
}

Suggestion

Match the Package schema exactly.
But require only "name", "path", and the extrinsic fields ("sources", "licenses", "description", ...) in a [[resources]] table.

Then we can just merge in the intrinsic metadata by matching the path
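
A minimal sketch of what merging by path could look like, assuming both the detected and the hand-written resources are already loaded as plain lists of dicts. `merge_by_path` and the sample data are hypothetical and not part of build_datapackage.py; letting the hand-written keys win on conflict is also an assumption.

```python
from __future__ import annotations

from typing import Any


def merge_by_path(
    intrinsic: list[dict[str, Any]],
    extrinsic: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Combine detected (intrinsic) and hand-written (extrinsic) resources.

    Hypothetical helper: resources are matched on their "path" key, and
    hand-written keys win over detected ones on conflict (an assumption).
    """
    by_path = {res["path"]: res for res in extrinsic}
    merged = []
    for res in intrinsic:
        # Resources without a hand-written entry pass through unchanged.
        extra = by_path.get(res["path"], {})
        merged.append({**res, **extra})
    return merged


# Made-up sample data, loosely modelled on the "countries" resource.
detected = [{"name": "countries", "path": "countries.json", "format": "json"}]
written = [
    {
        "name": "countries",
        "path": "countries.json",
        "licenses": [{"name": "CC-BY-4.0"}],
    }
]
result = merge_by_path(detected, written)
print(result[0]["licenses"])
```

Keying on "path" rather than "name" avoids ambiguity when two files share a stem.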

@dangotbanned
Member

dangotbanned commented Dec 1, 2024

Looks good. Why are some entries in single [ and some in double [[?

@domoritz interesting timing on this - hopefully the examples in (#634 (comment)) can explain the nesting.

I'm not sure @dsmedia's sample would translate directly into the target schema.

This is what SOURCES.toml converts to

{
  "package": {
    "license": {
      "name": "BSD-3-Clause",
      "path": "https://opensource.org/license/bsd-3-clause",
      "title": "The 3-Clause BSD License",
      "budget": {
        "description": "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"
      }
    }
  },
  "budget": {
    "sources": {
      "title": "Office of Management and Budget (U.S.)",
      "path": "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3",
      "countries": {
        "description": "This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to \"show people the big picture\" rather than support detailed numeric analysis.\n"
      }
    }
  },
  "countries": {
    "sources": [
      {
        "title": "Gapminder Foundation - Life Expectancy",
        "path": "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676",
        "version": "v14"
      },
      {
        "title": "Gapminder Foundation - Fertility",
        "path": "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676",
        "version": "v14"
      }
    ],
    "licenses": {
      "name": "CC-BY-4.0",
      "path": "https://www.gapminder.org/free-material/",
      "title": "Creative Commons Attribution 4.0 International",
      "gapminder": {
        "description": "This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to \"show people the big picture\" rather than support detailed numeric analysis.\n"
      }
    }
  },
  "gapminder": {
    "sources": [
      {
        "title": "Gapminder Foundation - Life Expectancy",
        "path": "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676",
        "version": "v14"
      },
      {
        "title": "Gapminder Foundation - Population",
        "path": "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676",
        "version": "v7"
      },
      {
        "title": "Gapminder Foundation - Fertility",
        "path": "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676",
        "version": "v14"
      },
      {
        "title": "Gapminder Foundation - Data Geographies",
        "path": "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158",
        "version": "v2"
      }
    ],
    "licenses": {
      "name": "CC-BY-4.0",
      "path": "https://www.gapminder.org/free-material/",
      "title": "Creative Commons Attribution 4.0 International"
    }
  }
}

@dsmedia
Copy link
Collaborator Author

dsmedia commented Dec 1, 2024

Got it. So something like the following?

sources.toml
$schema = "https://datapackage.org/profiles/2.0/datapackage.json"

# Package-level metadata using inline tables
name = "vega-datasets"
description = "Common repository for example datasets used by Vega related projects."
homepage = "http://github.com/vega/vega-datasets.git"
sources = [
    { path = "https://github.com/vega/vega-datasets/blob/next/SOURCES.md" },
]
contributors = [
    { title = "UW Interactive Data Lab", path = "http://idl.cs.washington.edu" },
]
version = "2.11.0"
created = "2024-12-01T15:50:47.863271+00:00"

[[licenses]]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"

# Resources array
[[resources]]
name = "budget"
path = "budget.json"
description = "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"

[[resources.sources]]
title = "Office of Management and Budget (U.S.)" 
path = "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"

[[resources]]
name = "countries"
path = "countries.json"
description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.licenses]]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

[[resources]]
name = "gapminder"
path = "gapminder.json"
description = """
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""

[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.sources]]
title = "Gapminder Foundation - Population"
path = "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
version = "v7"

[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[resources.sources]]
title = "Gapminder Foundation - Data Geographies"
path = "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
version = "v2"

[[resources.licenses]]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"
  • I added a root-level package data descriptor per my understanding of the recommendations of the spec, and also hard-coded the package-level metadata in the TOML. (I assume it's better to have this here than in the generation script?)
  • @domoritz just wanted to confirm that the package-level contributor metadata in package.json is still current/complete:

"author": {
  "name": "UW Interactive Data Lab",
  "url": "http://idl.cs.washington.edu"
}

  • Quick question about the workflow - how should conflicts be handled if there's overlap between metadata in the TOML file and what's automatically detected by the build script? These might be inadvertent, or there might be cases where the intrinsic metadata doesn't generate properly from the script, and

@dangotbanned
Member

dangotbanned commented Dec 1, 2024

@dsmedia #634 (comment) looks good

and also hard-coded the package-level metadata in the TOML. (I assume it's better to have this here than in the generation script?)

Happy for this to be moved out of build_datapackage.py, except for the bits that are dynamic like:

version = "2.11.0"
created = "2024-12-01T15:50:47.863271+00:00"

Quick question about the workflow - how should conflicts be handled if there's overlap between metadata in the TOML file and what's automatically detected by the build script?

Whatever is in the .toml should have a higher precedence.
I don't think there would be any conflicts with what you have so far.

But I imagine it could be helpful to manually define the parts we discover are detected incorrectly like:

Incorrect schema detection

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 1229 to 1234 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 2065 to 2070 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 2091 to 2096 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 2425 to 2430 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "string"
},

vega-datasets/datapackage.json

Lines 2551 to 2556 in 719c388

"schema": {
"fields": [
{
"name": "date",
"type": "integer"
},

vega-datasets/datapackage.json

Lines 2681 to 2690 in 719c388

"schema": {
"fields": [
{
"name": "symbol",
"type": "string"
},
{
"name": "date",
"type": "string"
},
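
One way the "TOML takes precedence" rule could apply to mis-detected field types is sketched below. `override_field_types` is a hypothetical helper, and the replacement type `"date"` is only illustrative: the real fix for each dataset may need a different type or an explicit format.

```python
from __future__ import annotations

from typing import Any


def override_field_types(
    schema: dict[str, Any], overrides: dict[str, str]
) -> dict[str, Any]:
    """Return a copy of `schema` with field types replaced by name.

    Hypothetical helper: `overrides` maps a field name to the type that
    should take precedence over the detected one.
    """
    fields = [
        {**field, "type": overrides.get(field["name"], field["type"])}
        for field in schema["fields"]
    ]
    return {**schema, "fields": fields}


# Detected schema, as in the last excerpt above.
detected = {
    "fields": [
        {"name": "symbol", "type": "string"},
        {"name": "date", "type": "string"},
    ]
}
# Illustrative manual override; the input schema is left untouched.
fixed = override_field_types(detected, {"date": "date"})
print(fixed["fields"])
```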

@domoritz
Member

domoritz commented Dec 2, 2024

For the package authors, we might want to change all of our packages to be "the Vega organization" or something like that. Can we do that separately from this?

@dsmedia changed the title from "Conversion of SOURCES.md to SOURCES.yaml" to "Creation of SOURCES.toml" on Dec 7, 2024
@dsmedia
Collaborator Author

dsmedia commented Dec 7, 2024

Thanks for the guidance above. For the purposes of resolving this issue, I'll plan to generate a toml file compatible with the existing script that contains a resource description, and, where available, source and license, for each dataset in the repo. This will produce a more complete datapackage.json file when the script is run.

@dsmedia
Collaborator Author

dsmedia commented Dec 7, 2024

@dangotbanned will the below work for the script? I assume some modifications may need to be done to the script, so if it helps to modify the format of the toml file just let me know.

TOML Generation Notes

File Structure

  • Resource names are lowercase, while paths preserve the original filename case.
  • Each resource is defined separately, even when related datasets share descriptions
  • Markdown formatting is preserved in description fields for future rendering
  • Comment separators (# Path: filename) are maintained between resources to facilitate manual edits

Content Handling

  • For ease in generating, description fields contain the complete markdown text from SOURCES.md
  • Some content may be duplicated across description, source, and license fields
  • Source entries are interpreted broadly as reference points, not just direct dataset links

Future Improvements

  • Deduplicate content between description and source/license fields
  • Hard-code intrinsic source data not captured (or captured erroneously) by the current script
  • Generate standalone markdown documentation from the package JSON
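
As a rough idea of the last improvement, the sketch below renders one resource dict as a markdown section. The function name and the heading layout are hypothetical; the sample values are copied from the budget resource earlier in this thread.

```python
from __future__ import annotations

from typing import Any


def resource_to_markdown(resource: dict[str, Any]) -> str:
    """Render one resource as a markdown section (hypothetical layout)."""
    lines = [f"## {resource['path']}", ""]
    description = resource.get("description", "").strip()
    if description:
        lines += [description, ""]
    for key, heading in (("sources", "Sources"), ("licenses", "Licenses")):
        entries = resource.get(key, [])
        if not entries:
            continue
        lines += [f"### {heading}", ""]
        for entry in entries:
            # Fall back to the URL when an entry has no title.
            label = entry.get("title") or entry.get("path", "")
            url = entry.get("path")
            lines.append(f"- [{label}]({url})" if url else f"- {label}")
        lines.append("")
    return "\n".join(lines)


budget = {
    "name": "budget",
    "path": "budget.json",
    "description": "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)",
    "sources": [
        {
            "title": "Office of Management and Budget (U.S.)",
            "path": "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3",
        }
    ],
}
markdown = resource_to_markdown(budget)
print(markdown)
```

Sections with no entries (here, licenses) are skipped rather than rendered empty.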
sources.toml
# Path: 7zip.png
[[resources]]
name = "7zip.png"
path = "7zip.png"
description = """Application icons from open-source software projects."""

# Path: airports.csv
[[resources]]
name = "airports.csv"
path = "airports.csv"

# Path: annual-precip.json
[[resources]]
name = "annual-precip.json"
path = "annual-precip.json"
description = """A raster grid of global annual precipitation for the year 2016 at a resolution of 1 degree of lon/lat per cell, from [CFSv2](https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/climate-forecast-system-version2-cfsv2)."""

[[resources.sources]]
title = "Climate Forecast System Version 2"
path = "https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/climate-forecast-system-version2-cfsv2"


# Path: anscombe.json
[[resources]]
name = "anscombe.json"
path = "anscombe.json"
description = """Graphs in Statistical Analysis, F. J. Anscombe, The American Statistician."""

# Path: barley.json
[[resources]]
name = "barley.json"
path = "barley.json"
description = """The result of a 1930s agricultural experiment in Minnesota, this dataset contains yields for 10 different varieties of barley at six different sites. It was first published by agronomists F.R. Immer, H.K. Hayes, and L. Powers in the 1934 paper \"Statistical Determination of Barley Varietal Adaption.\" R.A. Fisher popularized its use in the field of statistics when he included it in his book [\"The Design of Experiments.\"](https://en.wikipedia.org/wiki/The_Design_of_Experiments) Since then it has been used to demonstrate new statistical techniques, including the [trellis charts](http://ml.stat.purdue.edu/stat695t/writings/TrellisDesignControl.pdf) developed by Richard Becker, William Cleveland and others in the 1990s."""

[[resources.sources]]
title = "The Design of Experiments Reference"
path = "https://en.wikipedia.org/wiki/The_Design_of_Experiments"

[[resources.sources]]
title = "Trellis Charts Paper"
path = "http://ml.stat.purdue.edu/stat695t/writings/TrellisDesignControl.pdf"

# Path: birdstrikes.csv
[[resources]]
name = "birdstrikes.csv"
path = "birdstrikes.csv"
description = """http://wildlife.faa.gov"""

[[resources.sources]]
title = "FAA Wildlife Strike Database"
path = "http://wildlife.faa.gov"

# Path: budget.json
[[resources]]
name = "budget.json"
path = "budget.json"
description = """Source: Office of Management and Budget (U.S.)
[Budget FY 2016 - Receipts](https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3)"""

[[resources.sources]]
title = "Office of Management and Budget - Budget FY 2016"
path = "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"


# Path: budgets.json
[[resources]]
name = "budgets.json"
path = "budgets.json"

# Path: burtin.json
[[resources]]
name = "burtin.json"
path = "burtin.json"
description = """The burtin.json dataset is based on graphic designer [Will Burtin's](https://en.wikipedia.org/wiki/Will_Burtin) 1951 visualization of antibiotic effectiveness, originally published in [Scope Magazine](https://graphicdesignarchives.org/projects/scope-magazine-vol-iii-5/). The dataset compares the performance of three antibiotics against 16 different bacteria. The numerical values in the dataset represent the minimum inhibitory concentration (MIC) of each antibiotic, measured in units per milliliter, with lower values indicating higher antibiotic effectiveness. The dataset was featured as an [example](https://mbostock.github.io/protovis/ex/antibiotics-burtin.html) in the Protovis project, a precursor to D3.js. The Protovis example notes that, \"Recreating this display revealed some minor errors in the original: a missing grid line at 0.01 μg/ml, and an exaggeration of some values for penicillin.\" The vega-datasets version is largely consistent with the Protovis version of the dataset, with one correction (changing 'Brucella antracis' to the correct 'Bacillus anthracis') and the addition of a new column, 'Genus', to group related bacterial species together.
The caption of the original 1951 [visualization](https://graphicdesignarchives.org/wp-content/uploads/wmgda_8616c.jpg) reads as follows:
> ## Antibacterial ranges of Neomycin, Penicillin and Streptomycin
> 
> The chart compares the in vitro sensitivities to neomycin of some of the common pathogens (gram+ in red and gram- in blue) with their sensitivities to penicillin, and streptomycin. The effectiveness of the antibiotics is expressed as the highest dilution in μ/ml. which inhibits the test organism. High dilutions are toward the periphery; consequently the length of the colored bar is proportional to the effectiveness. It is apparent that neomycin is especially effective against Staph. albus and aureus, Streph. fecalis, A. aerogenes, S. typhosa, E. coli, Ps. aeruginosa, Br. abortus, K. pneumoniae, Pr. vulgaris, S. schottmuelleri and M. tuberculosis. Unfortunately, some strains of proteus, pseudomonas and hemolytic streptococcus are resistant to neomycin, although the majority of these are sensitive to neomycin. It also inhibits actinomycetes, but is inactive against viruses and fungi. Its mode of action is not understood."""

[[resources.sources]]
title = "Scope Magazine"
path = "https://graphicdesignarchives.org/projects/scope-magazine-vol-iii-5/"

[[resources.sources]]
title = "Protovis Antibiotics Example"
path = "https://mbostock.github.io/protovis/ex/antibiotics-burtin.html"

# Path: cars.json
[[resources]]
name = "cars.json"
path = "cars.json"
description = """http://lib.stat.cmu.edu/datasets/"""

[[resources.sources]]
title = "StatLib Datasets Archive"
path = "http://lib.stat.cmu.edu/datasets/"

# Path: co2-concentration.csv
[[resources]]
name = "co2-concentration.csv"
path = "co2-concentration.csv"
description = """Modified from https://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record to include only date, CO2, and seasonally adjusted CO2, and only rows with valid data."""

[[resources.sources]]
title = "Scripps CO2 Program"
path = "https://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record"

# Path: countries.json
[[resources]]
name = "countries.json"
path = "countries.json"
description = """- **Original Data**: [Gapminder Foundation](https://www.gapminder.org/)
- **URLs**: 
  - Life Expectancy (v14): [Data](https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676) | [Reference](https://www.gapminder.org/data/documentation/gd004/)
  - Fertility (v14): [Data](https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676) | [Reference](https://www.gapminder.org/data/documentation/gd008/) 
- **Date Accessed**: July 31, 2024
- **License**: Creative Commons Attribution 4.0 International (CC BY 4.0) | [Reference](https://www.gapminder.org/free-material/)
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to \"show people the big picture\" rather than support detailed numeric analysis.
1. `year` (type: integer): Years from 1955 to 2000 at 5-year intervals
2. `country` (type: string): Name of the country
3. `fertility` (type: float): Fertility rate (average number of children per woman) for the given year
4. `life_expect` (type: float): Life expectancy in years for the given year
5. `p_fertility` (type: float): Fertility rate for the previous 5-year interval
6. `n_fertility` (type: float): Fertility rate for the next 5-year interval
7. `p_life_expect` (type: float): Life expectancy for the previous 5-year interval
8. `n_life_expect` (type: float): Life expectancy for the next 5-year interval"""

[[resources.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "14"

[[resources.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "14"

[[resources.licenses]]
name = "CC-BY-4.0"
title = "Creative Commons Attribution 4.0 International"
path = "https://www.gapminder.org/free-material/"

# Path: crimea.json
[[resources]]
name = "crimea.json"
path = "crimea.json"

# Path: disasters.csv
[[resources]]
name = "disasters.csv"
path = "disasters.csv"
description = """https://ourworldindata.org/natural-catastrophes"""

[[resources.sources]]
title = "Our World in Data - Natural Catastrophes"
path = "https://ourworldindata.org/natural-catastrophes"

# Path: driving.json
[[resources]]
name = "driving.json"
path = "driving.json"
description = """https://archive.nytimes.com/www.nytimes.com/imagepages/2010/05/02/business/02metrics.html"""

[[resources.sources]]
title = "New York Times"
path = "https://archive.nytimes.com/www.nytimes.com/imagepages/2010/05/02/business/02metrics.html"

# Path: earthquakes.json
[[resources]]
name = "earthquakes.json"
path = "earthquakes.json"
description = """https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.geojson
(Feb 6, 2018)"""

[[resources.sources]]
title = "USGS Earthquake Feed"
path = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.geojson"

# Path: ffox.png
[[resources]]
name = "ffox.png"
path = "ffox.png"
description = """Application icons from open-source software projects."""

# Path: flare-dependencies.json
[[resources]]
name = "flare-dependencies.json"
path = "flare-dependencies.json"

# Path: flare.json
[[resources]]
name = "flare.json"
path = "flare.json"

# Path: flights-10k.json
[[resources]]
name = "flights-10k.json"
path = "flights-10k.json"
description = """Flight delay statistics from U.S. Bureau of Transportation Statistics. https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr 
Transformed using `/scripts/flights.py`"""

[[resources.sources]]
title = "U.S. Bureau of Transportation Statistics"
path = "https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr"

# Path: flights-200k.arrow
[[resources]]
name = "flights-200k.arrow"
path = "flights-200k.arrow"
description = """Flight delay statistics from U.S. Bureau of Transportation Statistics. https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr 
Transformed using `/scripts/flights.py`"""

[[resources.sources]]
title = "U.S. Bureau of Transportation Statistics"
path = "https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr"

# Path: flights-200k.json
[[resources]]
name = "flights-200k.json"
path = "flights-200k.json"
description = """Flight delay statistics from U.S. Bureau of Transportation Statistics. https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr 
Transformed using `/scripts/flights.py`"""

[[resources.sources]]
title = "U.S. Bureau of Transportation Statistics"
path = "https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr"

# Path: flights-20k.json
[[resources]]
name = "flights-20k.json"
path = "flights-20k.json"
description = """Flight delay statistics from U.S. Bureau of Transportation Statistics. https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr 
Transformed using `/scripts/flights.py`"""

[[resources.sources]]
title = "U.S. Bureau of Transportation Statistics"
path = "https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr"

# Path: flights-2k.json
[[resources]]
name = "flights-2k.json"
path = "flights-2k.json"
description = """Flight delay statistics from U.S. Bureau of Transportation Statistics. https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr 
Transformed using `/scripts/flights.py`"""

[[resources.sources]]
title = "U.S. Bureau of Transportation Statistics"
path = "https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr"

# Path: flights-3m.parquet
[[resources]]
name = "flights-3m.parquet"
path = "flights-3m.parquet"
description = """Flight delay statistics from U.S. Bureau of Transportation Statistics. https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr 
Transformed using `/scripts/flights.py`"""

[[resources.sources]]
title = "U.S. Bureau of Transportation Statistics"
path = "https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr"

# Path: flights-5k.json
[[resources]]
name = "flights-5k.json"
path = "flights-5k.json"
description = """Flight delay statistics from U.S. Bureau of Transportation Statistics. https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr 
Transformed using `/scripts/flights.py`"""

[[resources.sources]]
title = "U.S. Bureau of Transportation Statistics"
path = "https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr"

# Path: flights-airport.csv
[[resources]]
name = "flights-airport.csv"
path = "flights-airport.csv"
description = """Flight delay statistics from U.S. Bureau of Transportation Statistics. https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr 
Transformed using `/scripts/flights.py`"""

[[resources.sources]]
title = "U.S. Bureau of Transportation Statistics"
path = "https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr"

# Path: football.json
[[resources]]
name = "football.json"
path = "football.json"
description = """Football match outcomes across multiple divisions from 2013 to 2017. This dataset is a subset of a larger dataset from https://github.com/openfootball/football.json. The subset was made such that there are records for all five chosen divisions over the time period."""

[[resources.sources]]
title = "OpenFootball"
path = "https://github.com/openfootball/football.json"

# Path: gapminder-health-income.csv
[[resources]]
name = "gapminder-health-income.csv"
path = "gapminder-health-income.csv"
description = """**Original Data**: [Gapminder Foundation](https://www.gapminder.org/)
**Description**: Per-capita income, life expectancy, population and regional grouping. Dataset does not specify the reference year for the data. Gapminder historical data is subject to revisions.
Gapminder (v30, 2023) defines per-capita income as follows: 
>\"This is real GDP per capita (gross domestic product per person adjusted for inflation) converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States.\" | [Source](https://docs.google.com/spreadsheets/d/1i5AEui3WZNZqh7MQ4AKkJuCz4rRxGR_pw_9gtbcBOqQ/edit?gid=501532268#gid=501532268)
**License**: Creative Commons Attribution 4.0 International (CC BY 4.0) | [Reference](https://www.gapminder.org/free-material/)"""

[[resources.sources]]
title = "Gapminder Foundation"
path = "https://www.gapminder.org"

[[resources.sources]]
title = "Gapminder GDP Per Capita Data"
path = "https://docs.google.com/spreadsheets/d/1i5AEui3WZNZqh7MQ4AKkJuCz4rRxGR_pw_9gtbcBOqQ/edit?gid=501532268#gid=501532268"

# Path: gapminder.json
[[resources]]
name = "gapminder.json"
path = "gapminder.json"
description = """- **Original Data**: [Gapminder Foundation](https://www.gapminder.org/)
- **URLs**: 
  - Life Expectancy (v14): [Data](https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676) | [Reference](https://www.gapminder.org/data/documentation/gd004/)
  - Population (v7): [Data](https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676) | [Reference](https://www.gapminder.org/data/documentation/gd003/)
  - Fertility (v14): [Data](https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676) | [Reference](https://www.gapminder.org/data/documentation/gd008/) 
  - Data Geographies (v2): [Data](https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158) | [Reference](https://www.gapminder.org/data/geo/)
- **Date Accessed**: July 11, 2024
- **License**: Creative Commons Attribution 4.0 International (CC BY 4.0) | [Reference](https://www.gapminder.org/free-material/)
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to \"show people the big picture\" rather than support detailed numeric analysis.
1. `year` (type: integer): Years from 1955 to 2005 at 5-year intervals
2. `country` (type: string): Name of the country
3. `cluster` (type: integer): A categorical variable (values 0-5) grouping countries. See Revision Notes for details.
4. `pop` (type: integer): Population of the country
5. `life_expect` (type: float): Life expectancy in years
6. `fertility` (type: float): Fertility rate (average number of children per woman)
Revision Notes:
1. Country Selection: The set of countries in this file matches the version of this dataset originally added to this collection in 2015. The specific criteria for country selection in that version are not known. Data for Aruba are no longer available in the new version. Hong Kong has been revised to Hong Kong, China in the new version.
2. Data Precision: The precision of float values may have changed from the original version. These changes reflect the most recent source data used for each indicator.
3. Regional Groupings: The 'cluster' column represents a regional mapping of countries corresponding to the 'six_regions' schema in Gapminder's Data Geographies dataset. To preserve continuity with previous versions of this dataset, we have retained the column name 'cluster' instead of renaming it to 'six_regions'. The six regions represented are:
`0: south_asia, 1: europe_central_asia, 2: sub_saharan_africa, 3: america, 4: east_asia_pacific, 5: middle_east_north_africa`."""

[[resources.sources]]
title = "Gapminder Foundation - Population"
path = "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
version = "7"

[[resources.sources]]
title = "Gapminder Foundation - Data Geographies"
path = "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
version = "2"

[[resources.sources]]
title = "Gapminder Data Documentation"
path = "https://www.gapminder.org/data/documentation/"

# Path: gimp.png
[[resources]]
name = "gimp.png"
path = "gimp.png"
description = """Application icons from open-source software projects."""

# Path: github.csv
[[resources]]
name = "github.csv"
path = "github.csv"
description = """Generated using `/scripts/github.py`."""

# Path: global-temp.csv
[[resources]]
name = "global-temp.csv"
path = "global-temp.csv"
description = """Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies (Land-Ocean Temperature Index, L-OTI), 1880-2023. Source: NASA's Goddard Institute for Space Studies https://data.giss.nasa.gov/gistemp/"""

[[resources.sources]]
title = "NASA Goddard Institute for Space Studies"
path = "https://data.giss.nasa.gov/gistemp/"


# Path: income.json
[[resources]]
name = "income.json"
path = "income.json"

# Path: iowa-electricity.csv
[[resources]]
name = "iowa-electricity.csv"
path = "iowa-electricity.csv"
description = """The state of Iowa has dramatically increased its production of renewable wind power in recent years. This file contains the annual net generation of electricity in the state by source in thousand megawatthours. The dataset was compiled by the [U.S. Energy Information Administration](https://www.eia.gov/beta/electricity/data/browser/#/topic/0?agg=2,0,1&fuel=vvg&geo=00000g&sec=g&linechart=ELEC.GEN.OTH-IA-99.A~ELEC.GEN.COW-IA-99.A~ELEC.GEN.PEL-IA-99.A~ELEC.GEN.PC-IA-99.A~ELEC.GEN.NG-IA-99.A~~ELEC.GEN.NUC-IA-99.A~ELEC.GEN.HYC-IA-99.A~ELEC.GEN.AOR-IA-99.A~ELEC.GEN.HPS-IA-99.A~&columnchart=ELEC.GEN.ALL-IA-99.A&map=ELEC.GEN.ALL-IA-99.A&freq=A&start=2001&end=2017&ctype=linechart&ltype=pin&tab=overview&maptype=0&rse=0&pin=) and downloaded on May 6, 2018. It is useful for illustrating stacked area charts."""

[[resources.sources]]
title = "U.S. Energy Information Administration"
path = "https://www.eia.gov/beta/electricity/data/browser/#/topic/0?agg=2,0,1&fuel=vvg&geo=00000g&sec=g&linechart=ELEC.GEN.OTH-IA-99.A~ELEC.GEN.COW-IA-99.A~ELEC.GEN.PEL-IA-99.A~ELEC.GEN.PC-IA-99.A~ELEC.GEN.NG-IA-99.A~~ELEC.GEN.NUC-IA-99.A~ELEC.GEN.HYC-IA-99.A~ELEC.GEN.AOR-IA-99.A~ELEC.GEN.HPS-IA-99.A~&columnchart=ELEC.GEN.ALL-IA-99.A&map=ELEC.GEN.ALL-IA-99.A&freq=A&start=2001&end=2017&ctype=linechart&ltype=pin&tab=overview&maptype=0&rse=0&pin="

# Path: jobs.json
[[resources]]
name = "jobs.json"
path = "jobs.json"
description = """U.S. census data on [occupations](https://usa.ipums.org/usa-action/variables/OCC1950#codes_section) by sex and year across decades between 1850 and 2000. The dataset was obtained from IPUMS USA, which \"collects, preserves and harmonizes U.S. census microdata\" from as early as 1790.
Originally created for a 2006 data visualization project called *sense.us* by IBM Research (Jeff Heer, Martin Wattenberg and Fernanda Viégas), described [here](https://homes.cs.washington.edu/~jheer/files/bdata_ch12.pdf). The dataset is also referenced in this vega [example](https://vega.github.io/vega/examples/job-voyager/).
Data is based on a tabulation of the [OCC1950](https://usa.ipums.org/usa-action/variables/OCC1950) variable by sex across IPUMS USA samples. The dataset appears to be derived from Version 6.0 (2015) of [IPUMS USA](https://usa.ipums.org/usa/), according to 2024 correspondence with the IPUMS Project. IPUMS has made improvements to occupation coding since version 6, particularly for 19th-century samples, which may result in discrepancies between this dataset and current IPUMS data. Details on data revisions are available [here](https://usa.ipums.org/usa-action/revisions).
The dataset is structured as follows:
- job: The occupation title
- sex: Sex (men/women)
- year: Census year
- count: Number of individuals in the occupation
- perc: Percentage of the workforce in the occupation
IPUMS USA confirmed in 2024 correspondence that hosting this dataset on vega-datasets is permissible, stating:
>We're excited to hear that this dataset made its way to this repository and is being used by students for data visualization. We allow for these types of redistributions of summary data so long as the underlying microdata records are not shared.
This dataset contains only summary statistics and does not include any underlying microdata records.
1. This dataset represents summary data. The underlying microdata records are not included.
2. Users attempting to replicate or extend this data should use the [PERWT](https://usa.ipums.org/usa-action/variables/PERWT#description_section) (person weight) variable as an expansion factor when working with IPUMS USA extracts.
3. Due to coding revisions, figures for earlier years (particularly 19th century) may not match current IPUMS USA data exactly.
When using this dataset, please refer to IPUMS USA [terms of use](https://usa.ipums.org/usa/terms.shtml). The organization requests use of the following citation for this json file: 
Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0. Minneapolis: University of Minnesota, 2015. http://doi.org/10.18128/D010.V6.0"""

[[resources.sources]]
title = "IPUMS USA"
path = "https://usa.ipums.org/usa/"

# Path: la-riots.csv
[[resources]]
name = "la-riots.csv"
path = "la-riots.csv"
description = """More than 60 people lost their lives amid the looting and fires that ravaged Los Angeles for five days starting on April 29, 1992. This file contains metadata about each person, including the geographic coordinates of their death. It was compiled and published by the [Los Angeles Times Data Desk](http://spreadsheets.latimes.com/la-riots-deaths/)."""

[[resources.sources]]
title = "Los Angeles Times Data Desk"
path = "http://spreadsheets.latimes.com/la-riots-deaths/"

# Path: londonBoroughs.json
[[resources]]
name = "londonBoroughs.json"
path = "londonBoroughs.json"
description = """Boundaries of London boroughs reprojected and simplified from `London_Borough_Excluding_MHW` shapefile held at https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london. Original data \"contains National Statistics data © Crown copyright and database right (2015)\" and \"Contains Ordnance Survey data © Crown copyright and database right [2015].\""""

[[resources.sources]]
title = "London Datastore"
path = "https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london"

# Path: londonCentroids.json
[[resources]]
name = "londonCentroids.json"
path = "londonCentroids.json"
description = """Calculated from `londonBoroughs.json` using `d3.geoCentroid`."""

# Path: londonTubeLines.json
[[resources]]
name = "londonTubeLines.json"
path = "londonTubeLines.json"
description = """Selected rail lines simplified from `tfl_lines.json` at https://github.com/oobrien/vis/tree/master/tube/data"""

[[resources.sources]]
title = "London Tube Data"
path = "https://github.com/oobrien/vis/tree/master/tube/data"

# Path: lookup_groups.csv
[[resources]]
name = "lookup_groups.csv"
path = "lookup_groups.csv"

# Path: lookup_people.csv
[[resources]]
name = "lookup_people.csv"
path = "lookup_people.csv"

# Path: miserables.json
[[resources]]
name = "miserables.json"
path = "miserables.json"

# Path: monarchs.json
[[resources]]
name = "monarchs.json"
path = "monarchs.json"
description = """A chronological list of English and British monarchs from Elizabeth I through George IV. 
Each entry includes:
- `name`: The ruler's name or identifier (e.g., \"W&M\" for William and Mary, \"Cromwell\" for the period of interregnum)
- `start`: The year their rule began.
- `end`: The year their rule ended
- `index`: A [zero-based sequential number](https://en.wikipedia.org/wiki/Zero-based_numbering) assigned to each entry, representing the chronological order of rulers
- `commonwealth`: A Boolean flag (true) for the period from 1649 to 1660. This field is omitted for all other entries. 
The dataset contains two intentional inaccuracies to maintain compatibility with the [Wheat and Wages](https://vega.github.io/vega/examples/wheat-and-wages/) example visualization: 
1. the start date for the reign of Elizabeth I is shown as 1565, instead of 1558; 
2. the end date for the reign of George IV is shown as 1820, instead of 1830. 
These discrepancies align the `monarchs.json` dataset with the start and end dates of the `wheat.json` dataset used in the visualization.
The entry \"W&M\" represents the joint reign of William III and Mary II. While the dataset shows their reign as 1689-1702, the official Web site of the British royal family indicates that Mary II's reign ended in 1694, though William III continued to rule until 1702.
The `commonwealth` field is used to flag the period from 1649 to 1660, which includes the Commonwealth of England, the Protectorate, and the period leading to the Restoration. While historically more accurate to call this the \"interregnum,\" the field name of `commonwealth` from the original dataset is retained for backwards compatibility.
The dataset was revised in Aug. 2024. James II's reign now ends in 1688 (previously 1689).
Source data has been verified against the [kings & queens](https://www.royal.uk/kings-and-queens-1066) and [interregnum](https://www.royal.uk/interregnum-1649-1660) pages of the [official website of the British royal family](https://www.royal.uk) (retrieved in Aug. 2024). Content on the site is protected by Crown Copyright. Under the [UK Government Licensing Framework](https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/), most Crown copyright information is available under the [Open Government Licence v3.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)."""

[[resources.sources]]
title = "The Royal Family"
path = "https://www.royal.uk/kings-and-queens-1066"

# Path: movies.json
[[resources]]
name = "movies.json"
path = "movies.json"
description = """The dataset has well-known and intentionally included errors. It is used for instructional purposes, including teaching the need to reckon with dirty data."""

# Path: normal-2d.json
[[resources]]
name = "normal-2d.json"
path = "normal-2d.json"

# Path: obesity.json
[[resources]]
name = "obesity.json"
path = "obesity.json"

# Path: ohlc.json
[[resources]]
name = "ohlc.json"
path = "ohlc.json"
description = """This dataset contains the performance of the Chicago Board Options Exchange [Volatility Index](https://en.wikipedia.org/wiki/VIX) ([VIX](https://finance.yahoo.com/chart/%5EVIX?ltr=1#eyJpbnRlcnZhbCI6ImRheSIsInBlcmlvZGljaXR5IjoxLCJ0aW1lVW5pdCI6bnVsbCwiY2FuZGxlV2lkdGgiOjgsInZvbHVtZVVuZGVybGF5Ijp0cnVlLCJhZGoiOnRydWUsImNyb3NzaGFpciI6dHJ1ZSwiY2hhcnRUeXBlIjoibGluZSIsImV4dGVuZGVkIjpmYWxzZSwibWFya2V0U2Vzc2lvbnMiOnt9LCJhZ2dyZWdhdGlvblR5cGUiOiJvaGxjIiwiY2hhcnRTY2FsZSI6ImxpbmVhciIsInN0dWRpZXMiOnsidm9sIHVuZHIiOnsidHlwZSI6InZvbCB1bmRyIiwiaW5wdXRzIjp7ImlkIjoidm9sIHVuZHIiLCJkaXNwbGF5Ijoidm9sIHVuZHIifSwib3V0cHV0cyI6eyJVcCBWb2x1bWUiOiIjMDBiMDYxIiwiRG93biBWb2x1bWUiOiIjRkYzMzNBIn0sInBhbmVsIjoiY2hhcnQiLCJwYXJhbWV0ZXJzIjp7IndpZHRoRmFjdG9yIjowLjQ1LCJjaGFydE5hbWUiOiJjaGFydCJ9fX0sInBhbmVscyI6eyJjaGFydCI6eyJwZXJjZW50IjoxLCJkaXNwbGF5IjoiXlZJWCIsImNoYXJ0TmFtZSI6ImNoYXJ0IiwidG9wIjowfX0sInNldFNwYW4iOnt9LCJsaW5lV2lkdGgiOjIsInN0cmlwZWRCYWNrZ3JvdWQiOnRydWUsImV2ZW50cyI6dHJ1ZSwiY29sb3IiOiIjMDA4MWYyIiwiZXZlbnRNYXAiOnsiY29ycG9yYXRlIjp7ImRpdnMiOnRydWUsInNwbGl0cyI6dHJ1ZX0sInNpZ0RldiI6e319LCJzeW1ib2xzIjpbeyJzeW1ib2wiOiJeVklYIiwic3ltYm9sT2JqZWN0Ijp7InN5bWJvbCI6Il5WSVgifSwicGVyaW9kaWNpdHkiOjEsImludGVydmFsIjoiZGF5IiwidGltZVVuaXQiOm51bGwsInNldFNwYW4iOnt9fV19)) in the summer of 2009."""

[[resources.sources]]
title = "Yahoo Finance VIX Data"
path = "https://finance.yahoo.com/chart/%5EVIX"

# Path: penguins.json
[[resources]]
name = "penguins.json"
path = "penguins.json"
description = """Palmer Archipelago (Antarctica) penguin data collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/). For more information visit [allisonhorst/penguins](https://github.com/allisonhorst/penguins) on GitHub."""

[[resources.sources]]
title = "Palmer Station Antarctica LTER"
path = "https://pal.lternet.edu/"

[[resources.sources]]
title = "Allison Horst's Penguins Repository"
path = "https://github.com/allisonhorst/penguins"

# Path: platformer-terrain.json
[[resources]]
name = "platformer-terrain.json"
path = "platformer-terrain.json"
description = """Assets from the video game [Celeste](http://www.celestegame.com/)."""

[[resources.sources]]
title = "Celeste Game"
path = "http://www.celestegame.com/"

# Path: points.json
[[resources]]
name = "points.json"
path = "points.json"

# Path: political-contributions.json
[[resources]]
name = "political-contributions.json"
path = "political-contributions.json"
description = """Summary financial information on contributions to candidates for U.S. elections. An updated version of this dataset is available from the \"all candidates\" files (in pipe-delimited format) on the [bulk data download](https://www.fec.gov/data/browse-data/?tab=bulk-data) page of the U.S. Federal Election Commission, or, alternatively, via [OpenFEC](https://api.open.fec.gov/developers/). Information on each of the 25 columns is available from the [FEC All Candidates File Description](https://www.fec.gov/campaign-finance-data/all-candidates-file-description/). The sample dataset in `political-contributions.json` contains 58 records with dates from 2015.
FEC data is subject to the commission's:
- [Sale or Use Policy](https://www.fec.gov/updates/sale-or-use-contributor-information/)
- [Privacy and Security Policy](https://www.fec.gov/about/privacy-and-security-policy/)
- [Acceptable Use Policy](https://github.com/fecgov/FEC/blob/master/ACCEPTABLE-USE-POLICY.md)
Additionally, the FEC's GitHub [repository](https://github.com/fecgov/FEC) states:
> This project is in the public domain within the United States, and we waive worldwide copyright and related rights through [CC0 universal public domain](https://creativecommons.org/publicdomain/zero/1.0/) dedication. Read more on our license page. A few restrictions limit the way you can use FEC data. For example, you can't use contributor lists for commercial purposes or to solicit donations. Learn more on [FEC.gov](https://www.fec.gov/)."""

[[resources.sources]]
title = "Federal Election Commission Bulk Data"
path = "https://www.fec.gov/data/browse-data/?tab=bulk-data"

[[resources.sources]]
title = "OpenFEC API"
path = "https://api.open.fec.gov/developers/"

# Path: population.json
[[resources]]
name = "population.json"
path = "population.json"
description = """United States population statistics by sex and age group across decades between 1850 and 2000. The dataset was obtained from [IPUMS USA](https://usa.ipums.org/usa/), which \"collects, preserves and harmonizes U.S. census microdata\" from as early as 1790.
The dataset is structured as follows:
- year: four-digit year of the survey. - [IPUMS description](https://usa.ipums.org/usa-action/variables/YEAR#description_section)
- age: age group in 5-year intervals (0 represents ages 0-4, 5 represents 5-9, 10 represents 10-14, etc., up to 90 representing 90 and above) - [IPUMS description](https://usa.ipums.org/usa-action/variables/AGE#description_section)
- sex: Sex (men = 1 / women = 2) - [IPUMS description](https://usa.ipums.org/usa-action/variables/SEX#description_section)
- people: Number of individuals, equivalent to IPUMS variable name [PERWT](https://usa.ipums.org/usa-action/variables/PERWT#description_section).
IPUMS updates and revises datasets over time, which may result in discrepancies between this dataset and current IPUMS data. Details on data revisions are available [here](https://usa.ipums.org/usa-action/revisions).
When using this dataset, please refer to IPUMS USA [terms of use](https://usa.ipums.org/usa/terms.shtml). The organization requests the use of the following citation for this json file: 
Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0. Minneapolis: University of Minnesota, 2015. http://doi.org/10.18128/D010.V6.0"""

[[resources.sources]]
title = "IPUMS USA"
path = "https://usa.ipums.org/usa/"

# Path: population_engineers_hurricanes.csv
[[resources]]
name = "population_engineers_hurricanes.csv"
path = "population_engineers_hurricanes.csv"
description = """Data about engineers from https://www.bls.gov/oes/tables.htm. Hurricane data from http://www.nhc.noaa.gov/paststate.shtml. Income data from https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_07_3YR_S1901&prodType=table."""

[[resources.sources]]
title = "Bureau of Labor Statistics"
path = "https://www.bls.gov/oes/tables.htm"

[[resources.sources]]
title = "American Community Survey"
path = "https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_07_3YR_S1901&prodType=table"

[[resources.sources]]
title = "NOAA National Hurricane Center"
path = "http://www.nhc.noaa.gov/paststate.shtml"

# Path: seattle-weather-hourly-normals.csv
[[resources]]
name = "seattle-weather-hourly-normals.csv"
path = "seattle-weather-hourly-normals.csv"
description = """Data from [NOAA](https://www.ncdc.noaa.gov/cdo-web/datatools/normals). Hourly weather normals with metric units. The 1981-2010 Climate Normals are NCDC's three-decade averages of climatological variables, including temperature and precipitation. Learn more in the [documentation](https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/NORMAL_HLY_documentation.pdf). We only included temperature, wind, and pressure and updated the format to be easier to parse."""

[[resources.sources]]
title = "NOAA National Climatic Data Center (NCDC)"
path = "https://www.ncdc.noaa.gov/cdo-web/datatools/normals"

# Path: seattle-weather.csv
[[resources]]
name = "seattle-weather.csv"
path = "seattle-weather.csv"
description = """Data from [NOAA](https://www.ncdc.noaa.gov/cdo-web/datatools/records). Daily weather records with metric units. Transformed using `/scripts/weather.py`. We synthesized the categorical \"weather\" field from multiple fields in the original dataset. This data is intended for instructional purposes."""

[[resources.sources]]
title = "NOAA National Climatic Data Center"
path = "https://www.ncdc.noaa.gov/cdo-web/datatools/records"

# Path: sp500-2000.csv
[[resources]]
name = "sp500-2000.csv"
path = "sp500-2000.csv"
description = """S&P 500 index values from 2000 to 2020, retrieved from [Yahoo Finance](https://finance.yahoo.com/quote/%5EDJI/history/)."""

[[resources.sources]]
title = "Yahoo Finance"
path = "https://finance.yahoo.com/quote/%5EDJI/history/"


# Path: sp500.csv
[[resources]]
name = "sp500.csv"
path = "sp500.csv"

# Path: stocks.csv
[[resources]]
name = "stocks.csv"
path = "stocks.csv"

# Path: udistrict.json
[[resources]]
name = "udistrict.json"
path = "udistrict.json"

# Path: unemployment-across-industries.json
[[resources]]
name = "unemployment-across-industries.json"
path = "unemployment-across-industries.json"
description = """Industry-level unemployment statistics from the [Current Population Survey](https://www.census.gov/programs-surveys/cps.html) (CPS), published monthly by the U.S. Bureau of Labor Statistics. Includes unemployed persons and unemployment rate across 11 private industries, as well as agricultural, government, and self-employed workers. Covers January 2000 through February 2010. Industry classification follows format of CPS [Table A-31](https://www.bls.gov/web/empsit/cpseea31.htm).
Each entry in the JSON file contains:
- `series`: Industry name
- `year`: Year (2000-2010)
- `month`: Month (1-12)
- `count`: Number of unemployed persons (in thousands)
- `rate`: Unemployment rate (percentage)
- `date`: [ISO 8601](https://www.iso.org/iso-8601-date-and-time-format.html)-formatted date string (e.g., \"2000-01-01T08:00:00.000Z\")
The dataset can be replicated using the BLS API. For more, see the `scripts` folder of this repository.
The BLS Web site states:
> \"Users of the public API should cite the date that data were accessed or retrieved using the API. Users must clearly state that “BLS.gov cannot vouch for the data or analyses derived from these data after the data have been retrieved from BLS.gov.” The BLS.gov logo may not be used by persons who are not BLS employees or on products (including web pages) that are not BLS-sponsored.\"
See full BLS [terms of service](https://www.bls.gov/developers/termsOfService.htm)."""

[[resources.sources]]
title = "U.S. Census Bureau Current Population Survey"
path = "https://www.census.gov/programs-surveys/cps.html"

[[resources.sources]]
title = "BLS Local Area Unemployment Statistics"
path = "https://www.bls.gov/lau/"

[[resources.sources]]
title = "BLS LAUS Data Tools"
path = "https://www.bls.gov/lau/data.htm"

[[resources.sources]]
title = "Bureau of Labor Statistics Table A-31"
path = "https://www.bls.gov/web/empsit/cpseea31.htm"

# Path: unemployment.tsv
[[resources]]
name = "unemployment.tsv"
path = "unemployment.tsv"
description = """This dataset contains county-level unemployment rates in the United States, with data generally consistent with levels reported in 2009. The dataset is structured as tab-separated values with two columns:
1. `id`: The combined [state and county FIPS code](https://www.census.gov/library/reference/code-lists/ansi.html)
2. `rate`: The unemployment rate for the county
The unemployment rate represents the number of unemployed persons as a percentage of the labor force. According to the [Bureau of Labor Statistics (BLS) glossary](https://www.bls.gov/opub/hom/glossary.htm#U):
> Unemployed persons (Current Population Survey) [are] persons aged 16 years and older who had no employment during the reference week, were available for work, except for temporary illness, and had made specific efforts to find employment sometime during the 4-week period ending with the reference week. Persons who were waiting to be recalled to a job from which they had been laid off need not have been looking for work to be classified as unemployed.
The labor force includes all persons classified as employed or unemployed in accordance with the BLS definitions.
This dataset is derived from the [Local Area Unemployment Statistics (LAUS)](https://www.bls.gov/lau/) program, a federal-state cooperative effort overseen by the Bureau of Labor Statistics (BLS). The LAUS program produces monthly and annual employment, unemployment, and labor force data for census regions and divisions, states, counties, metropolitan areas, and many cities and towns.
For the most up-to-date LAUS data:
1. **Monthly and Annual Data Downloads**: 
   - Visit the [LAUS Data Tools](https://www.bls.gov/lau/data.htm) page for [monthly](https://www.bls.gov/lau/tables.htm#mcounty) and [annual](https://www.bls.gov/lau/tables.htm#cntyaa) county data.
2. **BLS Public Data API**:
   - The BLS provides an [API for developers](https://www.bls.gov/developers/) to access various datasets, including LAUS data.
   - To use the API for LAUS data, refer to the [LAUS Series ID Formats](https://www.bls.gov/help/hlpforma.htm#LA) to construct your query.
   - API documentation and examples are available on the [BLS Developers](https://www.bls.gov/developers/) page.
When using BLS public data API and datasets, users should adhere to the [BLS Terms of Service](https://www.bls.gov/developers/termsOfService.htm), which includes the following guidelines:
1. Cite the date that data were accessed or retrieved.
2. Acknowledge that \"BLS.gov cannot vouch for the data or analyses derived from these data after the data have been retrieved from BLS.gov.\"
3. Do not use the BLS logo without permission.
For detailed methodology and technical information about LAUS estimates, refer to the [BLS Handbook of Methods](https://www.bls.gov/opub/hom/lau/home.htm)."""

[[resources.sources]]
title = "BLS Handbook of Methods"
path = "https://www.bls.gov/opub/hom/lau/home.htm"

[[resources.sources]]
title = "BLS Developers API"
path = "https://www.bls.gov/developers/"

# Path: uniform-2d.json
[[resources]]
name = "uniform-2d.json"
path = "uniform-2d.json"

# Path: us-10m.json
[[resources]]
name = "us-10m.json"
path = "us-10m.json"

# Path: us-employment.csv
[[resources]]
name = "us-employment.csv"
path = "us-employment.csv"
description = """In the mid 2000s the global economy was hit by a crippling recession. One result: Massive job losses across the United States. The downturn in employment, and the slow recovery in hiring that followed, was tracked each month by the [Current Employment Statistics](https://www.bls.gov/ces/) program at the U.S. Bureau of Labor Statistics.
This file contains the monthly employment total in a variety of job categories from January 2006 through December 2015. The numbers are seasonally adjusted and reported in thousands. The data were downloaded on Nov. 11, 2018, and reformatted for use in this library.
Totals are included for the [22 \"supersectors\"](https://download.bls.gov/pub/time.series/ce/ce.supersector) tracked by the BLS. The \"nonfarm\" total is the category typically used by economists and journalists as a stand-in for the country's employment total.
A calculated \"nonfarm_change\" column has been appended with the month-to-month change in that supersector's employment. It is useful for illustrating how to make bar charts that report both negative and positive values."""

[[resources.sources]]
title = "U.S. Bureau of Labor Statistics Current Employment Statistics"
path = "https://www.bls.gov/ces/"

[[resources.sources]]
title = "BLS Supersectors"
path = "https://download.bls.gov/pub/time.series/ce/ce.supersector"

# Path: us-state-capitals.json
[[resources]]
name = "us-state-capitals.json"
path = "us-state-capitals.json"

# Path: volcano.json
[[resources]]
name = "volcano.json"
path = "volcano.json"
description = """Maunga Whau (Mt Eden) is one of about 50 volcanos in the Auckland volcanic field. This data set gives topographic information for Maunga Whau on a 10m by 10m grid. Digitized from a topographic map by Ross Ihaka, adapted from [R datasets](https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/volcano.html). These data should not be regarded as accurate."""

[[resources.sources]]
title = "R Datasets"
path = "https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/volcano.html"

# Path: weather.csv
[[resources]]
name = "weather.csv"
path = "weather.csv"
description = """Data from [NOAA](http://www.ncdc.noaa.gov/cdo-web/datatools/findstation). Transformed using `/scripts/weather.py`. We synthesized the categorical \"weather\" field from multiple fields in the original dataset. This data is intended for instructional purposes."""

[[resources.sources]]
title = "NOAA Climate Data Online"
path = "http://www.ncdc.noaa.gov/cdo-web/datatools/findstation"

# Path: weather.json
[[resources]]
name = "weather.json"
path = "weather.json"
description = """Instructional dataset showing actual and predicted temperature data."""

# Path: wheat.json
[[resources]]
name = "wheat.json"
path = "wheat.json"
description = """In an 1822 letter to Parliament, [William Playfair](https://en.wikipedia.org/wiki/William_Playfair), a Scottish engineer who is often credited as the founder of statistical graphics, published [an elegant chart on the price of wheat](http://dh101.humanities.ucla.edu/wp-content/uploads/2014/08/Vis_2.jpg). It plots 250 years of prices alongside weekly wages and the reigning monarch. He intended to demonstrate that “never at any former period was wheat so cheap, in proportion to mechanical labour, as it is at the present time.”"""

[[resources.sources]]
title = "1822 Playfair Chart"
path = "http://dh101.humanities.ucla.edu/wp-content/uploads/2014/08/Vis_2.jpg"

# Path: windvectors.csv
[[resources]]
name = "windvectors.csv"
path = "windvectors.csv"
description = """Simulated wind patterns over northwestern Europe."""

# Path: world-110m.json
[[resources]]
name = "world-110m.json"
path = "world-110m.json"

# Path: zipcodes.csv
[[resources]]
name = "zipcodes.csv"
path = "zipcodes.csv"
description = """GeoNames.org"""

[[resources.sources]]
title = "GeoNames"
path = "https://www.geonames.org"

@domoritz
Member

domoritz commented Dec 7, 2024

Since name and path are the same, can we have only one in the file designed for human editing? Also, I like that some files have column descriptions. Do we merge those into the data package?

@dangotbanned
Member

(#634 (comment))
Looking good overall @dsmedia 😄

> Since name and path are the same, can we have only one in the file designed for human editing?

I agree with @domoritz: just `path` would be enough, since we derive `name` from it.
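
As a rough sketch of that derivation, assuming `name` is simply a copy of `path` (which matches every entry posted above), something like the following would do. The helper `with_derived_name` is hypothetical and is not taken from the build script.

```python
def with_derived_name(resource: dict) -> dict:
    """Return a copy of the resource with `name` filled in from `path` if absent."""
    filled = dict(resource)  # leave the hand-edited entry untouched
    filled.setdefault("name", filled["path"])
    return filled


resource = with_derived_name({"path": "annual-precip.json"})
print(resource)
```

An explicit `name` still wins, so an entry can override the default if it ever needs to.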

There seems to be what I'm assuming is an escaping issue with this description:

[Image: burtin.json warning]

I can also see the benefit of having a comment for editing, but would strongly recommend moving it onto the same line as `[[resources]]`.
Hopefully you can see the code folding issue in the video; that is easy to resolve:

2024-12-07.19-57-46.mp4

Besides that, if you wanna open a PR, I can help with reducing the duplication in the description fields.
I think it would make sense to refer to sources in the description by source.title (instead of the link).

@dsmedia
Collaborator Author

dsmedia commented Dec 7, 2024

Sounds good on comments, and will do on the PR.

Can you clarify with an example what you mean by your last point on referring to sources in the description?

Also, some of the resource descriptions currently include column information. The resource spec has a special place for those (called a table schema), but I was thinking we could leave the column descriptions where they are (just within the resource description field) for now, and migrate those into a resource table schema in a future PR. Or we could do that now as well.

** The table schema will of course still be part of the script-generated JSON, but we'd need to merge it with the column descriptions in the TOML file to have them fully integrated.
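
To make that merge concrete, here is a minimal sketch. It assumes the script-generated resource carries a `schema.fields` list and that the column descriptions are available as a plain name-to-text mapping. The helper `merge_column_descriptions`, that mapping, and the field types are all hypothetical and only for illustration.

```python
from copy import deepcopy


def merge_column_descriptions(generated: dict, descriptions: dict[str, str]) -> dict:
    """Copy the generated resource and attach a description to each known field."""
    merged = deepcopy(generated)  # do not mutate the script output
    for field in merged.get("schema", {}).get("fields", []):
        text = descriptions.get(field["name"])
        if text:
            field["description"] = text
    return merged


# Script-generated part (field types here are assumed).
generated = {
    "path": "unemployment.tsv",
    "schema": {
        "fields": [
            {"name": "id", "type": "integer"},
            {"name": "rate", "type": "number"},
        ]
    },
}

# Column descriptions as they might be lifted out of the description text.
descriptions = {
    "id": "The combined state and county FIPS code",
    "rate": "The unemployment rate for the county",
}

merged = merge_column_descriptions(generated, descriptions)
print(merged["schema"]["fields"])
```

Columns without a hand-written description are left exactly as the script produced them.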

@dangotbanned
Member

> Can you clarify with an example what you mean by your last point on referring to sources in the description?
@dsmedia

Sure, I thought we were talking about the same thing here:

Content Handling

  • Some content may be duplicated across description, source, and license fields

Future Improvements

  • Deduplicate content between description and source/license fields

As an example, in the section below:

# Path: annual-precip.json
[[resources]]
name = "annual-precip.json"
path = "annual-precip.json"
description = """A raster grid of global annual precipitation for the year 2016 at a resolution 1 degree of lon/lat per cell, from [CFSv2](https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/climate-forecast-system-version2-cfsv2)."""
#                                                                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[[resources.sources]]
title = "Climate Forecast System Version 2"
#       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
path = "https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/climate-forecast-system-version2-cfsv2"
#      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I would rewrite this as:

[[resources]] # Path: annual-precip.json
path = "annual-precip.json"
description = "A raster grid of global annual precipitation for the year 2016 at a resolution 1 degree of lon/lat per cell."

[[resources.sources]]
title = "Climate Forecast System Version 2"
path = "https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/climate-forecast-system-version2-cfsv2"

Or if there are cases where you need to refer to a source mid-sentence, use the resources.sources.title key instead of the link.
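
A rough sketch of how that substitution could be automated, assuming the descriptions use standard markdown links and that a link counts as duplicated when its URL equals a source's `path`. The helper `link_to_title` is hypothetical, and the regular expression does not handle URLs that contain parentheses.

```python
import re

# Matches [text](url) where the url has no spaces or closing parenthesis.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def link_to_title(description: str, sources: list[dict]) -> str:
    """Replace markdown links that duplicate a source with that source's title."""
    titles = {source["path"]: source["title"] for source in sources}

    def swap(match: re.Match) -> str:
        url = match.group(2)
        # Links that are not listed under sources are kept unchanged.
        return titles.get(url, match.group(0))

    return MD_LINK.sub(swap, description)


sources = [
    {
        "title": "Climate Forecast System Version 2",
        "path": "https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/climate-forecast-system-version2-cfsv2",
    }
]
description = (
    "A raster grid of global annual precipitation for the year 2016, from "
    "[CFSv2](https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/climate-forecast-system-version2-cfsv2)."
)
rewritten = link_to_title(description, sources)
print(rewritten)
```

Links that do not match any source are left in place, so those can still be reviewed by hand.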

dsmedia added a commit to dsmedia/vega-datasets that referenced this issue Dec 8, 2024
- Add SOURCES.toml to provide supplemental (extrinsic) metadata on datasets, from SOURCES.md, in a form usable by build_datapackage.py
- Include resource descriptions, sources, and licenses to supplement script output
- Preserve existing markdown content for future documentation
- TODO: Remove duplicated content between descriptions and sources
- TODO: Incorporate resource-level column descriptions into table schema, where available
- TODO: determine if root-level $schema property should be specified in the TOML file with the value "https://datapackage.org/profiles/2.0/datapackage.json" per Frictionless Data guidelines

Resolves vega#634
dangotbanned added a commit to dsmedia/vega-datasets that referenced this issue Dec 13, 2024