Additional metadata for datasets #629
Comments
I'm not following. What is the enhancement you are asking for in this repo? Should we provide a file with additional metadata about the datasets? Maybe a JSON? I'm not sure what would be in that file that would be generic enough that it shouldn't live in Altair.
I'm wondering if sources.md could still be helpful here. As background, we're making steady progress in cataloging the remaining undocumented datasets in this repo's sources.md file. Once those are done, it would probably make sense to reformat that file into a more standard style such that each dataset had, at a minimum, clearly defined details about column names, column types, licensing, a generation script path (if available), etc. Best efforts would be made to keep this metadata up to date when datasets were significantly modified or new datasets were added.

Just thinking out loud, but instead of directly maintaining the sources.md file, we could keep the dataset metadata in a JSON or YAML file and generate sources.md from this machine-readable format. We could consider following a standardized format like Frictionless Data Package. I could see a case that such a process would benefit this repo even apart from the issue raised here. And if a machine-readable file made it easier for downstream projects to work with these datasets, then perhaps all the better?
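To make the "generate sources.md from a machine-readable file" idea concrete, here is a minimal stdlib-only sketch; the function name, the Markdown layout, and the sample input are all hypothetical, and the real datapackage would carry many more properties:

```python
import json

def render_sources_md(package: dict) -> str:
    """Render one Markdown section per resource in a datapackage-style dict."""
    lines = []
    for res in package.get("resources", []):
        lines.append(f"## {res['name']}")
        lines.append(f"Path: `{res['path']}`")
        fields = res.get("schema", {}).get("fields", [])
        if fields:
            # column-name / column-type table, as proposed above
            lines.append("| column | type |")
            lines.append("| --- | --- |")
            for f in fields:
                lines.append(f"| {f['name']} | {f['type']} |")
        lines.append("")
    return "\n".join(lines)

# tiny example input, shaped like a Frictionless datapackage
package = json.loads("""
{"resources": [{"name": "airports", "path": "airports.csv",
  "schema": {"fields": [{"name": "iata", "type": "string"}]}}]}
""")
print(render_sources_md(package))
```

A CI job could regenerate sources.md from the JSON on every change, so the human-readable and machine-readable views never drift apart.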
I helped with some early versions of frictionless data packages. That may be a good fit.
@dsmedia yeah, this seems like exactly the kind of standardized format that would be helpful for us. As I mentioned, I'm happy to help contribute towards the goal of migrating the contents of sources.md. Edit: Reading through the Frictionless Data Package standard, it seems like the perfect fit to me. https://framework.frictionlessdata.io/docs/console/overview.html
I've played around with the `frictionless` CLI. Using just the defaults managed to produce some fairly useful metadata. Essentially, I just did this for each file format:

```
frictionless describe data/*.json --yaml > json-out.yaml
```

Inferred output:
resources:
- name: airports
type: table
path: airports.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: iata
type: string
- name: name
type: string
- name: city
type: string
- name: state
type: string
- name: country
type: string
- name: latitude
type: number
- name: longitude
type: number
- name: birdstrikes
type: table
path: birdstrikes.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: Airport Name
type: string
- name: Aircraft Make Model
type: string
- name: Effect Amount of damage
type: string
- name: Flight Date
type: date
- name: Aircraft Airline Operator
type: string
- name: Origin State
type: string
- name: Phase of flight
type: string
- name: Wildlife Size
type: string
- name: Wildlife Species
type: string
- name: Time of day
type: string
- name: Cost Other
type: integer
- name: Cost Repair
type: integer
- name: Cost Total $
type: integer
- name: Speed IAS in knots
type: integer
- name: co2-concentration
type: table
path: co2-concentration.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: Date
type: date
- name: CO2
type: number
- name: adjusted CO2
type: number
- name: disasters
type: table
path: disasters.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: Entity
type: string
- name: Year
type: integer
- name: Deaths
type: integer
- name: flights-airport
type: table
path: flights-airport.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: origin
type: string
- name: destination
type: string
- name: count
type: integer
- name: gapminder-health-income
type: table
path: gapminder-health-income.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: country
type: string
- name: income
type: integer
- name: health
type: number
- name: population
type: integer
- name: region
type: string
- name: github
type: table
path: github.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: time
type: string
- name: count
type: integer
- name: global-temp
type: table
path: global-temp.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: year
type: integer
- name: temp
type: number
- name: iowa-electricity
type: table
path: iowa-electricity.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: year
type: date
- name: source
type: string
- name: net_generation
type: integer
- name: la-riots
type: table
path: la-riots.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: first_name
type: string
- name: last_name
type: string
- name: age
type: integer
- name: gender
type: string
- name: race
type: string
- name: death_date
type: date
- name: address
type: string
- name: neighborhood
type: string
- name: type
type: string
- name: longitude
type: number
- name: latitude
type: number
- name: lookup_groups
type: table
path: lookup_groups.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: group
type: integer
- name: person
type: string
- name: lookup_people
type: table
path: lookup_people.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: name
type: string
- name: age
type: integer
- name: height
type: integer
- name: population_engineers_hurricanes
type: table
path: population_engineers_hurricanes.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: state
type: string
- name: id
type: integer
- name: population
type: integer
- name: engineers
type: number
- name: hurricanes
type: integer
- name: seattle-weather-hourly-normals
type: table
path: seattle-weather-hourly-normals.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: date
type: datetime
- name: pressure
type: number
- name: temperature
type: number
- name: wind
type: number
- name: seattle-weather
type: table
path: seattle-weather.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: date
type: date
- name: precipitation
type: number
- name: temp_max
type: number
- name: temp_min
type: number
- name: wind
type: number
- name: weather
type: string
- name: sp500-2000
type: table
path: sp500-2000.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: date
type: date
- name: open
type: number
- name: high
type: number
- name: low
type: number
- name: close
type: number
- name: adjclose
type: number
- name: volume
type: integer
- name: sp500
type: table
path: sp500.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: date
type: string
- name: price
type: number
- name: stocks
type: table
path: stocks.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: symbol
type: string
- name: date
type: string
- name: price
type: number
- name: us-employment
type: table
path: us-employment.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: month
type: date
- name: nonfarm
type: integer
- name: private
type: integer
- name: goods_producing
type: integer
- name: service_providing
type: integer
- name: private_service_providing
type: integer
- name: mining_and_logging
type: integer
- name: construction
type: integer
- name: manufacturing
type: integer
- name: durable_goods
type: integer
- name: nondurable_goods
type: integer
- name: trade_transportation_utilties
type: integer
- name: wholesale_trade
type: number
- name: retail_trade
type: number
- name: transportation_and_warehousing
type: number
- name: utilities
type: number
- name: information
type: integer
- name: financial_activities
type: integer
- name: professional_and_business_services
type: integer
- name: education_and_health_services
type: integer
- name: leisure_and_hospitality
type: integer
- name: other_services
type: integer
- name: government
type: integer
- name: nonfarm_change
type: integer
- name: weather
type: table
path: weather.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: location
type: string
- name: date
type: date
- name: precipitation
type: number
- name: temp_max
type: number
- name: temp_min
type: number
- name: wind
type: number
- name: weather
type: string
- name: windvectors
type: table
path: windvectors.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: longitude
type: number
- name: latitude
type: number
- name: dir
type: integer
- name: dirCat
type: integer
- name: speed
type: number
- name: zipcodes
type: table
path: zipcodes.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
schema:
fields:
- name: zip_code
type: integer
- name: latitude
type: number
- name: longitude
type: number
- name: city
type: string
- name: state
type: string
- name: county
type: string
- name: annual-precip
type: json
path: annual-precip.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: anscombe
type: json
path: anscombe.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: barley
type: json
path: barley.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: budget
type: json
path: budget.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: budgets
type: json
path: budgets.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: burtin
type: json
path: burtin.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: cars
type: json
path: cars.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: countries
type: json
path: countries.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: crimea
type: json
path: crimea.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: driving
type: json
path: driving.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: earthquakes
type: json
path: earthquakes.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: flare-dependencies
type: json
path: flare-dependencies.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: flare
type: json
path: flare.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: flights-10k
type: json
path: flights-10k.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: flights-200k
type: json
path: flights-200k.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: flights-20k
type: json
path: flights-20k.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: flights-2k
type: json
path: flights-2k.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: flights-5k
type: json
path: flights-5k.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: football
type: json
path: football.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: gapminder
type: json
path: gapminder.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: income
type: json
path: income.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: jobs
type: json
path: jobs.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: londonboroughs
type: json
path: londonBoroughs.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: londoncentroids
type: json
path: londonCentroids.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: londontubelines
type: json
path: londonTubeLines.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: miserables
type: json
path: miserables.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: monarchs
type: json
path: monarchs.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: movies
type: json
path: movies.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: normal-2d
type: json
path: normal-2d.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: obesity
type: json
path: obesity.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: ohlc
type: json
path: ohlc.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: penguins
type: json
path: penguins.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: platformer-terrain
type: json
path: platformer-terrain.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: points
type: json
path: points.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: political-contributions
type: json
path: political-contributions.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: population
type: json
path: population.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: udistrict
type: json
path: udistrict.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: unemployment-across-industries
type: json
path: unemployment-across-industries.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: uniform-2d
type: json
path: uniform-2d.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: us-10m
type: json
path: us-10m.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: us-state-capitals
type: json
path: us-state-capitals.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: volcano
type: json
path: volcano.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: weather
type: json
path: weather.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: wheat
type: json
path: wheat.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: world-110m
type: json
path: world-110m.json
scheme: file
format: json
mediatype: text/json
encoding: utf-8
- name: flights-3m
type: table
path: flights-3m.parquet
scheme: file
format: parquet
mediatype: application/parquet
schema:
fields:
- name: date
type: integer
- name: delay
type: integer
- name: distance
type: integer
- name: origin
type: string
- name: destination
type: string
- name: 7zip
type: file
path: 7zip.png
scheme: file
format: png
mediatype: image/png
encoding: utf-8
- name: ffox
type: file
path: ffox.png
scheme: file
format: png
mediatype: image/png
encoding: utf-8
- name: gimp
type: file
path: gimp.png
scheme: file
format: png
mediatype: image/png
encoding: utf-8
- name: unemployment
type: table
path: unemployment.tsv
scheme: file
format: tsv
mediatype: text/tsv
encoding: utf-8
dialect:
csv:
delimiter: "\t"
schema:
fields:
- name: id
type: integer
- name: rate
type: number

@dsmedia have you used this package in the past?
Oh nice. Yeah, I would love to have a more machine-readable format for the datasets here. Too bad the CLI doesn't deal that well with JSON files.
I filed frictionlessdata/frictionless-py#1712 to get JSON support. It's probably easier to contribute that than to create the schemas by hand here.
Oh @domoritz thanks for opening the issue! I'll post the updated output.
@domoritz as promised in #629 (comment), updated output:
Nice. Let's add this to the repo. We should have a script to generate it, though, so I can make sure it stays up to date. |
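As a rough illustration of what such a generation script does under the hood, here is a stdlib-only sketch of naive column-type inference; this is not the actual frictionless logic (the real `build_datapackage.py` would call the `frictionless` library), and the function names and sample data are assumptions:

```python
import csv
import io

def infer_type(values):
    """Crudely infer a Frictionless-style field type from string values."""
    def all_match(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_match(int):
        return "integer"
    if all_match(float):
        return "number"
    return "string"

def describe_csv(text, name, path):
    """Build a minimal datapackage-style resource dict from CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose rows into columns
    fields = [{"name": h, "type": infer_type(col)}
              for h, col in zip(header, columns)]
    return {"name": name, "path": path, "format": "csv",
            "schema": {"fields": fields}}

sample = "id,rate\n1000,0.097\n1001,0.091\n"
print(describe_csv(sample, "unemployment", "sample.csv"))
```

Because inference like this is deterministic, re-running the script in CI and failing on diffs (as the PR below ends up doing) keeps the generated metadata honest.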
Closes #629

- Adds `build_datapackage.py` script
- Generates initial metadata in both `.yaml` and `.json` formats
* feat: generate `frictionless` data package metadata

  Closes vega#629
  - Adds `build_datapackage.py` script
  - Generates initial metadata in both `.yaml` and `.json` formats
* feat(typing): spell mappings more explicitly

  Improves readability, but mainly to support per-resource `description`, `sources`, `licenses`. See vega#631 (comment)
* refactor(ruff): misc linting
* ci: change default output to `json`, fix missing `contributors` (vega#631 (review), vega#631 (comment))
* feat: add support for `.arrow` (vega#631 (comment), vega#631 (comment))
* feat(DRAFT): add `.with_extras()`, for `description`, `source`, and `license`

  Unused currently, depends on having a more structured `SOURCES.md`. vega#631 (comment)
* add data package to build step
* ci: add uv
* ci: fix uv setup
* ci: fail if there are changes
* just kidding (timestamps change things)
* chore: update pr template

Co-authored-by: Dominik Moritz <[email protected]>
- Adds SOURCES.toml to provide supplemental metadata on resources for build_datapackage.py
- TOML structure designed to sync with Frictionless Data Resource 2.0 specifications
- Includes resource descriptions, sources, and licenses to supplement script output, where available in SOURCES.md
- Preserves existing markdown content for use in human-readable source documentation
- TODO: Remove duplicated content between descriptions and sources
- TODO: Verify compatibility with build_datapackage.py
- TODO: Incorporate resource-level column descriptions into table schema, where currently included
- TODO: Determine if root-level $schema property should be specified in the TOML file with the value "https://datapackage.org/profiles/2.0/datapackage.json" per Frictionless spec

Part of vega#629
Note: Originally posted by @dangotbanned in vega/altair#3631 (comment)
I thought this would be easier to discuss in a new issue over here
@domoritz right now the additional metadata I'm adding is described above, and a preview of `metadata.parquet` is in vega/altair#3631 (comment).

I'm not too concerned about the logic for this living in `altair`. The only static part is `ext_supported` - which is somewhat `altair`/`python`-specific: for the current datasets, this is pretty much just to avoid `.png`. Maybe that could just be a `tabular` flag?

I think we could benefit from other kinds of metadata that may be less trivial to work out on our end: `(Geo|Topo)JSON`
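A `tabular` flag could start as a simple extension check; this is a hypothetical sketch (the extension set is an assumption), and it deliberately punts on `.json`, which is exactly the ambiguity raised above since some JSON files here are GeoJSON/TopoJSON rather than tables:

```python
import pathlib

# extensions that are always tabular in this repo (an assumption);
# .json is excluded because it may be a table or (Geo|Topo)JSON
TABULAR_EXTS = {".csv", ".tsv", ".parquet", ".arrow"}

def is_tabular(path: str) -> bool:
    """Naive extension-based check for a per-resource `tabular` flag."""
    return pathlib.Path(path).suffix.lower() in TABULAR_EXTS

resources = [
    {"name": "airports", "path": "airports.csv"},
    {"name": "7zip", "path": "7zip.png"},
]
for r in resources:
    r["tabular"] = is_tabular(r["path"])
print([(r["name"], r["tabular"]) for r in resources])
# → [('airports', True), ('7zip', False)]
```

Storing the flag in the datapackage itself, rather than hard-coding `ext_supported` downstream, would keep the decision in one place and let JSON files be classified per-resource.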