feat(RFC): Adds `altair.datasets` #3631

dangotbanned · 2024-10-04T18:57:00Z

Status

Waiting on the next vega-datasets release.
Once a stable datapackage.json is available, quite a lot of tools/datasets can be simplified or removed.

3.0.0 Release vega-datasets#654

Description

Provides a minimal but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

- No datasets are included in the package
  - Instead, only a single 18.7KB file, metadata.parquet, is included
  - The file describes all versions of all datasets
    - provided they are accessible via both npm and github
- Strong support for typing
  - Annotations are generated from the metadata itself
  - https://github.com/vega/altair/blob/9e9deeb95668d2c4e7d30311e85a8f9f6acdc88c/altair/datasets/_typing.py
- So far, 4 backends have been implemented, instead of only pandas
  - These provide precise IDE completions, with a lot of help from https://github.com/narwhals-dev/narwhals
- Users can opt in to caching remote dataset requests
  - With the "polars" backend, the slowest load I've had on a cache hit is 0.1s
    - https://cdn.jsdelivr.net/npm/[email protected]/data/flights-200k.json

Examples

These all come from the docstrings of:

- Loader
- Loader.from_backend
- Loader.__call__

import altair as alt
from altair.datasets import Loader

data = Loader.from_backend("polars")
>>> data
Loader[polars]

cars = data("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame

data = Loader.from_backend("pandas")
cars = data("cars")

>>> type(cars)
pandas.core.frame.DataFrame

data = Loader.from_backend("pandas[pyarrow]")
cars = data("cars", tag="v1.29.0")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                       string[pyarrow]
Miles_per_Gallon           double[pyarrow]
Cylinders                   int64[pyarrow]
Displacement               double[pyarrow]
Horsepower                  int64[pyarrow]
Weight_in_lbs               int64[pyarrow]
Acceleration               double[pyarrow]
Year                timestamp[ns][pyarrow]
Origin                     string[pyarrow]
dtype: object

data = Loader.from_backend("pandas")
source = data("stocks", tag="v2.10.0")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')

data = Loader.from_backend("pyarrow")
source = data("stocks", tag="v2.10.0")

>>> source.column_names
['symbol', 'date', 'price']

Tasks

Resolved

Investigate bundling metadata

Investigating bundling metadata (22a5039), (1792340)
- Depending on how well the compression scales, it might be reasonable to include this for some number of versions
- Deliberately including redundant info early on - can always chip away at it later

npm does not have every version available on GitHub

Sources
- npm/vega-datasets
  - Fixed with: https://data.jsdelivr.com/v1/packages/npm/vega-datasets
- https://github.com/vega/vega-datasets/tags
Known missing
feat(DRAFT): Add a source for available npm versions
Need to add some handling to invalidate these entries returned from list-repository-tags once confirmed they cannot be requested from npm
- Can technically request from github, but during testing this was much slower
- Also, these versions would not have been available from https://github.com/altair-viz/vega_datasets, since that only used npm
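
The invalidation step described above amounts to a set comparison between the two sources. A minimal sketch, not the PR's implementation: `split_tags` is a hypothetical helper, the version lists are made-up sample values, and it assumes GitHub tags carry a `v` prefix while npm versions do not.

```python
from __future__ import annotations


def split_tags(
    github_tags: list[str], npm_versions: list[str]
) -> tuple[list[str], list[str]]:
    """Partition GitHub tags into (also on npm, GitHub-only), preserving input order."""
    on_npm = set(npm_versions)
    both: list[str] = []
    github_only: list[str] = []
    for tag in github_tags:
        # Assumption: GitHub tags look like "v2.9.0", npm versions like "2.9.0"
        version = tag.removeprefix("v")
        (both if version in on_npm else github_only).append(tag)
    return both, github_only


# Illustrative values only, not the real release history
github_tags = ["v2.10.0", "v2.9.0", "v2.8.1", "v2.8.0"]
npm_versions = ["2.10.0", "2.9.0", "2.8.0"]

usable, invalidated = split_tags(github_tags, npm_versions)
print(usable)       # ['v2.10.0', 'v2.9.0', 'v2.8.0']
print(invalidated)  # ['v2.8.1']
```

Entries in the second list are the ones that would need invalidating, since they could only be requested from github.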

Plan strategy for user-configurable dataset cache

Everything so far has been building the tools for a compact bundled index
- 1, 2, 3, 4, 5
- Refreshing the index would not be included in altair, each release would simply ship with changes baked in
Trying to avoid bloating altair package size with datasets
User-facing
- Goal of requesting each unique dataset version once
  - The user cache would not need to be updated between altair versions
- Some kind of opt-in config to say store the datasets in this directory please
  - Basic solution would be defining an env variable like ALTAIR_DATASETS_DIR
  - When not provided, always perform remote requests
    - User motivation would be that it would be faster to enable caching

Deferred

Reducing cache footprint

e.g. storing the .(csv|tsv|json) files as .parquet
Need to do more testing on this though to ensure
- the shape of each dataset is preserved
- where relevant - intentional errors remain intact

Investigate providing a decorator to add a backend

Will be trivial for the user-side, since they don't need to be concerned about imports
Just need to provide these attributes:
- _name: LiteralString
- _read_fn: dict[Extension, Callable[..., IntoDataFrameT]]
- _scan_fn: dict[_ExtensionScan, Callable[..., IntoFrameT]]

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`

How many datasets, size (per & total)?
What version range does a given sha cover?
Blocked: Running into issues with
- pandas/pyarrow group_by warnings
- min and max return all nulls in pl.Enum pola-rs/polars#18394
- Missing nw.Expr.(first|last)
- nw.Expr.(head|tail)(1) not equivalent in a group_by().agg(...) context
  - pandas -> scalar
  - polars -> list
- pl.Enum translating to non-ordered pd.Categorical

polars-native solution

from __future__ import annotations

from pathlib import Path

import polars as pl
from altair.datasets import Loader, _readers

data = Loader.from_backend("polars")

# NOTE: Enable caching, populate with some responses
data.cache_dir = Path.home() / ".altair_cache"
data("cars")
data("cars", tag="v1.5.0")
data("movies")
data("movies", tag="v1.24.0")
data("jobs")


if cache_dir := data.cache_dir:
    cached_stems: tuple[str, ...] = tuple(fp.stem for fp in cache_dir.iterdir())
else:
    msg = "Datasets cache unset"
    raise TypeError(msg)

# NOTE: Lots of redundancies, many urls point to the same data (sha)
>>> pl.read_parquet(_readers._METADATA).shape
# (2879, 9)

# NOTE: Version range per sha
tag_sort: pl.Expr = pl.col("tag").sort()
tag_range: pl.Expr = pl.concat_str(tag_sort.first(), tag_sort.last(), separator=" - ")

# NOTE: Producing a name only when the file is already in the cache
file_name: pl.Expr = pl.when(pl.col("sha").is_in(cached_stems)).then(
    pl.concat_str("sha", "suffix")
)

cache_summary: pl.DataFrame = (
    pl.scan_parquet(_readers._METADATA)
    .group_by("dataset_name", "suffix", "size", "sha")
    .agg(tag_range=tag_range)
    .select(pl.exclude("sha"), file_name=file_name)
    .sort("dataset_name", "size")
    .collect()
)

>>> cache_summary.shape
# (116, 5)

>>> cache_summary.head(10)

shape: (10, 5)
┌───────────────┬────────┬─────────┬───────────────────┬─────────────────────────────────┐
│ dataset_name  ┆ suffix ┆ size    ┆ tag_range         ┆ file_name                       │
│ ---           ┆ ---    ┆ ---     ┆ ---               ┆ ---                             │
│ str           ┆ str    ┆ i64     ┆ str               ┆ str                             │
╞═══════════════╪════════╪═════════╪═══════════════════╪═════════════════════════════════╡
│ 7zip          ┆ .png   ┆ 3969    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ airports      ┆ .csv   ┆ 210365  ┆ v1.5.0 - v2.10.0  ┆ 608ba6d51fa70584c3fa1d31eb9453… │
│ annual-precip ┆ .json  ┆ 266265  ┆ v1.29.0 - v2.10.0 ┆ null                            │
│ anscombe      ┆ .json  ┆ 1703    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ barley        ┆ .json  ┆ 8487    ┆ v1.5.0 - v2.10.0  ┆ 8dc50de2509b6e197ce95c24c98f90… │
│ birdstrikes   ┆ .csv   ┆ 1223329 ┆ v2.0.0 - v2.10.0  ┆ null                            │
│ birdstrikes   ┆ .json  ┆ 4183924 ┆ v1.5.0 - v1.31.1  ┆ null                            │
│ budget        ┆ .json  ┆ 374289  ┆ v1.5.0 - v2.8.1   ┆ null                            │
│ budget        ┆ .json  ┆ 391353  ┆ v2.9.0 - v2.10.0  ┆ null                            │
│ budgets       ┆ .json  ┆ 18079   ┆ v1.5.0 - v2.10.0  ┆ 8a909e24f698a3b0f6c637c30ec95e… │
└───────────────┴────────┴─────────┴───────────────────┴─────────────────────────────────┘
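
The tag_range column only makes sense if tags sort in release order; a plain string sort would put v2.10.0 before v2.9.0. The same per-sha grouping can be sketched with the standard library and an explicit version key. `version_key` and `tag_ranges` are hypothetical helpers, and the rows below are made up with placeholder shas.

```python
from __future__ import annotations

from collections import defaultdict


def version_key(tag: str) -> tuple[int, ...]:
    """Parse a tag like "v2.10.0" into (2, 10, 0) so it compares numerically."""
    return tuple(int(part) for part in tag.removeprefix("v").split("."))


def tag_ranges(rows: list[tuple[str, str]]) -> dict[str, str]:
    """Map each sha to "first - last" over the tags that share it."""
    by_sha: defaultdict[str, list[str]] = defaultdict(list)
    for sha, tag in rows:
        by_sha[sha].append(tag)
    ranges = {}
    for sha, tags in by_sha.items():
        ordered = sorted(tags, key=version_key)
        ranges[sha] = f"{ordered[0]} - {ordered[-1]}"
    return ranges


# Made-up (sha, tag) pairs; the shas are placeholders
rows = [
    ("aaa", "v2.10.0"),
    ("aaa", "v2.9.0"),
    ("bbb", "v2.8.1"),
    ("bbb", "v1.5.0"),
    ("bbb", "v1.29.0"),
]
print(tag_ranges(rows))
# {'aaa': 'v2.9.0 - v2.10.0', 'bbb': 'v1.5.0 - v2.8.1'}
print(sorted(["v2.10.0", "v2.9.0"]))  # ['v2.10.0', 'v2.9.0'] (lexicographic order)
```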

dangotbanned · 2024-11-24T10:16:33Z

@jonmmease I just tried updating this branch, seems to be some vegafusion issues?

9d97096 (#3631)

Update

Resolved in #3702

Switching to one with a timestamp that `frictionless` recognises https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L2674-L2689 https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L45-L57

dangotbanned added 6 commits October 2, 2024 22:13

wip

7933771

feat(DRAFT): Minimal reimplementation

b30081e

refactor: Make version accessible via data.source_tag

279586b

- Allow quickly switching between version tags #3150 (comment)

refactor: ext_fn -> Dataset.read_fn

32150ad

docs: Add trailing docs to long literals

f1d18a2

docs: Add module-level doc

4d3c550

dangotbanned added the maintenance label Oct 4, 2024

dangotbanned added 23 commits October 4, 2024 20:15

Merge branch 'main' into vega-datasets

7e65841

Merge branch 'main' into vega-datasets

05773af

Merge branch 'main' into vega-datasets

4fff80a

feat: Adds .arrow support

3a284a5

To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)

feat: Add support for caching metadata

22a5039

feat: Support env var VEGA_GITHUB_TOKEN

a618ffc

Not required for these requests, but may be helpful to avoid limits

feat: Add support for multi-version metadata

1792340

As an example, for comparing against the most recent I've added the 5 most recent

refactor: Renaming, docs, reorganize

fa2c9e7

feat: Support collecting release tags

24cd7d7

See https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags

feat: Adds refresh_tags

7dd461f

- Basic mechanism for discovering new versions - Tries to minimise number of and total size of requests

feat(DRAFT): Adds url_from

9768495

Experimenting with querying the url cache w/ expressions

fix: Wrap all requests with auth

c38c235

chore: Remove DATASET_NAMES_USED

a22cc8a

feat: Major GitHub rewrite, handle rate limiting

1181860

- `metadata_full.parquet` stores **all known** file metadata - `GitHub.refresh()` to maintain integrity in a safe manner - Roughly 3000 rows - Single release: **9kb** vs 46 releases: **21kb**

feat(DRAFT): Partial implement data("name")

31eeb20

fix(typing): Resolve some mypy errors

511a845

Merge branch 'main' into vega-datasets

c76cfd4

Merge branch 'main' into vega-datasets

d3f0497

Merge branch 'main' into vega-datasets

1b3390b

fix(ruff): Apply 3.8 fixes

a770ba9

https://github.com/vega/altair/actions/runs/11495437283/job/31994955413

docs(typing): Add WorkInProgress marker to data(...)

686a485

- Still undecided exactly how this functionality should work - Need to resolve `npm` tags != `gh` tags issue as well

Merge branch 'main' into vega-datasets

ba4491d

Merge branch 'main' into vega-datasets

1a4e107

dangotbanned added 2 commits November 23, 2024 19:51

Merge branch 'main' into vega-datasets

8ba48a9

Merge branch 'main' into vega-datasets

9d97096

mattijn mentioned this pull request Nov 24, 2024

Test suite is failing with Vegafusion 2 #3701

Open

dangotbanned added 7 commits November 24, 2024 13:50

Merge remote-tracking branch 'upstream/main' into vega-datasets

a698de9

revert(ruff): Ignore 0.8.0 violations

c907dc5

f21b52b

revert: Remove _readers._filter

a3b38c4

Feature has been adopted upstream in narwhals-dev/narwhals#1417

feat: Adds example and tests for disabling caching

a6c5096

refactor: Tidy up DatasetCache

71423ea

docs: Finish Loader.cache

7dd9c18

Not using doctest style here, none of these return anything but I want them hinted at

refactor(typing): Use Mapping instead of dict

a982759

Mutability is not needed. Also see #3573

This was referenced Nov 24, 2024

Additional metadata for datasets vega/vega-datasets#629

Closed

breaking: Rename weather.json -> weekly-weather.json vega/vega-datasets#633

Closed

dangotbanned added 8 commits November 30, 2024 14:44

perf: Use to_list() for all backends

d20e9c1

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

feat(DRAFT): Utilize datapackage schemas in pandas backends

909e7d0

Provides a generalized solution to `pd.read_(csv|json)` requiring the names of date columns to attempt parsing. cc @joelostblom The solution is possible in large part to vega/vega-datasets#631 #3631 (comment)

Merge remote-tracking branch 'upstream/main' into vega-datasets

d93fda1

refactor(ruff): Apply TC006 fixes in new code

9274284

Related #3706

docs(DRAFT): Add notes on datapackage.features_typing

8e232b8

docs: Update Loader.from_backend example w/ dtypes

9330895

Related 909e7d0

feat: Use _pl_read_json_roundtrip instead of pl.read_json for `pyarrow`

caf534d

Provides better dtype inference

This was referenced Dec 5, 2024

Use a datetime column in flights-3m.parquet vega/vega-datasets#641

Closed

feat: Use a datetime column in flights-3m.parquet vega/vega-datasets#642

Merged

dangotbanned added 4 commits December 20, 2024 22:08

Merge branch 'main' into vega-datasets

9e1fd09

fix(ruff): resolve RUF043 warnings

d4930e7

https://github.com/vega/altair/actions/runs/12439154550/job/34732432411?pr=3631

build: run generate-schema-wrapper

5a31333

https://github.com/vega/altair/actions/runs/12439184312/job/34732516789?pr=3631

chore: update schemas

6080116

Changes from vega/vega-datasets#648 Currently pinned on `main` until `v3.0.0` introduces `datapackage.json` https://github.com/vega/vega-datasets/tree/main

dangotbanned mentioned this pull request Dec 21, 2024

3.0.0 Release vega/vega-datasets#654

Open

2 tasks

feat(typing): Update frictionless model hierarchy

897e8f9

- Adds some incomplete types for fields (`sources`, `licenses`) - Misc changes from vega/vega-datasets#651, vega/vega-datasets#643

dangotbanned mentioned this pull request Dec 22, 2024

refactor: Centralize Vega project versioning #3720

Open

5 tasks
