
feat: Use a datetime column in flights-3m.parquet #642

Merged
merged 7 commits into from
Dec 9, 2024

Conversation

@dangotbanned dangotbanned commented Dec 5, 2024

Closes #641

Related

Description

If I've understood the existing script correctly, this should produce a datetime/timestamp column for "date".
More precisely, it will no longer convert the dates to a formatted string for .parquet, since that format supports proper data types.

Question

@dsmedia I haven't run the script, as I'm unsure about:

  • What environment is needed for the script?
  • How exactly is the linked data selected?
    • The doc says 1 per month
    • flights-3m.parquet has 6 months
    • flights-?k.json have 3 months?
  • What were the most recent CLI args used?

Update

In a908adb I opted to make the change in place.

Fix script

from pathlib import Path
import polars as pl

def fix_inplace_3m_parquet(
    fp: Path, /, year: int | str, *, column: str = "date"
) -> None:
    # Two-digit year used for the `%y` directive, e.g. 2001 -> "01".
    year_zero_pad = str(year)[-2:]
    if len(year_zero_pad) != 2:
        raise ValueError(year)

    ldf = pl.scan_parquet(fp)
    # Only convert if the column is still stored as strings.
    if not ldf.collect_schema()[column].is_temporal():
        date_conv: pl.Expr = (
            pl.concat_str(pl.lit(year_zero_pad), column)
            .str.to_datetime("%y%m%d%H%M")
            .alias(column)
        )
        ldf.with_columns(date_conv).collect().write_parquet(fp, compression_level=22)

>>> fix_inplace_3m_parquet(
...     Path.cwd() / "data" / "flights-3m.parquet", year=2001, column="date"
... )

Example

from pathlib import Path
import polars as pl

>>> pl.read_parquet(Path.cwd() / "data" / "flights-3m.parquet")
shape: (3_000_000, 5)
┌─────────────────────┬───────┬──────────┬────────┬─────────────┐
│ date                ┆ delay ┆ distance ┆ origin ┆ destination │
│ ---                 ┆ ---   ┆ ---      ┆ ---    ┆ ---         │
│ datetime[μs]        ┆ i64   ┆ i64      ┆ str    ┆ str         │
╞═════════════════════╪═══════╪══════════╪════════╪═════════════╡
│ 2001-01-01 00:01:00 ┆ -13   ┆ 2345     ┆ ANC    ┆ LAX         │
│ 2001-01-01 00:03:00 ┆ -20   ┆ 1946     ┆ LAX    ┆ ATL         │
│ 2001-01-01 00:10:00 ┆ 0     ┆ 1671     ┆ PHX    ┆ DTW         │
│ 2001-01-01 00:20:00 ┆ 10    ┆ 1709     ┆ SEA    ┆ STL         │
│ 2001-01-01 00:22:00 ┆ -13   ┆ 1736     ┆ SFO    ┆ STL         │
│ …                   ┆ …     ┆ …        ┆ …      ┆ …           │
│ 2001-06-30 23:58:00 ┆ 30    ┆ 215      ┆ ATL    ┆ SAV         │
│ 2001-06-30 23:59:00 ┆ 87    ┆ 496      ┆ PIT    ┆ BOS         │
│ 2001-06-30 23:59:00 ┆ 9     ┆ 151      ┆ ATL    ┆ HSV         │
│ 2001-06-30 23:59:00 ┆ 83    ┆ 641      ┆ DFW    ┆ DEN         │
│ 2001-06-30 23:59:00 ┆ 16    ┆ 594      ┆ ATL    ┆ DTW         │
└─────────────────────┴───────┴──────────┴────────┴─────────────┘

I've been unable to answer the questions I had about recreating this dataset
(#642 (comment)).

The *safest* option seems to be simply updating the output in place.

@dangotbanned dangotbanned marked this pull request as ready for review December 6, 2024 14:41
@dangotbanned dangotbanned requested a review from domoritz December 6, 2024 14:47
domoritz commented Dec 6, 2024

It would be good to add the dependencies to the script so they can run with uvx.

@dangotbanned

> It would be good to add the dependencies to the script so they can run with uvx.

@domoritz I've added what I assume would work, but the script depends on manually sourced data, so it won't be able to run in CI.

I've been exploring rewriting the script in a more reproducible way (locally), so I could follow this up with another PR?


dsmedia commented Dec 6, 2024 via email

dangotbanned commented Dec 6, 2024

> Thanks for your help there. One issue is there does not appear to be an API or programmatic way to retrieve the raw files.

@dsmedia I've got that part solved 😅

Code block

This is part of the work I've been doing locally.

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "typing-extensions>=4.12.2",
# ]
# ///
from __future__ import annotations

from collections import deque
import datetime as dt
from pathlib import Path
from typing import TYPE_CHECKING
from urllib import request

if TYPE_CHECKING:
    from collections.abc import Iterable
    from typing import Literal, Any

    from typing_extensions import TypeIs

BASE_URL = "https://www.transtats.bts.gov/"
ROUTE_ZIP = f"{BASE_URL}PREZIP/"
REPORTING_PREFIX = "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_"
ZIP: Literal[".zip"] = ".zip"

type MonthNumber = Literal[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
type YearAvailable = int

def _is_month_number(obj: Any) -> TypeIs[MonthNumber]:
    return obj in {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

def _is_year_available(obj: Any) -> TypeIs[YearAvailable]:
    return isinstance(obj, int) and obj >= 1987 and obj <= dt.date.today().year

def _download_zip(output_dir: Path, url: str, /) -> Path:
    fp = output_dir / url.removeprefix(ROUTE_ZIP)
    fp.touch()
    with request.urlopen(url) as response:
        fp.write_bytes(response.read())
    return fp


def download_all_zip(
    output_dir: Path, /, year: YearAvailable, months: Iterable[MonthNumber]
) -> deque[Path]:
    output_fps = deque[Path]()
    output_dir.mkdir(exist_ok=True)
    if _is_year_available(year):
        base: str = f"{ROUTE_ZIP}{REPORTING_PREFIX}{year}_"
    else:
        raise ValueError(year)
    for month_no in months:
        if _is_month_number(month_no):
            fp = _download_zip(output_dir, f"{base}{month_no}{ZIP}")
            output_fps.append(fp)
        else:
            raise ValueError(month_no)
    return output_fps

The rest is utilizing the output, which is pretty close to finished:

[screenshot]

So all you'd need to do here would be:

>>> some_dir = Path.cwd() / ".vega-datasets"
>>> download_all_zip(some_dir, 2001, [1, 2, 3, 4, 5, 6])

And you'll have the first 6 months from 2001 downloaded.

It does take a while, but it works.
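For the "utilizing the output" part, the extraction step might be sketched like this; `extract_csvs` is a hypothetical helper of mine, not part of the actual script:

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_dir: Path, out_dir: Path) -> list[Path]:
    """Extract every .csv member from the downloaded BTS zip files.

    The BTS zips also contain a readme .html, which is skipped here.
    """
    out_dir.mkdir(exist_ok=True)
    extracted: list[Path] = []
    for zp in sorted(zip_dir.glob("*.zip")):
        with zipfile.ZipFile(zp) as zf:
            for name in zf.namelist():
                if name.endswith(".csv"):
                    extracted.append(Path(zf.extract(name, out_dir)))
    return extracted
```

From there, the per-month CSVs could be scanned lazily with polars and combined.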


dsmedia commented Dec 8, 2024

Regarding the number of source months per dataset:

  • All except flights-3m: three months (Jan-Mar) were chosen to match the timescale of the existing datasets. (The script was created to replicate the existing datasets.)

  • flights-3m: six months of records (Jan-Jun) were needed, as this was the minimum timescale required to generate a dataset with 3 million records.

@domoritz domoritz left a comment


Can we

  • put the commands to generate the parquet/csv etc files into the sources doc so that one can reproduce the files?

dangotbanned commented Dec 9, 2024

Thanks for approving @domoritz

> Can we
>
>   • put the commands to generate the parquet/csv etc files into the sources doc so that one can reproduce the files?

I'm going to follow this up with another PR in the next few days with a fully reproducible flights.py.

Preview

Inputs declared in TOML

[[specs]]
range = [2001-01-01, 2001-03-31]
n_rows = 1_000
suffix = ".csv"
dt_format = "ISO"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 2_000
suffix = ".json"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 5_000
suffix = ".json"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 10_000
suffix = ".json"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 20_000
suffix = ".json"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 200_000
suffix = ".json"
dt_format = "decimal"
columns=["delay", "distance", "time"]

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 200_000
suffix = ".arrow"
dt_format = "decimal"
columns=["delay", "distance", "time"]

[[specs]]
start = 2001-01-01
end = 2001-06-30
n_rows = 3_000_000
suffix = ".parquet"

Downloading & sharing data

[screenshots]

Mix & match outputs

[screenshot]

I just wanted to share this ahead of time to avoid repeating any work 🙂

@dangotbanned dangotbanned merged commit 0c5dc68 into main Dec 9, 2024
2 checks passed
@dangotbanned dangotbanned deleted the parquet-datetime branch December 9, 2024 18:44
dangotbanned added a commit that referenced this pull request Dec 19, 2024