
feat: Use a datetime column in flights-3m.parquet #642

Merged
merged 7 commits into from
Dec 9, 2024

Conversation

@dangotbanned dangotbanned commented Dec 5, 2024

Closes #641

Related

Description

If I've understood the existing script correctly, this should produce a datetime/timestamp column for "date".
More precisely, it will no longer convert the dates to a formatted string for .parquet, since that format supports proper data types.

Question

@dsmedia I haven't run the script, as I'm unsure about:

  • What environment is needed for the script?
  • How exactly is the linked data selected?
    • The doc says 1 per month
    • flights-3m.parquet has 6 months
    • flights-?k.json have 3 months?
  • What were the most recent CLI args used?

Update

In a908adb I opted to make the change in place.

Fix script

from pathlib import Path
import polars as pl

def fix_inplace_3m_parquet(
    fp: Path, /, year: int | str, *, column: str = "date"
) -> None:
    # Two-digit year used for the `%y` directive, e.g. 2001 -> "01".
    year_zero_pad = str(year)[-2:]
    if len(year_zero_pad) != 2:
        raise ValueError(year)

    ldf = pl.scan_parquet(fp)
    # Only convert if the column is still stored as strings.
    if not ldf.collect_schema()[column].is_temporal():
        date_conv: pl.Expr = (
            pl.concat_str(pl.lit(year_zero_pad), column)
            .str.to_datetime("%y%m%d%H%M")
            .alias(column)
        )
        ldf.with_columns(date_conv).collect().write_parquet(fp, compression_level=22)

>>> fix_inplace_3m_parquet(
...     Path.cwd() / "data" / "flights-3m.parquet", year=2001, column="date"
... )

Example

from pathlib import Path
import polars as pl

>>> pl.read_parquet(Path.cwd() / "data" / "flights-3m.parquet")
shape: (3_000_000, 5)
┌─────────────────────┬───────┬──────────┬────────┬─────────────┐
│ date                ┆ delay ┆ distance ┆ origin ┆ destination │
│ ---                 ┆ ---   ┆ ---      ┆ ---    ┆ ---         │
│ datetime[μs]        ┆ i64   ┆ i64      ┆ str    ┆ str         │
╞═════════════════════╪═══════╪══════════╪════════╪═════════════╡
│ 2001-01-01 00:01:00 ┆ -13   ┆ 2345     ┆ ANC    ┆ LAX         │
│ 2001-01-01 00:03:00 ┆ -20   ┆ 1946     ┆ LAX    ┆ ATL         │
│ 2001-01-01 00:10:00 ┆ 0     ┆ 1671     ┆ PHX    ┆ DTW         │
│ 2001-01-01 00:20:00 ┆ 10    ┆ 1709     ┆ SEA    ┆ STL         │
│ 2001-01-01 00:22:00 ┆ -13   ┆ 1736     ┆ SFO    ┆ STL         │
│ …                   ┆ …     ┆ …        ┆ …      ┆ …           │
│ 2001-06-30 23:58:00 ┆ 30    ┆ 215      ┆ ATL    ┆ SAV         │
│ 2001-06-30 23:59:00 ┆ 87    ┆ 496      ┆ PIT    ┆ BOS         │
│ 2001-06-30 23:59:00 ┆ 9     ┆ 151      ┆ ATL    ┆ HSV         │
│ 2001-06-30 23:59:00 ┆ 83    ┆ 641      ┆ DFW    ┆ DEN         │
│ 2001-06-30 23:59:00 ┆ 16    ┆ 594      ┆ ATL    ┆ DTW         │
└─────────────────────┴───────┴──────────┴────────┴─────────────┘

I've been unable to answer the questions I had about recreating this dataset
(#642 (comment)).

The *safest* option seems to be simply updating the output in place.

@dangotbanned dangotbanned marked this pull request as ready for review December 6, 2024 14:41
@dangotbanned dangotbanned requested a review from domoritz December 6, 2024 14:47
domoritz commented Dec 6, 2024

It would be good to add the dependencies to the script so they can run with uvx.

@dangotbanned

> It would be good to add the dependencies to the script so they can run with uvx.

@domoritz I've added what I assume would work, but the script depends on manually sourced data, so it won't be able to run in CI.

I've been exploring rewriting the script in a more reproducible way (locally), so I could follow this up with another PR?


dsmedia commented Dec 6, 2024 via email

dangotbanned commented Dec 6, 2024

> Thanks for your help there. One issue is there does not appear to be an API or programmatic way to retrieve the raw files.

@dsmedia I've got that part solved 😅

Code block

This is part of the work I've been doing locally.

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "typing-extensions>=4.12.2",
# ]
# ///
from __future__ import annotations

from collections import deque
import datetime as dt
from pathlib import Path
from typing import TYPE_CHECKING
from urllib import request

if TYPE_CHECKING:
    from collections.abc import Iterable
    from typing import Literal, Any

    from typing_extensions import TypeIs

BASE_URL = "https://www.transtats.bts.gov/"
ROUTE_ZIP = f"{BASE_URL}PREZIP/"
REPORTING_PREFIX = "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_"
ZIP: Literal[".zip"] = ".zip"

type MonthNumber = Literal[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
type YearAvailable = int

def _is_month_number(obj: Any) -> TypeIs[MonthNumber]:
    return obj in {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

def _is_year_available(obj: Any) -> TypeIs[YearAvailable]:
    return isinstance(obj, int) and obj >= 1987 and obj <= dt.date.today().year

def _download_zip(output_dir: Path, url: str, /) -> Path:
    fp = output_dir / url.removeprefix(ROUTE_ZIP)
    fp.touch()
    with request.urlopen(url) as response:
        fp.write_bytes(response.read())
    return fp


def download_all_zip(
    output_dir: Path, /, year: YearAvailable, months: Iterable[MonthNumber]
) -> deque[Path]:
    output_fps = deque[Path]()
    output_dir.mkdir(exist_ok=True)
    if _is_year_available(year):
        base: str = f"{ROUTE_ZIP}{REPORTING_PREFIX}{year}_"
    else:
        raise ValueError(year)
    for month_no in months:
        if _is_month_number(month_no):
            fp = _download_zip(output_dir, f"{base}{month_no}{ZIP}")
            output_fps.append(fp)
        else:
            raise ValueError(month_no)
    return output_fps

The rest is utilizing the output, which is pretty close to finished:

[screenshot]

So all you'd need to do here would be:

>>> some_dir = Path.cwd() / ".vega-datasets"
>>> download_all_zip(some_dir, 2001, [1, 2, 3, 4, 5, 6])

And you'll have the first 6 months from 2001 downloaded.

It does take a while, but it works.
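For the "utilizing the output" part, the extraction step might be sketched like this; `extract_csvs` is a hypothetical helper of mine, not part of the actual script:

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_dir: Path, out_dir: Path) -> list[Path]:
    """Extract every .csv member from the downloaded BTS zip files.

    The BTS zips also contain a readme .html, which is skipped here.
    """
    out_dir.mkdir(exist_ok=True)
    extracted: list[Path] = []
    for zp in sorted(zip_dir.glob("*.zip")):
        with zipfile.ZipFile(zp) as zf:
            for name in zf.namelist():
                if name.endswith(".csv"):
                    extracted.append(Path(zf.extract(name, out_dir)))
    return extracted
```

From there, the per-month CSVs could be scanned lazily with polars and combined.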


dsmedia commented Dec 8, 2024

Regarding the number of source months per dataset:

  • All except flights-3m: three months (Jan-Mar) were chosen to match the timescale of the existing datasets. (The script was created to replicate the existing datasets.)

  • flights-3m: six months of records (Jan-Jun) were needed, as this was the minimum timescale required to generate a dataset with 3 million records.

@domoritz domoritz left a comment


Can we

  • put the commands to generate the parquet/csv etc files into the sources doc so that one can reproduce the files?

dangotbanned commented Dec 9, 2024

Thanks for approving @domoritz

> Can we
>
>   • put the commands to generate the parquet/csv etc files into the sources doc so that one can reproduce the files?

I'm going to follow this up with another PR in the next few days with a fully reproducible flights.py.

Preview

Inputs declared in TOML

[[specs]]
range = [2001-01-01, 2001-03-31]
n_rows = 1_000
suffix = ".csv"
dt_format = "ISO"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 2_000
suffix = ".json"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 5_000
suffix = ".json"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 10_000
suffix = ".json"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 20_000
suffix = ".json"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 200_000
suffix = ".json"
dt_format = "decimal"
columns=["delay", "distance", "time"]

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 200_000
suffix = ".arrow"
dt_format = "decimal"
columns=["delay", "distance", "time"]

[[specs]]
start = 2001-01-01
end = 2001-06-30
n_rows = 3_000_000
suffix = ".parquet"

Downloading & sharing data

[screenshots]

Mix & match outputs

[screenshot]

I just wanted to share this ahead of time to avoid repeating any work 🙂

@dangotbanned dangotbanned merged commit 0c5dc68 into main Dec 9, 2024
2 checks passed
@dangotbanned dangotbanned deleted the parquet-datetime branch December 9, 2024 18:44
dangotbanned added a commit that referenced this pull request Dec 19, 2024