feat: Use a datetime column in flights-3m.parquet
#642
The first definition would be overwritten by the second https://github.com/vega/vega-datasets/blob/ca792c8a973ff0ec75f54d4228a10164c70f82cb/scripts/flights.py#L449 https://github.com/vega/vega-datasets/blob/ca792c8a973ff0ec75f54d4228a10164c70f82cb/scripts/flights.py#L691
- The `pandas` API `.to_parquet()` already does the `pyarrow` calls under the hood
- Logging reduced to one formatted message
I've been unable to answer questions I had on recreating this dataset (#642 (comment)). The *safest* option seems to be just updating the output in-place #641
It would be good to add the dependencies to the script so they can run with uvx.
@domoritz I've added what I assume would work, but the script is dependent on manually sourced data, so it won't be able to run in CI. I've been exploring rewriting the script in a more reproducible way (locally), so I could follow this up with another PR?
Thanks for your help there. One issue is there does not appear to be an API
or programmatic way to retrieve the raw files.
This is part of the work I've been doing locally.

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "typing-extensions>=4.12.2",
# ]
# ///
from __future__ import annotations

from collections import deque
import datetime as dt
from pathlib import Path
from typing import TYPE_CHECKING
from urllib import request

if TYPE_CHECKING:
    from collections.abc import Iterable
    from typing import Literal, Any
    from typing_extensions import TypeIs

BASE_URL = "https://www.transtats.bts.gov/"
ROUTE_ZIP = f"{BASE_URL}PREZIP/"
REPORTING_PREFIX = "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_"
ZIP: Literal[".zip"] = ".zip"

type MonthNumber = Literal[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
type YearAvailable = int


def _is_month_number(obj: Any) -> TypeIs[MonthNumber]:
    return obj in {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}


def _is_year_available(obj: Any) -> TypeIs[YearAvailable]:
    # The source data starts in 1987 and runs up to the current year.
    return isinstance(obj, int) and obj >= 1987 and obj <= dt.date.today().year


def _download_zip(output_dir: Path, url: str, /) -> Path:
    # Name the local file after the remote one.
    fp = output_dir / url.removeprefix(ROUTE_ZIP)
    fp.touch()
    with request.urlopen(url) as response:
        fp.write_bytes(response.read())
    return fp


def download_all_zip(
    output_dir: Path, /, year: YearAvailable, months: Iterable[MonthNumber]
) -> deque[Path]:
    output_fps = deque[Path]()
    output_dir.mkdir(exist_ok=True)
    if _is_year_available(year):
        base: str = f"{ROUTE_ZIP}{REPORTING_PREFIX}{year}_"
    else:
        raise ValueError(year)
    # One zip archive per requested month.
    for month_no in months:
        if _is_month_number(month_no):
            fp = _download_zip(output_dir, f"{base}{month_no}{ZIP}")
            output_fps.append(fp)
        else:
            raise ValueError(month_no)
    return output_fps
```

The rest is utilizing the output, which is pretty close to finished:

So all you'd need to do here would be:

```python
>>> some_dir = Path.cwd() / ".vega-datasets"
>>> download_all_zip(some_dir, 2001, [1, 2, 3, 4, 5, 6])
```

And you'll have the first 6 months from 2001.

It does take a while, but it works
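Since the downloads are slow, it can help to preview which URLs a call would request before fetching anything. The following is a small sketch that reuses the constants from the script above; `build_urls` is a hypothetical helper that is not part of this PR, and it performs no network access.

```python
from collections.abc import Iterable

# Constants copied from the script above.
BASE_URL = "https://www.transtats.bts.gov/"
ROUTE_ZIP = f"{BASE_URL}PREZIP/"
REPORTING_PREFIX = "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_"
ZIP = ".zip"


def build_urls(year: int, months: Iterable[int]) -> list[str]:
    """Hypothetical helper: list the URLs that would be requested, in order."""
    urls = []
    for month_no in months:
        if month_no not in range(1, 13):
            raise ValueError(month_no)
        urls.append(f"{ROUTE_ZIP}{REPORTING_PREFIX}{year}_{month_no}{ZIP}")
    return urls


urls = build_urls(2001, [1, 2, 3])
```

Printing `urls` before running the real download gives a quick check that the year and month numbers produce the expected file names.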
Regarding the number of source months per dataset:
Can we
- get rid of https://github.com/vega/vega-datasets/blob/main/scripts/flights.js?
- put the commands to generate the parquet/csv etc files into the sources doc so that one can reproduce the files?
Thanks for approving @domoritz
I'm going to follow this up with another PR in the next few days with a fully reproducible
Preview
Inputs declared in
Closes #641

Related
- altair.datasets altair#3631 (comment)

Description
If I've understood the existing script correctly, then this should produce a datetime/timestamp column for `"date"`.
Or more correctly, it won't convert to a format string for `.parquet`, as it supports data types.

Question
@dsmedia I haven't run the script, as I'm unsure on:
- `flights-3m.parquet` has 6 months
- `flights-?k.json` have 3 months?

Update
In (a908adb) I opted for making the change in place.
Fix script
Example