Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a datetime column in flights-3m.parquet #641

Closed
dangotbanned opened this issue Dec 5, 2024 · 0 comments · Fixed by #642
Closed

Use a datetime column in flights-3m.parquet #641

dangotbanned opened this issue Dec 5, 2024 · 0 comments · Fixed by #642

Comments

@dangotbanned
Copy link
Member

dangotbanned commented Dec 5, 2024

Originally posted by @dangotbanned in #627

In the screenshot from #627 (comment), you can observe a datetime represented in "MMDDHHMM" format.

After reading through #626, I see that this is one of 3 options:

Format options

New Generation Script

New file: flights.py

A comprehensive tool for processing U.S. DOT (BTS) On-Time Flight Performance data with enhanced flexibility and features. Creates date formats used in this repo.

Output Format Options

Format Example Description
MMDDHHMM 06142330 June 14, 23:30
ISO 2023/06/14 23:30 ISO-style datetime
Decimal 23.75 23:45 in decimal form

I've done some work looking into the cost of storing the column as a datetime (not a formatted string)

Conversion

import polars as pl
from altair.datasets import Loader

load = Loader.from_backend("polars")
flights: pl.LazyFrame = load("flights-3m", tag="v2.11.0").lazy()

# NOTE: Assuming year is `2001`
date_conv: pl.Expr = (
    pl.concat_str(pl.lit("01"), "date").str.to_datetime("%y%m%d%H%M").alias("date")
)
delay_conv: pl.Expr = pl.duration(minutes=pl.col("delay")).alias("delay")

flights_date: pl.DataFrame = flights.with_columns(date_conv).collect()
flights_date_delay: pl.DataFrame = flights.with_columns(date_conv, delay_conv).collect()

Size comparison

Open the Chart in the Vega Editor

Image

python library support

polars

shape: (3_000_000, 5)
┌─────────────────────┬──────────────┬──────────┬────────┬─────────────┐
│ date                ┆ delay        ┆ distance ┆ origin ┆ destination │
│ ---                 ┆ ---          ┆ ---      ┆ ---    ┆ ---         │
│ datetime[μs]        ┆ duration[μs] ┆ i64      ┆ str    ┆ str         │
╞═════════════════════╪══════════════╪══════════╪════════╪═════════════╡
│ 2001-01-01 00:01:00 ┆ -13m         ┆ 2345     ┆ ANC    ┆ LAX         │
│ 2001-01-01 00:03:00 ┆ -20m         ┆ 1946     ┆ LAX    ┆ ATL         │
│ 2001-01-01 00:10:00 ┆ 0µs          ┆ 1671     ┆ PHX    ┆ DTW         │
│ 2001-01-01 00:20:00 ┆ 10m          ┆ 1709     ┆ SEA    ┆ STL         │
│ 2001-01-01 00:22:00 ┆ -13m         ┆ 1736     ┆ SFO    ┆ STL         │
│ …                   ┆ …            ┆ …        ┆ …      ┆ …           │
│ 2001-06-30 23:58:00 ┆ 30m          ┆ 215      ┆ ATL    ┆ SAV         │
│ 2001-06-30 23:59:00 ┆ 1h 27m       ┆ 496      ┆ PIT    ┆ BOS         │
│ 2001-06-30 23:59:00 ┆ 9m           ┆ 151      ┆ ATL    ┆ HSV         │
│ 2001-06-30 23:59:00 ┆ 1h 23m       ┆ 641      ┆ DFW    ┆ DEN         │
│ 2001-06-30 23:59:00 ┆ 16m          ┆ 594      ┆ ATL    ┆ DTW         │
└─────────────────────┴──────────────┴──────────┴────────┴─────────────┘

pandas

                       date             delay  distance origin destination
0       2001-01-01 00:01:00 -1 days +23:47:00      2345    ANC         LAX
1       2001-01-01 00:03:00 -1 days +23:40:00      1946    LAX         ATL
2       2001-01-01 00:10:00   0 days 00:00:00      1671    PHX         DTW
3       2001-01-01 00:20:00   0 days 00:10:00      1709    SEA         STL
4       2001-01-01 00:22:00 -1 days +23:47:00      1736    SFO         STL
...                     ...               ...       ...    ...         ...
2999995 2001-06-30 23:58:00   0 days 00:30:00       215    ATL         SAV
2999996 2001-06-30 23:59:00   0 days 01:27:00       496    PIT         BOS
2999997 2001-06-30 23:59:00   0 days 00:09:00       151    ATL         HSV
2999998 2001-06-30 23:59:00   0 days 01:23:00       641    DFW         DEN
2999999 2001-06-30 23:59:00   0 days 00:16:00       594    ATL         DTW

[3000000 rows x 5 columns]

pyarrow

pyarrow.Table
date: timestamp[us]
delay: duration[us]
distance: int64
origin: large_string
destination: large_string
----
date: [[2001-01-01 00:01:00.000000,2001-01-01 00:03:00.000000,2001-01-01 00:10:00.000000,2001-01-01 00:20:00.000000,2001-01-01 00:22:00.000000,...,2001-01-09 06:22:00.000000,2001-01-09 06:22:00.000000,2001-01-09 06:22:00.000000,2001-01-09 06:22:00.000000,2001-01-09 06:22:00.000000],[2001-01-09 06:22:00.000000,2001-01-09 06:22:00.000000,2001-01-09 06:22:00.000000,2001-01-09 06:22:00.000000,2001-01-09 06:22:00.000000,...,2001-01-16 22:49:00.000000,2001-01-16 22:49:00.000000,2001-01-16 22:49:00.000000,2001-01-16 22:49:00.000000,2001-01-16 22:49:00.000000],...,[2001-06-22 15:10:00.000000,2001-06-22 15:10:00.000000,2001-06-22 15:10:00.000000,2001-06-22 15:10:00.000000,2001-06-22 15:10:00.000000,...,2001-06-30 10:47:00.000000,2001-06-30 10:47:00.000000,2001-06-30 10:47:00.000000,2001-06-30 10:47:00.000000,2001-06-30 10:47:00.000000],[2001-06-30 10:47:00.000000,2001-06-30 10:47:00.000000,2001-06-30 10:47:00.000000,2001-06-30 10:47:00.000000,2001-06-30 10:47:00.000000,...,2001-06-30 23:58:00.000000,2001-06-30 23:59:00.000000,2001-06-30 23:59:00.000000,2001-06-30 23:59:00.000000,2001-06-30 23:59:00.000000]]
delay: [[-780000000,-1200000000,0,600000000,-780000000,...,-240000000,480000000,1380000000,720000000,-420000000],[-180000000,1320000000,720000000,600000000,480000000,...,-240000000,-1020000000,-2700000000,-540000000,-1080000000],...,[2220000000,960000000,360000000,240000000,-240000000,...,240000000,-360000000,360000000,360000000,360000000],[-600000000,-300000000,-660000000,-660000000,-360000000,...,1800000000,5220000000,540000000,4980000000,960000000]]
distance: [[2345,1946,1671,1709,1736,...,469,621,581,1679,324],[177,383,813,432,594,...,116,1865,1750,1750,1900],...,[1557,354,317,248,104,...,594,1024,1121,2329,867],[834,813,810,1972,200,...,215,496,151,641,594]]
origin: [["ANC","LAX","PHX","SEA","SFO",...,"IAH","ORD","ATL","SMF","GPT"],["TYS","BTR","PHL","DFW","ORD",...,"CVG","SAN","DTW","LAS","LAX"],...,["PHL","DFW","MKE","SAT","SHV",...,"ATL","EWR","BOS","LAX","ORD"],["MCO","STL","CVG","MIA","BNA",...,"ATL","PIT","ATL","DFW","ATL"]]
destination: [["LAX","ATL","DTW","STL","STL",...,"MEM","BWI","FLL","STL","MEM"],["CLT","DFW","STL","MEM","MDT",...,"CMH","CVG","LAS","DTW","CVG"],...,["DEN","CRP","STL","DAL","MLU",...,"DTW","PBI","MCO","BWI","BOS"],["PIT","PHL","PWM","PHX","MEM",...,"SAV","BOS","HSV","DEN","DTW"]]

Admittedly, we'd still need add (if possible) support for duration/timedelta on the altair-side:

However, the date column is supported, fairly cheap and consistently parsed.

That last point is something we've encountered as an issue with other datasets (see vega/altair#3631 (comment))

dangotbanned added a commit that referenced this issue Dec 6, 2024
I've been unable to answer questions I had on recreating this dataset
(#642 (comment))

The *safest* options seems to be just updating the output in-place

#641
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant