feat: Improve flights.* dataset reproducibility #645

Merged: 29 commits, Dec 20, 2024
Changes from 15 commits
ad1b862
feat(DRAFT): Improve `flights.*` dataset reproducibility
dangotbanned Dec 10, 2024
fb3ccc6
build(DRAFT): Generate ISO datetime comparison
dangotbanned Dec 11, 2024
2b1be70
refactor(ruff): Adjust config and fix warnings
dangotbanned Dec 12, 2024
aede7f6
feat(perf): Async requests, use `gzip` instead of `.zip`
dangotbanned Dec 12, 2024
402c2b0
refactor: Reorganize, add `_write_rezip_async` doc
dangotbanned Dec 15, 2024
b57c02f
docs(DRAFT): Improve docs
dangotbanned Dec 15, 2024
efa417c
Merge remote-tracking branch 'upstream/main' into flights-repro
dangotbanned Dec 15, 2024
f646155
refactor: Tidy up, improve doc for `Flights.download_sources`
dangotbanned Dec 16, 2024
bfff9c4
docs: Add/amend some simple docs
dangotbanned Dec 16, 2024
0a19bae
refactor: move `"flights-"` to `Spec._name_prefix`
dangotbanned Dec 16, 2024
4a05b51
docs: fill out more docs
dangotbanned Dec 16, 2024
49205dd
fix: replace `app` with `self`
dangotbanned Dec 16, 2024
3418c5a
refactor: Replace `DateTimeFormat`, `DTF_TO_FMT`, `_transform_temporal`
dangotbanned Dec 16, 2024
0c11ec4
refactor: move global scoped code into `main`
dangotbanned Dec 16, 2024
c668fd7
feat(perf): Store `.parquet` instead of `.csv.gz`
dangotbanned Dec 17, 2024
a002c39
refactor: rename, re-doc `_clean_source` -> `SourceMap.clean`
dangotbanned Dec 17, 2024
2d47fdd
docs: finish `DateRange` doc
dangotbanned Dec 17, 2024
99169b6
docs: add `Flights.scan_sources` doc
dangotbanned Dec 17, 2024
7c5eed4
docs: finish `Flights` doc
dangotbanned Dec 17, 2024
859975e
refactor: reorganize, finish docs for `SourceMap`
dangotbanned Dec 18, 2024
f4bbda8
docs: add module-level doc
dangotbanned Dec 18, 2024
951fe8c
refactor(typing): extend `DateTimeFormat` to include `None`
dangotbanned Dec 18, 2024
a48eb8f
refactor: remove unused `PlScanCsv`
dangotbanned Dec 18, 2024
8481618
Merge branch 'main' into flights-repro
dangotbanned Dec 18, 2024
8ec3adf
feat: improves `Rows` validation
dangotbanned Dec 19, 2024
1962fea
chore: replace `flights.py`
dangotbanned Dec 19, 2024
05707d9
chore: remove `flights.js`
dangotbanned Dec 19, 2024
7c49683
fix: regen with fixed random seed
dangotbanned Dec 19, 2024
cd68193
revert: remove `flights-1k.csv`
dangotbanned Dec 19, 2024
57 changes: 57 additions & 0 deletions _data/flights.toml
@@ -0,0 +1,57 @@
[[specs]]
# This spec won't be part of the final PR.
# Using it to demonstrate the deviation from [ISO_8601](https://en.wikipedia.org/wiki/ISO_8601)
range = [2001-01-01, 2001-03-31]
n_rows = 1_000
suffix = ".csv"
dt_format = "iso"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 2_000
suffix = ".json"
dt_format = "%Y/%m/%d %H:%M"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 5_000
suffix = ".json"
dt_format = "%Y/%m/%d %H:%M"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 10_000
suffix = ".json"
dt_format = "%Y/%m/%d %H:%M"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 20_000
suffix = ".json"
dt_format = "%Y/%m/%d %H:%M"

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 200_000
suffix = ".json"
dt_format = "decimal"
columns = ["delay", "distance", "time"]

[[specs]]
start = 2001-01-01
end = 2001-03-31
n_rows = 200_000
suffix = ".arrow"
dt_format = "decimal"
columns = ["delay", "distance", "time"]

[[specs]]
start = 2001-01-01
end = 2001-06-30
n_rows = 3_000_000
suffix = ".parquet"
2 changes: 1 addition & 1 deletion data/flights-10k.json

Large diffs are not rendered by default.