Add cron-like scheduling into the ETL for regular updates #3016
Comments
Note from Mojmir: This is currently managed from Buildkite, with some bash scripts.
@Marigold is testing some tooling for ETL (Prefect), so that we can orchestrate some processes. If it works, it might be easier to use it for automated updates. Consequently, the roadmap and progress on this issue might change soon.
Thanks for writing this up! You're right that the current way of running bash scripts through Buildkite is becoming a bottleneck. For instance, the wildfires update has to run at a different time, because their endpoint seems to time out early in the morning (weird, right?). Rather than reinventing the wheel, we could try Prefect for scheduling tasks (which would likely be run by Buildkite).
From triage discussion:

A maximal version
Put these into the ETL as a special step type, and put the scripts in a standardised place. We would need a protocol for these steps, which basically lets you ask if a step is out of date and lets you run it, with a place for code for those steps. Maybe this would mess with our URI scheme though; we'd have to check.

A minimal version
Cron, basically, for ad-hoc scripts inside the ETL repo (Buildkite / Prefect / etc.).

Open questions
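The "maximal version" implies an interface that every scheduled step would implement: ask whether it is out of date, and run it. A minimal sketch of such a protocol (the names `ScheduledStep`, `WeeklySnapshotStep` and `run_if_stale` are hypothetical, not from the ETL codebase):

```python
from datetime import datetime, timedelta, timezone
from typing import Protocol


class ScheduledStep(Protocol):
    """Hypothetical protocol for a cron-like ETL step."""

    def is_out_of_date(self) -> bool:
        """Return True if the step's output is stale and should be re-run."""
        ...

    def run(self) -> None:
        """Fetch fresh data and rebuild the step's output."""
        ...


class WeeklySnapshotStep:
    """Example implementation: stale once the last run is over a week old."""

    def __init__(self, last_run: datetime, max_age: timedelta = timedelta(days=7)):
        self.last_run = last_run
        self.max_age = max_age

    def is_out_of_date(self) -> bool:
        return datetime.now(timezone.utc) - self.last_run > self.max_age

    def run(self) -> None:
        # A real step would download a fresh snapshot here; for the sketch
        # we only record the run time.
        self.last_run = datetime.now(timezone.utc)


def run_if_stale(step: ScheduledStep) -> bool:
    """Run the step only when it reports being out of date."""
    if step.is_out_of_date():
        step.run()
        return True
    return False
```

Because `ScheduledStep` is a structural protocol, any step class with these two methods qualifies; a scheduler loop (Buildkite, Prefect, plain cron) would just call `run_if_stale` over all registered steps.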
Put this on the agenda for next Monday's data architecture chat.
Maybe we should discuss this during the offsite, in person (à la Chart Diff session).
Could be solved as a side-effect of #3339 if we did that project. Closing this one for now, but feel free to re-open if it becomes urgent again.
Why

ETL was conceived to maintain and publish >yearly updated datasets. This covers most of our work, but in some instances we need to update datasets more frequently.

Currently, such frequently-updated datasets use custom workarounds. Instead, we should have a stable and general solution for all of them.

This will become more relevant as we migrate most of our COVID-19 work to ETL.
Technical notes

- Frequently updated datasets (`flunet`, `covid`, `wildfires`) use a `latest` version, so when their `latest` snapshot gets updated, it triggers a cascade of updates all the way to grapher and the site.
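That cascade can be made concrete: given the ETL's dependency graph, every step downstream of an updated `latest` snapshot needs rebuilding. A minimal sketch with an invented graph (the URIs below are illustrative, not real steps):

```python
from collections import deque


def downstream_steps(dag: dict[str, set[str]], updated: str) -> set[str]:
    """Return all steps that transitively depend on `updated`.

    `dag` maps each step URI to the set of URIs it depends on.
    """
    # Invert the DAG: for each step, who depends on it?
    dependents: dict[str, set[str]] = {}
    for step, deps in dag.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(step)

    # Breadth-first walk from the updated snapshot.
    cascade: set[str] = set()
    queue = deque([updated])
    while queue:
        current = queue.popleft()
        for step in dependents.get(current, ()):
            if step not in cascade:
                cascade.add(step)
                queue.append(step)
    return cascade


dag = {
    "data://meadow/health/latest/flunet": {"snapshot://health/latest/flunet.csv"},
    "data://garden/health/latest/flunet": {"data://meadow/health/latest/flunet"},
    "data://grapher/health/latest/flunet": {"data://garden/health/latest/flunet"},
}
# Updating the snapshot cascades through meadow, garden and grapher.
```

The same inverted-DAG walk is what an `ignore` option (see the proposed `scheduler.yml` fields below) would prune entries from.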
Examples (todo)

Look for current examples of regular updates, and describe how we currently tackle them:
Proposed solution
- Use `latest` for data steps (`meadow`, `garden` and `grapher`). Could also just use `latest` for Snapshot, too.
- Add a `scheduler.yml`, which defines all snapshots that need regular updates and how to update these.
- Keep the snapshot files (`.dvc` and `.py`), regenerating the `.dvc` with up-to-date `date_accessed` and `date_published`. Some of the fields in the `.dvc` could be filled programmatically using custom-defined code snippets, e.g. "scraping of provider's site to get the publication date", etc.
- Should new snapshots use `latest` and overwrite the existing version?
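Regenerating a `.dvc` with fresh dates could be as simple as the following sketch. The field names `date_accessed` and `date_published` come from the notes above; rewriting them with a regex rather than a YAML parser is an illustrative, dependency-free shortcut, and the file layout is assumed:

```python
import re
from datetime import date
from typing import Optional


def refresh_dvc_dates(
    dvc_text: str, accessed: date, published: Optional[date] = None
) -> str:
    """Return `.dvc` YAML text with up-to-date date fields.

    Only lines like `date_accessed: 2023-01-01` are rewritten; everything
    else passes through unchanged.
    """

    def replace_field(text: str, field: str, value: date) -> str:
        # Match the whole line, keeping indentation and the field name.
        pattern = rf"(?m)^(\s*{field}:\s*).*$"
        return re.sub(pattern, rf"\g<1>{value.isoformat()}", text)

    out = replace_field(dvc_text, "date_accessed", accessed)
    if published is not None:
        out = replace_field(out, "date_published", published)
    return out
```

A custom per-snapshot snippet (e.g. one that scrapes the provider's site for the publication date) would compute the `published` argument before this rewrite runs.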
All in all, we could have a `scheduler.yml` file with one entry per snapshot, where:

- `hour`: Hour to run the scheduled snapshot update. Defaults to e.g. 6 AM UTC.
- `minute`: If more granularity is needed (e.g. update at 6:30 AM UTC), use this field. Defaults to 0.
- `day_week`: Day of the week to run the update. Defaults to 1 (i.e. Monday).
- `ignore`: The old version of the snapshot should be replaced in all downstream dependencies, except for those listed under this field. For example, an update to `snapshot://excess_mortality/*/excess_mortality.csv` would not update the dependencies of `data://some/2024-01-01/random_step`, which would continue to use an old version.
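A hypothetical `scheduler.yml` combining those fields might look like this. The entries and cadences are invented for illustration; only the `excess_mortality` and `random_step` URIs come from the `ignore` description above:

```yaml
# Hypothetical scheduler.yml sketch — entries are illustrative.
"snapshot://health/latest/flunet.csv":
  hour: 6        # 6 AM UTC (default)
  day_week: 1    # Monday (default)
"snapshot://climate/latest/wildfires.csv":
  hour: 10       # their endpoint times out early in the morning
  minute: 30
"snapshot://excess_mortality/*/excess_mortality.csv":
  hour: 6
  ignore:
    # this step keeps using its old snapshot version
    - data://some/2024-01-01/random_step
```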
Open questions

(please add here)

- Should we keep using `latest`? Should we version each new snapshot?
- When should updates to `scheduler.yml` happen?