
Add cron-like scheduling into the ETL for regular updates #3016

Closed

lucasrodes opened this issue Jul 25, 2024 · 7 comments

@lucasrodes (Member) commented Jul 25, 2024

The ETL was conceived to maintain and publish datasets that are updated yearly or less often. This covers most of our work, but in some instances we need to update datasets more frequently.

Currently, such frequently-updated datasets rely on custom workarounds. Instead, we should have a stable, general solution that covers all of them.

This will become more relevant as we migrate most of our COVID-19 work to the ETL.

Why

  • A clear, easy-to-use mechanism to set up regular updates
  • Closer to the ETL codebase, so better integration
  • Documented and made part of the ETL docs

Technical notes

  • Today, we have a nightly pipeline (Automatic dataset updates) that fetches data for a few regularly updated datasets (e.g. flunet, covid, wildfires)
    • All these datasets use the latest version, so when their latest snapshot is updated, it triggers a cascade of updates all the way to grapher and the site
  • Historically, for covid scraping, we had a very frequent cron job that triggered a script, which in turn scheduled different sub-jobs for different times of the day

Examples (todo)

Look for current examples of regular updates, and describe how we currently tackle them:

  • Excess Mortality
  • Flunet

Proposed solution

  • (unclear) Version only snapshots, and use version latest for the data steps (meadow, garden and grapher). We could also just use latest for the snapshot, too.
  • Have a scheduler file scheduler.yml, which defines all snapshots that need regular updates and how to update these.
  • The scheduler would be loaded and read every hour or so and decide if anything needs to be executed.
  • Execution would mean the following (a rough code sketch follows this list):
    1. Obtain the new snapshot and compare it with the latest snapshot.
    2. If they are the same, stop. If they differ, create a new snapshot version (.dvc and .py).
      • Edit the .dvc with an up-to-date date_accessed and date_published. Some of the fields in the .dvc could be filled programmatically using custom-defined code snippets, e.g. "scrape the provider's site to get the publication date".
      • Run the snapshot & update the data in S3.
      • Question: Should the new snapshot have a different version (date), or use latest and overwrite the existing version?
    3. Update the DAG (if applicable). If the snapshot version has changed (date), we need to update downstream dependencies. Optionally, we could list the downstream dependencies we don't want to update in the scheduler YAML file.
    4. Run downstream dependencies according to the DAG.
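
A minimal sketch of these four steps in Python. Every helper below is a hypothetical stub for illustration, not an existing ETL function:

# Sketch of the proposed execution steps. All helpers are hypothetical
# placeholders, stubbed out so the module is self-contained.
import hashlib
from pathlib import Path

def fetch_from_provider(uri: str) -> bytes:
    """Step 1: download the current upstream file for this snapshot URI."""
    raise NotImplementedError  # provider-specific

def create_snapshot_version(uri: str, data: bytes) -> str:
    """Step 2: write a new .dvc/.py pair with fresh date_accessed and
    date_published, upload the data to S3, and return the new version."""
    raise NotImplementedError

def update_dag_and_run(uri: str, version: str) -> None:
    """Steps 3-4: point downstream DAG entries at the new version
    (skipping anything listed under `ignore`) and rebuild them."""
    raise NotImplementedError

def update_snapshot(uri: str, latest_path: Path) -> None:
    new_data = fetch_from_provider(uri)
    old_md5 = hashlib.md5(latest_path.read_bytes()).hexdigest()
    if hashlib.md5(new_data).hexdigest() == old_md5:
        return  # identical to the latest snapshot: stop here
    version = create_snapshot_version(uri, new_data)
    update_dag_and_run(uri, version)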

All in all, we could have a scheduler.yml file like

snapshot://who/*/flunet.csv:
  day_week: 5

snapshot://excess_mortality/*/excess_mortality.csv:
  day_week: 5
  ignore:
    - data://some/2024-01-01/random_step

where

  • hour: Hour to run the scheduled snapshot update. Defaults to e.g. 6 AM UTC.
  • minute: Use this field if more granularity is needed (e.g. update at 6:30 AM UTC). Defaults to 0.
  • day_week: Day of the week to run the update. Defaults to 1 (i.e. Monday).
  • ignore: The old version of the snapshot should be replaced in all downstream dependencies, except for those listed under this field. In the example above, an update to snapshot://excess_mortality/*/excess_mortality.csv will not update the dependencies of data://some/2024-01-01/random_step, which will continue to use the old version.
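
Putting the pieces together, the hourly scheduler pass could look roughly like this. The field names and defaults follow the description above; update_snapshot stands in for the hypothetical routine sketched earlier:

# Sketch of the hourly scheduler pass over scheduler.yml.
from datetime import datetime, timezone

import yaml  # pyyaml

DEFAULTS = {"hour": 6, "minute": 0, "day_week": 1}

def due_now(cfg: dict, now: datetime) -> bool:
    cfg = {**DEFAULTS, **cfg}
    # An hourly pass can only honour day_week and hour; a real
    # implementation would track last-run timestamps to support the
    # minute field and to avoid double runs within the same hour.
    return now.isoweekday() == cfg["day_week"] and now.hour == cfg["hour"]

def update_snapshot(uri: str, ignore: list[str]) -> None:
    raise NotImplementedError  # stand-in for the execution sketch above

def scheduler_pass(path: str = "scheduler.yml") -> None:
    now = datetime.now(timezone.utc)
    with open(path) as f:
        schedule = yaml.safe_load(f)
    for uri, cfg in schedule.items():
        cfg = cfg or {}
        if due_now(cfg, now):
            update_snapshot(uri, ignore=cfg.get("ignore", []))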

Open questions

(please add here)

  • Snapshot versioning: Do we want to allow snapshots to be versioned with latest? Should we version each new snapshot?
  • Scheduler execution: Where should the reading and executing of scheduler.yml happen?

@lucasrodes (Member, Author) commented Jul 25, 2024

Note from Mojmir: This is currently managed from Buildkite, with some bash scripts.

@lucasrodes (Member, Author) commented Jul 25, 2024

@Marigold is testing some tooling for the ETL (Prefect), so that we can orchestrate some processes. If it works, it might be easier to use it for automated updates.

Consequently, the roadmap and progress on this issue might change soon.

@Marigold (Collaborator) commented

Thanks for writing this up! You're right that the current way of running bash scripts through Buildkite is becoming a bottleneck. For instance, the wildfires update would have to run at a different time, because their endpoint seems to time out early in the morning (weird, right?).

Rather than reinventing the wheel, we could try Prefect for scheduling tasks (which would likely be run by Buildkite).
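
For illustration, a scheduled flow could look roughly like this, assuming Prefect 2's serve API (the flow name, body, and cron schedule here are made up):

# Minimal Prefect sketch; illustrative only.
from prefect import flow

@flow
def update_flunet():
    ...  # fetch snapshot, compare with latest, rebuild downstream steps

if __name__ == "__main__":
    # serve() registers the flow with a cron schedule; Prefect then
    # triggers it, here every Friday at 06:00 UTC.
    update_flunet.serve(name="flunet-weekly", cron="0 6 * * 5")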

Using latest for frequently updated datasets has worked great for us so far. We've never needed older step versions.

@larsyencken (Collaborator) commented Aug 1, 2024

From triage discussion:

  • In principle we don't need to run every hour, it's enough to check once a day
  • We have two things we care about:
    • The ability to check if new data is available
    • The ability to create a snapshot from that data (e.g. overwriting latest)
    • We should use date_accessed on snapshots as a source of truth
  • To do this for a bunch of data providers, we would need custom code for each provider to check for an update
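
For illustration, one shape that per-provider check could take (this assumes the provider sends an HTTP Last-Modified header; many providers would instead need scraping or an API call):

# Sketch of per-provider update-check code, using date_accessed as the
# source of truth. URL handling is illustrative, not an existing ETL API.
from datetime import date
from email.utils import parsedate_to_datetime

import requests

def has_new_data(url: str, date_accessed: date) -> bool:
    resp = requests.head(url, timeout=30)
    last_modified = resp.headers.get("Last-Modified")
    if last_modified is None:
        return True  # can't tell; err on the side of re-fetching
    return parsedate_to_datetime(last_modified).date() > date_accessed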

A maximal version

Put these into the ETL as a special step type, and put the scripts in a standardised place.

snapshot://a/b/VERSION/c:
  - upstream://a/b/c

We would need a protocol for these steps, which basically lets you ask if a step is out of date and lets you run it, with a standard place for the code for those steps. Maybe this would mess with our URI scheme, though; we'd have to check.
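
A sketch of such a protocol (all names here are hypothetical placeholders):

# Each upstream step can report whether it is stale and refresh itself.
from typing import Protocol

class UpstreamStep(Protocol):
    def is_out_of_date(self) -> bool:
        """True if the provider has data newer than our latest snapshot
        (e.g. judged from date_accessed)."""
        ...

    def run(self) -> None:
        """Fetch the new data, e.g. overwriting the latest snapshot."""
        ...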

A minimal version

Cron, basically, for ad-hoc scripts inside the ETL repo. (Buildkite / Prefect / etc.)

Open questions

  • Could we by default enrol every snapshot-generating script to be re-run, e.g. daily or weekly?

@larsyencken (Collaborator) commented

Put this on the agenda for next Monday's data architecture chat.

@pabloarosado (Contributor) commented

Maybe we should discuss this during the offsite, in person (à la Chart Diff session).

@larsyencken changed the title from "Automate regular updates" to "Add cron-like scheduling into the ETL for regular updates" on Oct 24, 2024
@larsyencken (Collaborator) commented

Could be solved as a side-effect of #3339 if we did that project.

Closing this one for now, but feel free to re-open if it becomes urgent again.

@larsyencken closed this as not planned on Oct 24, 2024