
Add cron-like scheduling into the ETL for regular updates #3016

Closed

lucasrodes opened this issue Jul 25, 2024 · 7 comments

@lucasrodes (Member) commented Jul 25, 2024

The ETL was conceived to maintain and publish datasets that are updated yearly or less often. This covers most of our work, but in some instances we need to update datasets more frequently.

Currently, such frequently-updated datasets rely on custom workarounds. Instead, we should have a stable, general solution that covers all of them.

This will become more relevant as we migrate most of our COVID-19 work to the ETL.

Why

  • A clear, easy-to-use mechanism to set up regular updates
  • Closer to the ETL codebase, so better integration
  • Documented and made part of the ETL docs

Technical notes

  • Today, we have a nightly pipeline (Automatic dataset updates) that fetches data for a few regularly updated datasets (e.g. flunet, covid, wildfires)
    • All these datasets use the latest version, so when their latest snapshot is updated, it triggers a cascade of updates all the way to grapher and the site
  • Historically, for covid scraping, we had a very frequent cron job that triggered a script, which in turn scheduled different sub-jobs for different times of the day

Examples (todo)

Look for current examples of regular updates, and describe how we currently tackle them:

  • Excess Mortality
  • Flunet

Proposed solution

  • (unclear) Version only snapshots, and use version latest for the data steps (meadow, garden and grapher). We could also just use latest for the snapshot, too.
  • Have a scheduler file scheduler.yml, which defines all snapshots that need regular updates and how to update these.
  • The scheduler would be loaded and read every hour or so and decide if anything needs to be executed.
  • Execution would mean the following (a rough code sketch follows this list):
    1. Obtain the new snapshot and compare it with the latest snapshot.
    2. If they are the same, stop. If they differ, create a new snapshot version (.dvc and .py).
      • Edit the .dvc with an up-to-date date_accessed and date_published. Some of the fields in the .dvc could be filled programmatically using custom-defined code snippets, e.g. "scrape the provider's site to get the publication date".
      • Run the snapshot & update the data in S3.
      • Question: Should the new snapshot have a different version (date), or use latest and overwrite the existing version?
    3. Update the DAG (if applicable). If the snapshot version has changed (date), we need to update downstream dependencies. Optionally, we could list the downstream dependencies we don't want to update in the scheduler YAML file.
    4. Run downstream dependencies according to the DAG.
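
A minimal sketch of these four steps in Python. Every helper below is a hypothetical stub for illustration, not an existing ETL function:

# Sketch of the proposed execution steps. All helpers are hypothetical
# placeholders, stubbed out so the module is self-contained.
import hashlib
from pathlib import Path

def fetch_from_provider(uri: str) -> bytes:
    """Step 1: download the current upstream file for this snapshot URI."""
    raise NotImplementedError  # provider-specific

def create_snapshot_version(uri: str, data: bytes) -> str:
    """Step 2: write a new .dvc/.py pair with fresh date_accessed and
    date_published, upload the data to S3, and return the new version."""
    raise NotImplementedError

def update_dag_and_run(uri: str, version: str) -> None:
    """Steps 3-4: point downstream DAG entries at the new version
    (skipping anything listed under `ignore`) and rebuild them."""
    raise NotImplementedError

def update_snapshot(uri: str, latest_path: Path) -> None:
    new_data = fetch_from_provider(uri)
    old_md5 = hashlib.md5(latest_path.read_bytes()).hexdigest()
    if hashlib.md5(new_data).hexdigest() == old_md5:
        return  # identical to the latest snapshot: stop here
    version = create_snapshot_version(uri, new_data)
    update_dag_and_run(uri, version)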

All in all, we could have a scheduler.yml file like

snapshot://who/*/flunet.csv:
  day_week: 5

snapshot://excess_mortality/*/excess_mortality.csv:
  day_week: 5
  ignore:
    - data://some/2024-01-01/random_step

where

  • hour: Hour to run the scheduled snapshot update. Defaults to e.g. 6 AM UTC.
  • minute: Use this field if more granularity is needed (e.g. update at 6:30 AM UTC). Defaults to 0.
  • day_week: Day of the week to run the update. Defaults to 1 (i.e. Monday).
  • ignore: The old version of the snapshot should be replaced in all downstream dependencies, except for those listed under this field. In the example above, an update to snapshot://excess_mortality/*/excess_mortality.csv will not update the dependencies of data://some/2024-01-01/random_step, which will continue to use the old version.
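
Putting the pieces together, the hourly scheduler pass could look roughly like this. The field names and defaults follow the description above; update_snapshot stands in for the hypothetical routine sketched earlier:

# Sketch of the hourly scheduler pass over scheduler.yml.
from datetime import datetime, timezone

import yaml  # pyyaml

DEFAULTS = {"hour": 6, "minute": 0, "day_week": 1}

def due_now(cfg: dict, now: datetime) -> bool:
    cfg = {**DEFAULTS, **cfg}
    # An hourly pass can only honour day_week and hour; a real
    # implementation would track last-run timestamps to support the
    # minute field and to avoid double runs within the same hour.
    return now.isoweekday() == cfg["day_week"] and now.hour == cfg["hour"]

def update_snapshot(uri: str, ignore: list[str]) -> None:
    raise NotImplementedError  # stand-in for the execution sketch above

def scheduler_pass(path: str = "scheduler.yml") -> None:
    now = datetime.now(timezone.utc)
    with open(path) as f:
        schedule = yaml.safe_load(f)
    for uri, cfg in schedule.items():
        cfg = cfg or {}
        if due_now(cfg, now):
            update_snapshot(uri, ignore=cfg.get("ignore", []))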

Open questions

(please add here)

  • Snapshot versioning: Do we want to allow snapshots to be versioned with latest? Should we version each new snapshot?
  • Scheduler execution: Where should the reading and executing of scheduler.yml happen?

@lucasrodes (Member, Author) commented Jul 25, 2024

Note from Mojmir: This is currently managed from Buildkite, with some bash scripts.

@lucasrodes (Member, Author) commented Jul 25, 2024

@Marigold is testing some tooling for the ETL (Prefect), so that we can orchestrate some processes. If it works, it might be easier to use it for automated updates.

Consequently, the roadmap and progress on this issue might change soon.

@Marigold (Collaborator) commented

Thanks for writing this up! You're right that the current way of running bash scripts through Buildkite is becoming a bottleneck. For instance, the wildfires update would have to run at a different time, because their endpoint seems to time out early in the morning (weird, right?).

Rather than reinventing the wheel, we could try Prefect for scheduling tasks (which would likely be run by Buildkite).
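
For illustration, a scheduled flow could look roughly like this, assuming Prefect 2's serve API (the flow name, body, and cron schedule here are made up):

# Minimal Prefect sketch; illustrative only.
from prefect import flow

@flow
def update_flunet():
    ...  # fetch snapshot, compare with latest, rebuild downstream steps

if __name__ == "__main__":
    # serve() registers the flow with a cron schedule; Prefect then
    # triggers it, here every Friday at 06:00 UTC.
    update_flunet.serve(name="flunet-weekly", cron="0 6 * * 5")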

Using latest for frequently updated datasets has worked great for us so far. We've never needed older step versions.

@larsyencken (Collaborator) commented Aug 1, 2024

From triage discussion:

  • In principle we don't need to run every hour, it's enough to check once a day
  • We have two things we care about:
    • The ability to check if new data is available
    • The ability to create a snapshot from that data (e.g. overwriting latest)
    • We should use date_accessed on snapshots as a source of truth
  • To do this for a bunch of data providers, we would need custom code for each provider to check for an update
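
For illustration, one shape that per-provider check could take (this assumes the provider sends an HTTP Last-Modified header; many providers would instead need scraping or an API call):

# Sketch of per-provider update-check code, using date_accessed as the
# source of truth. URL handling is illustrative, not an existing ETL API.
from datetime import date
from email.utils import parsedate_to_datetime

import requests

def has_new_data(url: str, date_accessed: date) -> bool:
    resp = requests.head(url, timeout=30)
    last_modified = resp.headers.get("Last-Modified")
    if last_modified is None:
        return True  # can't tell; err on the side of re-fetching
    return parsedate_to_datetime(last_modified).date() > date_accessed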

A maximal version

Put these into the ETL as a special step type, and put the scripts in a standardised place.

snapshot://a/b/VERSION/c:
  - upstream://a/b/c

We would need a protocol for these steps, which basically lets you ask if a step is out of date and lets you run it, with a standard place for the code for those steps. Maybe this would mess with our URI scheme, though; we'd have to check.
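
A sketch of such a protocol (all names here are hypothetical placeholders):

# Each upstream step can report whether it is stale and refresh itself.
from typing import Protocol

class UpstreamStep(Protocol):
    def is_out_of_date(self) -> bool:
        """True if the provider has data newer than our latest snapshot
        (e.g. judged from date_accessed)."""
        ...

    def run(self) -> None:
        """Fetch the new data, e.g. overwriting the latest snapshot."""
        ...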

A minimal version

Cron, basically, for ad-hoc scripts inside the ETL repo. (Buildkite / Prefect / etc.)

Open questions

  • Could we by default enrol every snapshot-generating script to be re-run, e.g. daily or weekly?

@larsyencken (Collaborator) commented

Put this on the agenda for next Monday's data architecture chat.

@pabloarosado (Contributor) commented

Maybe we should discuss this during the offsite, in person (à la Chart Diff session).

@larsyencken changed the title from "Automate regular updates" to "Add cron-like scheduling into the ETL for regular updates" on Oct 24, 2024
@larsyencken (Collaborator) commented

Could be solved as a side-effect of #3339 if we did that project.

Closing this one for now, but feel free to re-open if it becomes urgent again.

@larsyencken closed this as not planned on Oct 24, 2024