Investigate orchestration options #4

Closed
ian-r-rose opened this issue Dec 8, 2022 · 8 comments

Comments

@ian-r-rose
Member

ian-r-rose commented Dec 8, 2022

We intend to do most transformation in our data warehouse(s) via dbt, but there is still need for scheduled loads of custom data. So while a full DAG framework might be overkill at this point, some sort of workflow orchestration tool is worthwhile.

Some requirements:

  1. Execute arbitrary Python, bash scripts
  2. User-interface for viewing scheduled runs, failures, metadata
  3. Notifications upon failures
  4. Retries
  5. Encrypted secrets

Nice-to-have:

  1. Execute R scripts
  2. Python virtualenv isolation
  3. Managed option (I don't want to personally keep some servers up)
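
To make requirements 1, 3, and 4 concrete, here is a minimal sketch in plain Python of the behavior an orchestration tool would otherwise provide: run an arbitrary command (a Python or bash script), retry on failure, and call a notification hook if every attempt fails. The names `run_with_retries` and `on_failure` are hypothetical and do not correspond to any orchestrator's API.

```python
import subprocess
import sys
import time
from typing import Callable, Optional, Sequence


def run_with_retries(
    cmd: Sequence[str],
    retries: int = 2,
    delay_seconds: float = 0.0,
    on_failure: Optional[Callable[[Sequence[str], int], None]] = None,
) -> int:
    """Run cmd up to retries + 1 times and return the number of attempts used.

    If every attempt exits non-zero, call on_failure(cmd, last_returncode)
    (when provided) and raise RuntimeError.
    """
    returncode = 0
    for attempt in range(1, retries + 2):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return attempt
        if attempt <= retries:
            # Wait before the next attempt; a real tool would likely back off.
            time.sleep(delay_seconds)
    if on_failure is not None:
        on_failure(cmd, returncode)
    raise RuntimeError(f"{list(cmd)} failed after {retries + 1} attempts")
```

For example, `run_with_retries(["bash", "load.sh"], retries=3, delay_seconds=60)` would run a (hypothetical) `load.sh` up to four times. The UI, scheduling, and secrets requirements are the parts that are harder to hand-roll.
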
@ian-r-rose
Member Author

Orchestration options

This is a fairly basic comparison of several of the available options for workflow orchestration. There are also some more closed-off proprietary solutions from AWS, GCP, etc. (e.g., Glue) that I haven't really evaluated. They often try to be low-code and serverless. I tend to be a bit more skittish around single-cloud options.

A high-level note: Our general approach right now is to prefer an ELT-style workflow over an ETL-style one. So while these orchestration tools allow for quite complex DAGs, I would tend to treat them as simple hosted services for running small loading scripts on a schedule. This means that some of the neat, more advanced features of these tools would be less relevant to our (initial) deployments.
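
As an illustration of what a "small loading script" might look like in an ELT setup, here is a minimal sketch that loads a CSV into a raw table without transforming it, leaving any cleanup to dbt. SQLite stands in for the warehouse so the example is self-contained; the function name `load_raw` and the table name in the usage note are hypothetical.

```python
import csv
import io
import sqlite3


def load_raw(conn: sqlite3.Connection, table: str, csv_text: str) -> int:
    """Replace `table` with the CSV contents, storing every column as text.

    No cleaning or type casting happens here: in an ELT workflow that is
    left to downstream transformations. Returns the number of rows loaded.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    # Full refresh: drop and recreate the raw table on every run.
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)
```

Calling `load_raw(sqlite3.connect(":memory:"), "raw_buildings", "id,height\n1,10.5\n")` loads one row. The orchestrator's job is then limited to running a script like this on a schedule and reporting failures.
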

Airflow

Airflow is the oldest and most popular orchestration tool that is still widely used today.

Pros

  • Extremely widely-used
  • Lots of resources available online
  • Lots of third-party extensions to connect with other systems
  • Low-risk

Cons

  • Somewhat old-fashioned Python style
  • Difficult to deploy (hence managed offerings)
  • Confusing UI
  • Confusing design decisions around timestamps, execution model

There are at least three managed offerings of Airflow available from major vendors:

GCP Cloud Composer

  • Allows autoscaling
  • Fairly simple integration of GKE pod tasks
  • Easy to stand up if you are already on GCP
  • Decent local dev tooling

AWS MWAA

  • Easy to stand up if you are already on AWS
  • Decent local dev tooling
  • Kubernetes and ECS operators appear to be available, but require additional infrastructure.

Astronomer

  • What appears to be pretty good local dev tooling
  • Easy kubernetes pod integration

Prefect Cloud

  • Available as an open source project, they sell the hosted "cloud" option (similar to dbt, others)
  • More modern, pythonic idioms. Would likely result in higher-quality code.
  • Less mature ecosystem
  • Higher risk

Dagster Cloud

  • Available as an open source project, they sell the hosted "cloud" option (similar to dbt, others)
  • More modern, pythonic idioms. Would likely result in higher-quality code.
  • Neat declarative approach to defining data assets. Probably results in more robust pipelines!
  • Less mature ecosystem
  • Higher risk

Feature Comparisons

| Option | Managed | Environment isolation | Python | bash | R | Secrets Manager | Notifications | Web UI | Local Dev Tooling | dbt integration |
|---|---|---|---|---|---|---|---|---|---|---|
| GCP Composer (Airflow) | Yes | GKE pods or python virtualenvs | Yes | Yes | Sort of | Yes | Hand-rolled | Yes | Yes | Yes |
| AWS MWAA (Airflow) | Yes | ECS, EKS, or virtualenv | Yes | Yes | Sort of | Yes | Hand-rolled | Yes | Yes | Yes |
| Astronomer (Airflow) | Yes | KubernetesPodOperator, virtualenvs | Yes | Yes | Sort of | Yes | Hand-rolled | Yes | Yes | Yes |
| Self-managed Airflow | No | KubernetesPodOperator, virtualenv | Yes | Yes | Sort of | Yes | Hand-rolled | Yes | Hand-rolled | Yes |
| Prefect Cloud | Yes | Kubernetes pods and Docker containers | Yes | Yes | Coming soon? | Yes | Yes | Yes | Yes | Yes |
| Dagster Cloud | Yes | Kubernetes pods, ECS, Docker containers | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |

Would be particularly interested in hearing @jasonlally's thoughts about the above.

@jasonlally
Contributor

@ian-r-rose - this is great! Thanks for putting this together.

Some thoughts:

  • Definitely yes on managed service
  • I like that dagster and prefect don't need hand rolled notifications
  • Maybe worth a team discussion on the risks around dagster and prefect... I like the idea of more modern, pythonic code, so maybe there are mitigations we can discuss
  • Regardless, maybe it's worth a bakeoff (looks like Dagster has a 30-day trial, Prefect has a personal license with limits, Astronomer I think has a trial, and you already have GCP Composer set up)

What do you think of doing an intentional test drive on a simple workflow? If we do that, let's hash out a test plan before starting any assessment.

@jasonlally
Contributor

As a side note, I found Astronomer to be really great last I used it. They really focus on developer experience and have some nice useful cli tools that wrap around airflow and make managing deployments easier.

@ian-r-rose
Member Author

> I like that dagster and prefect don't need hand rolled notifications

I don't want to stress it too much, since I think setting up AWS SES or sendgrid for an airflow deployment isn't too much work. That said, I was a bit surprised to read that Astronomer doesn't do this for you, as it seems like a fairly simple value-add. But perhaps I'm not understanding their docs correctly.
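
For a sense of scale, a hand-rolled failure notification could be sketched roughly as follows, assuming the mail provider exposes an SMTP endpoint that accepts STARTTLS and username/password login. The function names, subject format, and all connection parameters are hypothetical placeholders, not taken from any provider's or orchestrator's documentation.

```python
import smtplib
from email.message import EmailMessage
from typing import Sequence


def build_failure_email(
    task_name: str, error: str, sender: str, recipients: Sequence[str]
) -> EmailMessage:
    """Build (but do not send) a plain-text failure notification."""
    msg = EmailMessage()
    msg["Subject"] = f"[orchestration] Task failed: {task_name}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(f"Task {task_name!r} failed with:\n\n{error}")
    return msg


def send_failure_email(
    msg: EmailMessage, host: str, port: int, username: str, password: str
) -> None:
    """Send msg over SMTP with STARTTLS. Host and credentials are assumptions
    to be supplied by the deployment, ideally from the secrets manager."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)
```

Keeping message construction separate from sending makes the notification content easy to test without a mail server.
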

> As a side note, I found Astronomer to be really great last I used it. They really focus on developer experience and have some nice useful cli tools that wrap around airflow and make managing deployments easier.

That's good to hear. I previously self-managed an airflow deployment, and it was a significant amount of work. The developer experience around these managed offerings has improved a lot.

> What do you think of doing an intentional test drive on a simple workflow? If we do that, let's hash out a test plan before starting any assessment.

Sure, I think that would be instructive. One idea for a test plan could be to load the Microsoft building footprints dataset, as there are a few things that make it a moderately challenging job which might flush out issues:

  1. It likely involves a custom software environment (i.e., something with the GDAL stack)
  2. It's on the larger side (i.e., may require provisioning larger instances, possibly even horizontal scalability)
  3. It changes somewhat regularly
  4. We may want several destinations (BQ, snowflake, parquet in S3)
  5. We know that (at least) DOF is interested in this dataset
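
Point 4 (several destinations) can be sketched as a simple fan-out, where a failure in one destination is recorded without blocking the others. This is only an illustration of the shape of the test job: `load_to_destinations` and the destination names are hypothetical, and the writers are plain callables rather than real BigQuery, Snowflake, or S3 clients.

```python
from typing import Callable, Dict, Iterable, List


def load_to_destinations(
    records: Iterable[dict],
    destinations: Dict[str, Callable[[List[dict]], None]],
) -> Dict[str, str]:
    """Write the same records to every destination and report per-destination status.

    Each destination is a callable that accepts the list of records. Errors are
    caught and recorded so that one failing destination does not stop the rest.
    """
    rows = list(records)  # materialize once so every writer sees the same data
    results: Dict[str, str] = {}
    for name, write in destinations.items():
        try:
            write(rows)
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    return results
```

In a real orchestrator each destination would more likely be its own task, so that retries and notifications apply per destination.
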

@jasonlally
Contributor

jasonlally commented Jan 4, 2024

This is relevant to @melanie-logan working on data loading options.

@ian-r-rose should we close this since Melanie is working on evaluation now? Or I guess not since we still want to do more eval on orchestration. We can keep open, but makes sense to me to reassign.

@ian-r-rose
Member Author

Makes sense to me!

@melanie-logan

Yes, I can do a separate orchestration Eval. Thanks!

@ram-kishore-odi
Contributor

Closing this as complete for now. We may revisit orchestration options at some point, but would probably start with a new set of tasks that are relevant. Please refer to this ticket for additional information - #378
