I'm copying some ideas from #138 on things to look out for when evaluating how easy this approach is to use. Along each of these dimensions, I would score a tool from "makes me happy" to "makes me sad".
Custom software environments
The GIS stack often requires custom software environments. That is to say, whatever default image the tool uses will not do the job, so we'll need to provide our own image. We'll want to evaluate how easy it is to build our own image and provide it to the orchestration tool.
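To make this concrete, here is a minimal sketch of what "providing our own image" looks like if the orchestrator is Airflow-based (as with MWAA, discussed below): point a task at a custom image via the `KubernetesPodOperator`. The registry path and entrypoint are hypothetical, and the import path varies a bit across provider versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG("gis_custom_image_demo", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    reproject = KubernetesPodOperator(
        task_id="reproject_parcels",
        name="reproject-parcels",
        # Custom image with the GIS stack (GDAL, geopandas, etc.) baked in;
        # the registry path here is a placeholder.
        image="123456789012.dkr.ecr.us-west-2.amazonaws.com/gis-stack:latest",
        # Hypothetical module entrypoint inside the image.
        cmds=["python", "-m", "pipelines.reproject"],
        get_logs=True,
    )
```

The evaluation question is then: how much ceremony sits between `docker build` and this operator actually pulling the image (registry auth, image allowlists, and so on)?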
Compute resources
Do they provide a compute cluster or other batch-like service for running custom jobs? Or will we need to bring our own K8s/ECS/Batch cluster? If the latter, how easy is it to set up?
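For the bring-your-own-compute case, one useful yardstick is how little glue it takes to submit work to an existing cluster. A sketch against AWS Batch, assuming a recent Amazon provider (the queue and job definition names are hypothetical, and the `container_overrides` parameter was called `overrides` in older provider releases):

```python
from airflow.providers.amazon.aws.operators.batch import BatchOperator

generate_tiles = BatchOperator(
    task_id="generate_tiles",
    job_name="generate-tiles",
    job_queue="gis-batch-queue",        # hypothetical, pre-provisioned queue
    job_definition="gis-stack-jobdef",  # hypothetical job definition
    container_overrides={"command": ["python", "-m", "pipelines.tiles"]},
)
```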
Integration with AWS/GCP services
If we do have services running in our own cloud account, how easy is it to interact with them? Are there nice user interfaces for securely providing service account credentials?
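In Airflow, for comparison, this usually reduces to named connections: credentials live in the metastore or a secrets backend, and DAG code only references a connection ID. A minimal sketch (the connection ID is Airflow's default; the bucket name is a placeholder):

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_report(**context):
    # "aws_default" resolves to credentials held in Airflow's connection
    # store or secrets backend -- nothing sensitive appears in DAG code.
    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_string("ok", key="reports/latest.txt", bucket_name="my-gis-bucket")
```

A tool scores "makes me happy" here if rotating those credentials never requires touching pipeline code.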
API access and CI integration
How painful is it to deploy new versions of a pipeline? Are there CI tools or custom GitHub Actions for doing this? Ideally, deploy-on-merge is simple to set up.
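One concrete probe: Airflow 2 exposes a stable REST API, so a CI job can trigger a smoke-test run after deploying. A sketch with `requests` (the host, DAG ID, and credentials are placeholders; MWAA in particular fronts this with its own token exchange):

```python
import requests

resp = requests.post(
    "https://airflow.example.com/api/v1/dags/gis_pipeline/dagRuns",
    auth=("ci-user", "ci-password"),  # placeholder credentials
    json={"conf": {"triggered_by": "ci"}},
    timeout=30,
)
resp.raise_for_status()
```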
dbt Integration
This is one of the most important dimensions: all of the major orchestrators have been implementing some level of integration with dbt (which provides its own DAG abstraction). How pleasant are these integrations to use?
Is it easy to trigger a dbt run from the API? In the test, we'll want to kick off dbt after the initial load.
Is it easy to have a task triggered by the completion of a dbt run? In the test, we'll want to send off an email or ad-hoc report after the dbt run is complete (see the sketch at the end of this section).
Is there visibility into a dbt run while it is going on? Can we see:
The structure of a dbt DAG
The current status of the dbt DAG as it runs
Any error states or warnings from dbt
Logs from the dbt CLI
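As a concrete probe of the first two questions above, the smallest useful test is a three-task DAG: load, then dbt, then notify. A sketch using plain operators rather than any vendor-specific dbt integration (the loader module, project path, and email address are hypothetical, and `EmailOperator` assumes SMTP is configured for the deployment):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

with DAG("dbt_smoke_test", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    load = BashOperator(
        task_id="initial_load",
        bash_command="python -m pipelines.load",  # hypothetical loader
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/project",  # hypothetical path
    )
    notify = EmailOperator(
        task_id="send_report",
        to="data-team@example.com",  # placeholder address
        subject="dbt run complete",
        html_content="The dbt run finished.",
    )
    load >> dbt_run >> notify
```

With this baseline in hand, the visibility questions become: can the tool expand the `dbt_run` node into dbt's model-level DAG, or does it stay a black-box bash task?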
As part of cagov/data-orchestration#4, I'd like to stand up AWS infrastructure for MWAA using Terraform.
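Not the Terraform itself, but for orientation, the equivalent boto3 call shows the moving parts an MWAA environment needs; every identifier below is a placeholder, and the real plan is to express the same resources in Terraform:

```python
import boto3

mwaa = boto3.client("mwaa")
mwaa.create_environment(
    Name="data-orchestration-dev",
    SourceBucketArn="arn:aws:s3:::my-mwaa-dags-bucket",  # S3 bucket holding DAGs
    DagS3Path="dags",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/mwaa-execution-role",
    NetworkConfiguration={
        "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # private subnets
        "SecurityGroupIds": ["sg-cccc3333"],
    },
)
```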