GCP composer proof-of-concept #5

Closed
ian-r-rose opened this issue Dec 9, 2022 · 3 comments

@ian-r-rose
Member

One option for orchestration is GCP's managed Airflow service, Composer (AWS has one as well). As part of #4, we can stand up an instance and kick the tires on it. I've used Airflow before, but never as a managed service. Keeping the Airflow server up and running was really annoying, though, so a managed service is attractive!

As a testing ground, I'd like to investigate:

  1. Dev workflow with GitHub
  2. Best practices with developing in a test environment
  3. Environment isolation (e.g., using Airflow's PythonVirtualenvOperator or KubernetesPodOperator)

A good first target for data artifacts could be loading some geospatial reference data into our data warehouse(s). The benefits team is particularly interested in the Healthy Places Index and CalEnviroScreen.

@ian-r-rose
Member Author

Test project here: https://github.com/cagov/data-orchestration.

It currently has two DAGs which load two datasets from the California Geo portal:

  1. California incorporated cities
  2. California counties
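
To make the shape of such a load concrete, here is a minimal sketch of the reshaping step a DAG task might perform before writing to a warehouse. It is an illustration only, not the code in the test project: it assumes the portal serves GeoJSON, and the sample payload, the `NAME` field, and the `features_to_rows` helper are all hypothetical. It also uses only the standard library so it runs anywhere, whereas a real task would more likely use geopandas.

```python
import json

# Hypothetical sample payload standing in for a portal download.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"NAME": "Alameda"},
     "geometry": {"type": "Point", "coordinates": [-122.27, 37.77]}},
    {"type": "Feature",
     "properties": {"NAME": "Fresno"},
     "geometry": {"type": "Point", "coordinates": [-119.77, 36.75]}}
  ]
}
"""


def features_to_rows(geojson_text):
    """Flatten a GeoJSON FeatureCollection into one dict per feature."""
    collection = json.loads(geojson_text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")

    rows = []
    for feature in collection["features"]:
        # Copy the attribute columns, tolerating a null "properties" member.
        row = dict(feature.get("properties") or {})
        # Keep the geometry as GeoJSON text so it can be stored in a plain
        # string column and converted by the warehouse later if needed.
        row["geometry"] = json.dumps(feature.get("geometry"), sort_keys=True)
        rows.append(row)
    return rows


rows = features_to_rows(SAMPLE_GEOJSON)
for row in rows:
    print(row["NAME"], row["geometry"])
```

Keeping the geometry as text is a deliberate simplification here; it avoids any dependency on GDAL/GEOS in the sketch itself.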

Some early thoughts:

  • There is a brand new CLI tool from Google which makes local development somewhat tolerable. It basically manages a Docker workflow using their official images, and stands up local Airflow deployments based on them.
  • I'm currently just installing custom Python packages (specifically geopandas and the GDAL/GEOS stack) into the main environment. This works right now, but it won't scale as more DAGs bring their own dependencies. Eventually we'd probably need a Kubernetes pod operator.
  • This is (so far) way easier than managing an Airflow deployment manually.
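
For readers unfamiliar with the isolation idea mentioned in the second bullet, the sketch below shows the general principle using only the Python standard library. It is not Airflow code and not how Composer or the Airflow operators are implemented: it simply creates a throwaway virtual environment and runs a task body in that environment's interpreter, so the task's dependencies would live apart from the main environment. The `run_isolated` helper and `TASK_SOURCE` string are hypothetical names, and the environment is created without pip to keep the example fast.

```python
import json
import os
import subprocess
import tempfile
import venv
from pathlib import Path

# Hypothetical task body. A real task would import its own pinned
# dependencies here; this one only reports which environment it runs in.
TASK_SOURCE = """
import json, sys
print(json.dumps({"prefix": sys.prefix, "base_prefix": sys.base_prefix}))
"""


def run_isolated(task_source):
    """Run task_source in a fresh virtual environment and return its JSON output."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "task-env"
        # Assumption: no packages are needed, so pip is skipped. Symlinks are
        # used on POSIX, matching the default of `python -m venv`.
        venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))

        if os.name == "nt":
            python = env_dir / "Scripts" / "python.exe"
        else:
            python = env_dir / "bin" / "python"

        result = subprocess.run(
            [str(python), "-c", task_source],
            capture_output=True,
            text=True,
            check=True,
        )
        return json.loads(result.stdout)


info = run_isolated(TASK_SOURCE)
print(info["prefix"])
```

The task reports a `sys.prefix` inside the temporary environment, which differs from `sys.base_prefix`. A pod-based operator takes the same idea further by isolating system libraries as well as Python packages.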

@ian-r-rose
Member Author

Closing this as complete. There is certainly follow-up work that could be done, though it should be scoped as individual issues, including:

  1. CI/CD
  2. A better testing story
  3. Email + Slack notifications

@ian-r-rose
Member Author

I should also note that this is not free: a minimal version is ~$150 per month, and a deployment that we transition into actual infrastructure would cost a few times that. I expect most of the options (#4) would have similar stories, though I haven't priced them out.

ian-r-rose added a commit that referenced this issue Apr 10, 2023