This repository hosts code, documentation, and configuration for ETL/ELT orchestration. It is currently structured as a Google Composer (managed Airflow) project, though that is subject to change.
We intend to do most data transformation in our data warehouses (i.e. ELT rather than ETL). There is still a need to do some custom loading, however! This project is intended to perform relatively simple loads into data warehouse tables, and complex DAGs should be regarded with suspicion.
Most DAGs should have a single node.
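As a rough illustration, a single-node DAG for a simple load might look like the sketch below. This is a hedged example, not code from this repository: the DAG id, schedule, and load logic are hypothetical placeholders.

```python
# A minimal sketch of a single-node DAG: one task does a simple load into a
# warehouse table, leaving transformation to the warehouse (ELT).
# All names here are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_simple_load():
    @task
    def load_to_warehouse():
        # Pull source data and append it to a warehouse table.
        # A real load would use a provider hook or client library here.
        ...

    load_to_warehouse()


example_simple_load()
```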
Basic commands are driven from a `justfile`.

Project variables are loaded from a `.env` file. To start, create your `.env` file and populate the variables therein:

```sh
cp .env.sample .env
```
Create a local dev environment (which uses `composer-dev` and `docker` under the hood):

```sh
just create-local-env
```
Start your dev environment:

```sh
just start-local-env
```

Then open a web browser and navigate to `localhost:8081` to view the Airflow UI.
You can view DAGs and their history from the UI, as well as trigger new test runs.
You can also run Airflow commands from the command line. A couple of common ones are in the `justfile`:

```sh
just list-local-dags     # list the DAGs that Airflow sees
just trigger-local $DAG  # trigger a specific DAG for testing
```
DAGs which use a `KubernetesPodOperator` are more difficult to test, as doing so requires a local Kubernetes setup. An easier approach is to use this guide, which copies your test DAGs to a test directory in the GCS bucket and runs them in the real cluster. This should be done with care, as you could interfere with the production environment.
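If you follow that approach, copying a DAG into a test prefix of the environment's GCS bucket might look roughly like the snippet below. This is only a sketch: the bucket name and paths are hypothetical, and the guide may use `gsutil` or the Cloud Console instead.

```python
# Sketch: upload a local DAG file to a test/ prefix of the Composer bucket.
# The bucket name and object paths are hypothetical placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-composer-environment-bucket")
bucket.blob("dags/test/my_dag.py").upload_from_filename("dags/my_dag.py")
```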
A workflow for testing a Kubernetes-based DAG locally is:
- Publish a new docker image. By default it will publish an image to the Google Artifact Registry with a `dev` tag (this can be customized using the `PUBLISH_IMAGE_TAG` environment variable):

  ```sh
  just publish
  ```

- Set the `DEFAULT_IMAGE_TAG` environment variable to your new tag (`dev` by default).
- Restart your local environment to pick up the new development image:

  ```sh
  just restart-local-env
  ```

- Trigger the task you want to run:

  ```sh
  just test-task <dag-id> <task-id>
  ```
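For context, a task backed by a `KubernetesPodOperator` might look roughly like the sketch below, which shows how the image tag from `DEFAULT_IMAGE_TAG` could feed into the operator. The image path, DAG id, and command are hypothetical placeholders, and the exact import path depends on your `cncf-kubernetes` provider version.

```python
# A minimal sketch of a KubernetesPodOperator-based DAG that picks up the image
# tag from DEFAULT_IMAGE_TAG (dev by default). The image path, DAG id, and
# command are hypothetical placeholders.
import os
from datetime import datetime

from airflow import DAG
# Older provider versions expose this operator from
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

IMAGE_TAG = os.environ.get("DEFAULT_IMAGE_TAG", "dev")

with DAG(
    dag_id="example_pod_load",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    KubernetesPodOperator(
        task_id="load",
        name="load",
        image=f"us-docker.pkg.dev/example-project/example-repo/loader:{IMAGE_TAG}",
        cmds=["python", "-m", "loader"],
    )
```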
This project deploys on merge to `main` using the `deploy` workflow.
If you need to make a manual deployment, here is a basic workflow:
- Publish a new docker image with a `prod` tag:

  ```sh
  PUBLISH_IMAGE_TAG=prod just publish
  ```

- Update the Airflow environment (syncs dags and `requirements.txt` to the environment):

  ```sh
  just deploy
  ```