This sample will give you an idea of how you can use the lakeFS Airflow provider to:
- Version control your raw, intermediate and processed data.
- Link between code versions and the data generated by running them.
1. Clone the sample:

   ```bash
   git clone https://github.com/treeverse/lakeFS-samples
   cd lakeFS-samples
   git submodule init && git submodule update
   ```
2. Spin up the environment:

   ```bash
   docker-compose up
   ```
3. Browse to Airflow at http://localhost:8080/
   - User: `airflow`
   - Password: `airflow`
4. Run the `etl` DAG in Airflow.
5. Observe the results in lakeFS. Log in to the lakeFS UI at http://localhost:8000/
   - User: `AKIAIOSFOLKFSSAMPLES`
   - Password: `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
6. In the `airflow-example` repository's `main` branch you'll see the raw data alongside the transformed results.
7. Drill down on any path to view the CSV file and get an understanding of the transform process. (A programmatic way to browse the same objects is sketched after this list.)
8. Click on the Commits tab to see the commit history for the branch. Change the branch from the dropdown menu to see the history for each branch.
9. From the Branches tab, note that each transform (by event type, month, and user) was isolated on its own branch and only merged back into `main` once it had completed successfully.
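As an aside, you can browse the same objects programmatically rather than through the UI. Here's a minimal sketch using the high-level lakeFS Python SDK (`pip install lakefs`; the SDK is not part of this sample), assuming your endpoint and credentials are already configured, for example via `~/.lakectl.yaml`:

```python
# Minimal sketch: list what the DAG wrote to the main branch.
# Assumes the lakeFS Python SDK ("pip install lakefs") and that the
# endpoint/credentials are configured (e.g. via ~/.lakectl.yaml).
import lakefs

repo = lakefs.repository("airflow-example")  # repository created by the sample
for obj in repo.branch("main").objects():
    print(obj.path)
```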
The DAG itself (in this folder) follows a branch-per-transform pattern: create a branch from `main`, run the transform on it, commit, and merge back.
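As a rough sketch of that pattern (not the sample's actual DAG), using the operators from the `airflow-provider-lakefs` package; the DAG id, branch name, and connection id below are illustrative, and parameter names may differ between provider versions:

```python
# Sketch of a branch-per-transform DAG using the lakeFS Airflow provider.
from datetime import datetime

from airflow import DAG
from lakefs_provider.operators.create_branch_operator import LakeFSCreateBranchOperator
from lakefs_provider.operators.commit_operator import LakeFSCommitOperator
from lakefs_provider.operators.merge_operator import LakeFSMergeOperator

default_args = {
    "owner": "lakeFS",
    "repo": "airflow-example",        # repository the sample writes to
    "branch": "transform_by_event",   # illustrative transform branch name
    "lakefs_conn_id": "conn_lakefs",  # Airflow connection holding lakeFS credentials
}

with DAG(
    dag_id="etl_sketch",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    # 1. Isolate the transform on its own branch, forked from main.
    create_branch = LakeFSCreateBranchOperator(
        task_id="create_branch",
        source_branch="main",
    )

    # 2. (Transform tasks that write to the new branch would run here.)

    # 3. Commit the transformed data on the transform branch.
    commit = LakeFSCommitOperator(
        task_id="commit_transform",
        msg="Transformed raw data by event type",
    )

    # 4. Merge back into main only once the transform has succeeded.
    merge = LakeFSMergeOperator(
        task_id="merge_to_main",
        source_ref="transform_by_event",
        destination_branch="main",
        msg="Merge transformed data into main",
    )

    create_branch >> commit >> merge
```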
The Docker Compose file in this folder extends the one at the root of this repository, adding the containers needed to run Airflow.
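The file itself isn't reproduced here; as a rough, illustrative sketch of the overlay pattern (service names, image tag, and settings are placeholders, not the sample's actual configuration):

```yaml
# Illustrative overlay only -- see this folder's docker-compose.yml for the
# real configuration. An overlay is layered on the root file with e.g.:
#   docker compose -f ../../docker-compose.yml -f docker-compose.yml up
services:
  airflow-webserver:
    image: apache/airflow:2.6.3     # placeholder tag
    ports:
      - "8080:8080"                 # Airflow UI used in the steps above
  airflow-scheduler:
    image: apache/airflow:2.6.3
    command: scheduler
```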