lakeFS + Github + Airflow - example

This sample shows how you can use the lakeFS Airflow provider to:

  • Version control your raw, intermediate and processed data.
  • Link code versions to the data generated by running them.
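
The second point, linking code versions to data, is commonly done by attaching metadata to the lakeFS commit that records the data. A minimal sketch of building such metadata (the keys below are illustrative assumptions, not a schema required by lakeFS):

```python
# Sketch: tie a code version to the data it produced via commit metadata.
# The keys (git_sha, airflow_dag_run_id) are illustrative, not a fixed
# schema mandated by lakeFS or the Airflow provider.

def build_commit_metadata(git_sha: str, dag_run_id: str) -> dict:
    """Metadata dict to pass along with a lakeFS commit."""
    return {
        "git_sha": git_sha,                 # the code revision that ran
        "airflow_dag_run_id": dag_run_id,   # the DAG run that wrote the data
    }

meta = build_commit_metadata("0a1b2c3", "manual__2024-01-01T00:00:00")
print(meta["git_sha"])  # → 0a1b2c3
```

With metadata like this on every commit, the lakeFS commit log answers "which code produced this data?" directly.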

Run Instructions

  1. Clone the sample:

    git clone https://github.com/treeverse/lakeFS-samples
    cd lakeFS-samples
    git submodule init && git submodule update
    
    
  2. Spin up the environment: docker-compose up

  3. Browse to Airflow at http://localhost:8080/.

    • User: airflow
    • Password: airflow
  4. Run the etl DAG in Airflow.

  5. Observe the results in lakeFS. Log in to the lakeFS UI at http://localhost:8000/

    • User: AKIAIOSFOLKFSSAMPLES
    • Password: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  6. In the airflow-example repository's main branch, you'll see the raw data alongside the transformed results.

  7. Drill down on any path to view the CSV file and get an understanding of the transformation process.

  8. Click on the Commits tab to see the commit history for the branch. Change the branch from the dropdown menu to see the history for each branch.

  9. From the Branches tab, note that each transform (by event type / month / user) was isolated on its own branch and only merged back into main once it completed successfully.
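
The isolate-then-merge pattern in step 9 can be sketched with plain Python, with dicts standing in for lakeFS branches (a model of the workflow only — in real lakeFS, branching is zero-copy and merging is a metadata operation):

```python
# Sketch of the branch-per-transform pattern: each transform runs on its
# own branch, so main only ever sees the result of a successful run.
# Plain dicts stand in for lakeFS branches; this models the workflow,
# it does not call lakeFS.

def run_isolated(main: dict, transform) -> dict:
    branch = dict(main)          # "create branch" (zero-copy in real lakeFS)
    try:
        transform(branch)        # the transform writes only to its branch
    except Exception:
        return main              # failure: main is untouched
    return {**main, **branch}    # success: merge the branch back into main

state = {"raw/events.csv": "raw data"}
state = run_isolated(state, lambda b: b.update({"by_month/2024-01.csv": "agg"}))

def bad(b):
    raise RuntimeError("transform failed")

state = run_isolated(state, bad)
print(sorted(state))  # main holds the raw data plus the successful transform only
```

The failed transform leaves no trace on main, which is exactly what the Branches tab in step 9 demonstrates.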

Here's the DAG that's used:
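
The DAG screenshot isn't reproduced here, but its rough shape — a branch / transform / merge chain for each of the three transforms — can be sketched as a dependency graph with Python's stdlib graphlib (task names below are hypothetical, inferred from the steps above, not taken from the actual DAG):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical task dependencies: each transform gets its own
# create-branch -> transform -> merge chain, per the pattern above.
deps = {}
for t in ("event_type", "month", "user"):
    deps[f"transform_by_{t}"] = {f"branch_{t}"}
    deps[f"merge_{t}"] = {f"transform_by_{t}"}

order = list(TopologicalSorter(deps).static_order())
print(len(order))  # 9 tasks: branch, transform, merge for each transform
```

Any valid schedule creates a transform's branch before the transform runs, and merges only after it finishes — the ordering Airflow enforces for the real DAG.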

Containers

The Docker Compose in this folder extends the one in the root of this repository to add the necessary containers for Airflow: