LakeFS with Apache DataFusion using Python
This guide assumes that you have already installed the following:
- Docker
- Docker Compose
- Python (v3.12.3 used for development)
- uv (install guide)
Create a virtual environment and install the requirements:
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
The LakeFS service runs in Docker containers. Start it by running the following command in the terminal:
docker compose up --build
Confirm that LakeFS is running by opening http://localhost:8000 in a browser.
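If you would rather check from a script than from a browser, a minimal sketch using only the Python standard library could look like the following. The helper name `is_reachable` is hypothetical and not part of this repository. It only tells you that some HTTP server answers at the address, not that LakeFS is fully set up.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at `url`, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar errors.
        return False


# Example: is_reachable("http://localhost:8000")
```

A `False` result right after `docker compose up` may simply mean the containers are still starting, so retrying after a few seconds is reasonable.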
The dataset used in this example is the NYC Taxi Trip dataset. Download the .parquet file from here: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet
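The download can also be scripted. The sketch below uses only the standard library; `filename_from_url` and `download` are hypothetical helper names introduced for illustration, and the file is saved to the current directory by default.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

DATASET_URL = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_2022-02.parquet"
)


def filename_from_url(url: str) -> str:
    """Return the last path segment of the URL, used as the local file name."""
    return PurePosixPath(urlparse(url).path).name


def download(url: str, dest_dir: str = ".") -> Path:
    """Stream the file at `url` into `dest_dir` and return the local path."""
    dest = Path(dest_dir) / filename_from_url(url)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so the whole file is never held in memory.
        shutil.copyfileobj(response, out)
    return dest


# Example: download(DATASET_URL)
```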
Once downloaded, go to the demo repository and upload the dataset to the following path: taxi_data/input/
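As an alternative to uploading through the web UI, the file can be pushed through the S3-compatible gateway that LakeFS exposes, where the bucket is the repository name and the object key starts with the branch name. The sketch below assumes that `boto3` is installed (it may not be listed in requirements.txt), that the branch is `main`, and that you substitute your own credentials for the placeholders. The function names are hypothetical.

```python
from pathlib import Path


def lakefs_object_key(branch: str, path: str, filename: str) -> str:
    """Build the gateway object key: '<branch>/<path>/<filename>'."""
    return "/".join(part.strip("/") for part in (branch, path, filename))


def upload_to_lakefs(
    local_file: str,
    repo: str = "demo",
    branch: str = "main",  # assumption: the demo repository uses 'main'
    path: str = "taxi_data/input/",
) -> str:
    # boto3 is a third-party package; it is imported here so that the key
    # helper above stays usable without it.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:8000",
        aws_access_key_id="<your-access-key-id>",  # placeholder
        aws_secret_access_key="<your-secret-access-key>",  # placeholder
    )
    key = lakefs_object_key(branch, path, Path(local_file).name)
    s3.upload_file(local_file, repo, key)  # bucket = repository name
    return key


# Example: upload_to_lakefs("yellow_tripdata_2022-02.parquet")
```

An object uploaded this way lands on the branch as an uncommitted change, in the same way as a file uploaded through the UI.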
All that remains is to run the application with:
python app.py
Once the application has finished, its output is printed in the terminal window. It also writes a set of files to taxi_data/output/ in the demo repository, which you can browse at http://localhost:8000/repositories/demo/objects?ref=main&path=taxi_data%2Foutput%2F.
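The output files can also be listed from a script instead of the browser. The sketch below assumes that the LakeFS REST API serves object listings at `/api/v1/repositories/<repo>/refs/<ref>/objects/ls`, accepts HTTP Basic authentication with your access key pair, and returns a JSON body with a `results` list. Treat the endpoint and response shape as assumptions to verify against the API documentation for your LakeFS version. The function names are hypothetical.

```python
import base64
import json
import urllib.request
from urllib.parse import quote, urlencode


def list_objects_url(base_url: str, repo: str, ref: str, prefix: str) -> str:
    """Build the (assumed) object listing URL for a repository, ref and prefix."""
    query = urlencode({"prefix": prefix})
    return (
        f"{base_url.rstrip('/')}/api/v1/repositories/{quote(repo)}"
        f"/refs/{quote(ref)}/objects/ls?{query}"
    )


def list_output_paths(access_key_id: str, secret_access_key: str) -> list[str]:
    """Return the object paths under taxi_data/output/ on the main branch."""
    url = list_objects_url(
        "http://localhost:8000", "demo", "main", "taxi_data/output/"
    )
    token = base64.b64encode(
        f"{access_key_id}:{secret_access_key}".encode()
    ).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: each entry in 'results' carries a 'path' field.
    return [obj["path"] for obj in payload.get("results", [])]


# Example: list_output_paths("<your-access-key-id>", "<your-secret-access-key>")
```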