The app writes a Parquet file to two separate destinations: an S3 bucket and a lakeFS server (using the S3 gateway).
- Before running the app, make sure you placed the `hadoop-router-fs-0.1.0-jar-with-dependencies.jar` file you built under your `$SPARK_HOME/jars` directory.
- `cd sample_app`
- Run `pip install -r requirements.txt`.
- Set the `lakefs_` and `aws_` variables in the code to reflect correct information. Alternatively, set the `LAKEFS_` and `AWS_` environment variables as specified in the code.
- Set (or don't) the `repo_name`, `branch_name` and `path` variables in the code (make sure that if `path` is set, it ends with a `/`).
- Set (or don't) the `replace_prefix` variable in the code to reflect the mapped prefix. Alternatively, set the `S3A_REPLACE_PREFIX` environment variable as specified in the code.
- Set the `s3a_replace_prefix` variable in the code to reflect the mapped prefix. Make sure this is the same value as `replace_prefix` in the `spark_client` file. Alternatively, set the `S3A_REPLACE_PREFIX` environment variable as specified in the code.
- Set the `s3_bucket_s3a_prefix` variable in the code to reflect the S3 bucket namespace to which the Parquet file will be written. This should be a valid and accessible S3 bucket prefix (a configuration sketch follows this list).
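For orientation, here is a minimal sketch of how these values might be wired together. The credential variable names and the environment variables other than `S3A_REPLACE_PREFIX` are hypothetical placeholders; the authoritative names and defaults are the ones that appear in `main.py` and the `spark_client` file.

```python
import os

# Illustration only -- the real variable names and defaults live in main.py / spark_client.
# Values fall back to environment variables, mirroring the pattern described above.
lakefs_access_key = os.getenv("LAKEFS_ACCESS_KEY_ID", "<lakefs-access-key>")  # hypothetical name
aws_access_key = os.getenv("AWS_ACCESS_KEY_ID", "<aws-access-key>")           # hypothetical name

repo_name = "example-repo"    # lakeFS repository (example value)
branch_name = "main"          # lakeFS branch (example value)
path = "tables/demo/"         # if set, must end with a "/"

# Mapped prefix: keep this identical to replace_prefix in the spark_client file.
s3a_replace_prefix = os.getenv("S3A_REPLACE_PREFIX", "s3a://example-mapped-prefix/")

# S3 bucket namespace that the Parquet file is written to directly (example value).
s3_bucket_s3a_prefix = "s3a://example-bucket/router-fs-demo/"
```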
Run `spark-submit --packages "org.apache.hadoop:hadoop-aws:<your.hadoop.version>" main.py`, replacing `<your.hadoop.version>` with the Hadoop version bundled with your Spark distribution.
After running the app, you should notice that the same Parquet file was written to two different locations (to the lakeFS server and directly to the configured S3 bucket) using a single mapping scheme.
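In Spark terms, the dual write boils down to something like the sketch below. The paths, column names, and DataFrame contents are illustrative placeholders rather than the app's actual values; the path under the mapped prefix is resolved to the lakeFS S3 gateway, while the bucket prefix goes directly to S3.

```python
from pyspark.sql import SparkSession

# Illustration only: write the same small DataFrame as Parquet to two destinations.
spark = SparkSession.builder.appName("router-fs-sample").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Placeholder path under the mapped prefix -- resolved via the mapping to the lakeFS server.
lakefs_output = "s3a://example-mapped-prefix/tables/demo/"
# Placeholder path under the plain S3 bucket namespace -- written directly to S3.
s3_output = "s3a://example-bucket/tables/demo/"

df.write.mode("overwrite").parquet(lakefs_output)
df.write.mode("overwrite").parquet(s3_output)
```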