The app writes a Parquet file to two separate destinations: an S3 bucket and a lakeFS server (using the S3 gateway).
- Before running the app, make sure you placed the `hadoop-router-fs-0.1.0-jar-with-dependencies.jar` file you built under your `$SPARK_HOME/jars` directory.
- `cd sample_app`
- Run `pip install -r requirements.txt`.
- Set the `lakefs_` and `aws_` variables in the code to reflect correct information. Alternatively, set the `LAKEFS_` and `AWS_` environment variables as specified in the code.
- Set (or don't) the `repo_name`, `branch_name` and `path` variables in the code (make sure that if `path` is set, it ends with a `/`).
- Set (or don't) the `replace_prefix` variable in the code to reflect the mapped prefix. Alternatively, set the `S3A_REPLACE_PREFIX` environment variable as specified in the code.
- Set the `s3a_replace_prefix` variable in the code to reflect the mapped prefix. Make sure this is the same value as `replace_prefix` in the `spark_client` file. Alternatively, set the `S3A_REPLACE_PREFIX` environment variable as specified in the code.
- Set the `s3_bucket_s3a_prefix` variable in the code to reflect the S3 bucket namespace to which the Parquet file will be written. This should be a valid and accessible S3 bucket prefix (a configuration sketch follows this list).
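For orientation, here is a minimal sketch of how these values might be wired together. The credential variable names and the environment variables other than `S3A_REPLACE_PREFIX` are hypothetical placeholders; the authoritative names and defaults are the ones that appear in `main.py` and the `spark_client` file.

```python
import os

# Illustration only -- the real variable names and defaults live in main.py / spark_client.
# Values fall back to environment variables, mirroring the pattern described above.
lakefs_access_key = os.getenv("LAKEFS_ACCESS_KEY_ID", "<lakefs-access-key>")  # hypothetical name
aws_access_key = os.getenv("AWS_ACCESS_KEY_ID", "<aws-access-key>")           # hypothetical name

repo_name = "example-repo"    # lakeFS repository (example value)
branch_name = "main"          # lakeFS branch (example value)
path = "tables/demo/"         # if set, must end with a "/"

# Mapped prefix: keep this identical to replace_prefix in the spark_client file.
s3a_replace_prefix = os.getenv("S3A_REPLACE_PREFIX", "s3a://example-mapped-prefix/")

# S3 bucket namespace that the Parquet file is written to directly (example value).
s3_bucket_s3a_prefix = "s3a://example-bucket/router-fs-demo/"
```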
Run `spark-submit --packages "org.apache.hadoop:hadoop-aws:<your.hadoop.version>" main.py`, replacing `<your.hadoop.version>` with the Hadoop version bundled with your Spark distribution.
After running the app, you should notice that the same Parquet file was written to two different locations (to the lakeFS server and directly to the configured S3 bucket) using a single mapping scheme.
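In Spark terms, the dual write boils down to something like the sketch below. The paths, column names, and DataFrame contents are illustrative placeholders rather than the app's actual values; the path under the mapped prefix is resolved to the lakeFS S3 gateway, while the bucket prefix goes directly to S3.

```python
from pyspark.sql import SparkSession

# Illustration only: write the same small DataFrame as Parquet to two destinations.
spark = SparkSession.builder.appName("router-fs-sample").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Placeholder path under the mapped prefix -- resolved via the mapping to the lakeFS server.
lakefs_output = "s3a://example-mapped-prefix/tables/demo/"
# Placeholder path under the plain S3 bucket namespace -- written directly to S3.
s3_output = "s3a://example-bucket/tables/demo/"

df.write.mode("overwrite").parquet(lakefs_output)
df.write.mode("overwrite").parquet(s3_output)
```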