Improve the taxi-trip-execute.sh script
The current performance of the script is very poor.
Additionally, there are 6 copies of this script.

This commit improves performance by doing the copies S3-to-S3 in the
background rather than uploading 100 copies over the local network.
It also removes the redundant copies of the script.
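
For reference, a minimal sketch of the pattern the commit adopts; the bucket and key names below are illustrative, not taken from the repo:

```bash
# Seed one object, then fan out server-side (S3-to-S3) copies in the
# background; these copies never transit the local network.
aws s3 cp local.parquet "s3://my-bucket/input/copy-0.parquet"
for i in $(seq 1 100); do
  aws s3 cp "s3://my-bucket/input/copy-0.parquet" \
    "s3://my-bucket/input/copy-${i}.parquet" &
done
wait  # block until every background copy has finished
```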
raykrueger committed Apr 9, 2024
1 parent c50c181 commit df4e9fe
Showing 8 changed files with 16 additions and 225 deletions.
@@ -11,13 +11,11 @@

# Script usage ./taxi-trip-execute my-s3-bucket us-west-2

# Validate that the user passes two arguments; if not, print a usage message
if [ $# -ne 2 ]; then
  echo "Usage: $0 <S3_BUCKET> <REGION>"
  exit 1
fi


S3_BUCKET="$1"
REGION="$2"

@@ -31,15 +29,21 @@ aws s3 cp pyspark-taxi-trip.py s3://${S3_BUCKET}/taxi-trip/scripts/ --region ${R

# Copy Test Input data to S3 bucket
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -O "input/yellow_tripdata_2022-0.parquet"
aws s3 cp "input/yellow_tripdata_2022-0.parquet" s3://${S3_BUCKET}/input/yellow_tripdata_2022-0.parquet

pids=()

# Making duplicate copies to increase the size of the data.
max=100
for (( i=1; i <= $max; ++i ))
do
  cp -rf "input/yellow_tripdata_2022-0.parquet" "input/yellow_tripdata_2022-${i}.parquet"
  aws s3 cp s3://${S3_BUCKET}/input/yellow_tripdata_2022-0.parquet s3://${S3_BUCKET}/input/yellow_tripdata_2022-${i}.parquet &
  pids+=($!)
done

aws s3 sync "input/" ${INPUT_DATA_S3_PATH}
for pid in "${pids[@]}"; do
  wait $pid
done

# Delete a local input folder
rm -rf input
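
One design note on the wait loop above: waiting on each collected PID, rather than issuing a bare `wait`, exposes each copy's exit status. A sketch of how that could be used to fail fast; the error handling here is an illustration, not part of the commit:

```bash
for pid in "${pids[@]}"; do
  # `wait <pid>` returns the exit status of that background job
  if ! wait "$pid"; then
    echo "background S3 copy (pid ${pid}) failed" >&2
    exit 1
  fi
done
```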

5 files deleted: the duplicate copies of taxi-trip-execute.sh removed by this commit.

website/docs/blueprints/data-analytics/_taxi_trip_exec.md (4 changes: 2 additions & 2 deletions)
@@ -5,5 +5,5 @@ it in order to increase the size a bit. This will take a bit of time and will
require a relatively fast internet connection.

```bash
-./taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
+${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
```
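
For context, the updated snippet assumes `DOEKS_HOME` points at your clone of the repo and `S3_BUCKET` at the bucket created by the Terraform stack; a hypothetical invocation with placeholder values:

```bash
export DOEKS_HOME="$HOME/data-on-eks"   # assumed path to your clone
export S3_BUCKET="my-example-bucket"    # assumed bucket name
"${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh" "${S3_BUCKET}" us-west-2
```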
@@ -2,6 +2,9 @@
sidebar_position: 3
sidebar_label: Observability Spark on EKS
---

import TaxiTripExec from './_taxi_trip_exec.md';

# Observability Spark on EKS

## Introduction
@@ -15,11 +18,10 @@ We will reuse the previous Spark on Operator example. Please follow [this link](
let's navigate to an example folder under spark-k8s-operator and run the shell script to upload the data and the PySpark script to the S3 bucket created by Terraform above.
```bash
cd data-on-eks/analytics/terraform/spark-k8s-operator/examples/cluster-autoscaler/nvme-ephemeral-storage

# replace \<S3_BUCKET\> with your S3 bucket and \<REGION\> with your region, then run
./taxi-trip-execute.sh
```

<TaxiTripExec />

## Spark Web UI
When you submit a Spark application, a Spark context is created, which gives you the [Spark Web UI](https://sparkbyexamples.com/spark/spark-web-ui-understanding/) to monitor the execution of the application. Monitoring includes the following.
- Spark configurations used
