chore: Taxi trip script improvements (awslabs#500)
askulkarni2 authored Apr 10, 2024 · 2 parents c50c181 + d922ead · commit 13b053d
Showing 8 changed files with 16 additions and 225 deletions.
@@ -11,17 +11,15 @@

# Script usage: ./taxi-trip-execute.sh my-s3-bucket us-west-2

# Validate that the user passes two arguments; if not, print a usage message
if [ $# -ne 2 ]; then
  echo "Usage: $0 <S3_BUCKET> <REGION>"
  exit 1
fi


S3_BUCKET="$1"
REGION="$2"

INPUT_DATA_S3_PATH="s3://${S3_BUCKET}/taxi-trip/input/"
INPUT_DATA_S3_PATH="s3://${S3_BUCKET}/taxi-trip/input"

# Create a local input folder
mkdir input
@@ -31,15 +29,21 @@
aws s3 cp pyspark-taxi-trip.py s3://${S3_BUCKET}/taxi-trip/scripts/ --region ${REGION}

# Copy Test Input data to S3 bucket
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -O "input/yellow_tripdata_2022-0.parquet"
aws s3 cp "input/yellow_tripdata_2022-0.parquet" ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-0.parquet

pids=()

# Making duplicate copies to increase the size of the data.
max=100
max=25
for (( i=1; i <= $max; ++i ))
do
cp -rf "input/yellow_tripdata_2022-0.parquet" "input/yellow_tripdata_2022-${i}.parquet"
aws s3 cp ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-0.parquet ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-${i}.parquet &
pids+=($!)
done

aws s3 sync "input/" ${INPUT_DATA_S3_PATH}
for pid in "${pids[@]}"; do
  wait $pid
done

# Delete a local input folder
rm -rf input
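
The loop above replaces the old local `cp` plus `aws s3 sync` approach: each duplicate is now created with a server-side `aws s3 cp` run as a background job, and the script records each job's PID so it can wait for all uploads to finish before cleaning up. A minimal sketch of that fan-out-and-wait pattern, with a hypothetical `make_copy` function standing in for the real `aws s3 cp` call:

```bash
#!/bin/bash
# Sketch of the fan-out-and-wait pattern used by the script above.
# make_copy is a hypothetical stand-in for `aws s3 cp <src> <dst>`.
make_copy() {
  echo "copying $1"
  sleep 1
}

pids=()
for (( i=1; i <= 5; ++i )); do
  make_copy "copy-${i}" &   # launch each copy as a background job
  pids+=($!)                # $! is the PID of the most recent background job
done

# Block until every background job has exited.
for pid in "${pids[@]}"; do
  wait "$pid"
done
echo "all copies finished"
```

Waiting on each recorded PID, rather than calling a bare `wait`, leaves room to check per-job exit statuses later if upload failures need to be surfaced.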

5 files were deleted in this commit (contents not shown).

2 changes: 1 addition & 1 deletion website/docs/blueprints/data-analytics/_taxi_trip_exec.md
@@ -5,5 +5,5 @@
it in order to increase the size a bit. This will take a bit of time and will
require a relatively fast internet connection.

```bash
./taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
```
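
For example, assuming `DOEKS_HOME` points at a local clone of the data-on-eks repository and that `my-doeks-bucket` and `us-west-2` are placeholders for your own bucket and region:

```bash
export DOEKS_HOME=~/data-on-eks      # hypothetical clone location
export S3_BUCKET=my-doeks-bucket     # placeholder: use your own bucket name
${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} us-west-2
```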
@@ -2,6 +2,9 @@
sidebar_position: 3
sidebar_label: Observability Spark on EKS
---

import TaxiTripExec from './_taxi_trip_exec.md';

# Observability Spark on EKS

## Introduction
@@ -15,11 +18,10 @@
We will reuse the previous Spark on Operator example. Please follow [this link](
let's navigate to one example folder under spark-k8s-operator and run the shell script to upload the data and the PySpark script to the S3 bucket created by Terraform above.
```bash
cd data-on-eks/analytics/terraform/spark-k8s-operator/examples/cluster-autoscaler/nvme-ephemeral-storage

# replace \<S3_BUCKET\> with your S3 bucket and \<REGION\> with your region, then run
./taxi-trip-execute.sh
```

<TaxiTripExec />

## Spark Web UI
When you submit a Spark application, a Spark context is created, which gives you the [Spark Web UI](https://sparkbyexamples.com/spark/spark-web-ui-understanding/) to monitor the execution of the application. Monitoring includes the following:
- Spark configurations used
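
On Kubernetes, the driver pod serves the Spark Web UI on port 4040 by default, so one common way to reach it while an application is running is to port-forward to the driver pod. A sketch, where the pod name `taxi-trip-driver` and namespace `spark-team-a` are hypothetical placeholders:

```bash
# Forward local port 4040 to the Spark driver's default Web UI port.
# Pod name and namespace are placeholders for your own deployment.
kubectl port-forward taxi-trip-driver 4040:4040 -n spark-team-a
# Then browse to http://localhost:4040
```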
