Improve the taxi-trip-execute.sh script
The current performance of the script is very poor.
Additionally, there are 6 copies of this script.

This commit improves performance by doing the copies S3-to-S3 in the
background rather than uploading 100 copies over the local network.
It also removes the redundant copies of the script.
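
For reference, a minimal sketch of the pattern the commit adopts; the bucket and key names below are illustrative, not taken from the repo:

```bash
# Seed one object, then fan out server-side (S3-to-S3) copies in the
# background; these copies never transit the local network.
aws s3 cp local.parquet "s3://my-bucket/input/copy-0.parquet"
for i in $(seq 1 100); do
  aws s3 cp "s3://my-bucket/input/copy-0.parquet" \
    "s3://my-bucket/input/copy-${i}.parquet" &
done
wait  # block until every background copy has finished
```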
raykrueger committed Apr 9, 2024
1 parent c50c181 commit df4e9fe
Showing 8 changed files with 16 additions and 225 deletions.
@@ -11,13 +11,11 @@

# Script usage ./taxi-trip-execute my-s3-bucket us-west-2

# Validate that the user passes two arguments; if not, print a usage message
if [ $# -ne 2 ]; then
  echo "Usage: $0 <S3_BUCKET> <REGION>"
  exit 1
fi


S3_BUCKET="$1"
REGION="$2"

@@ -31,15 +29,21 @@ aws s3 cp pyspark-taxi-trip.py s3://${S3_BUCKET}/taxi-trip/scripts/ --region ${R

# Copy Test Input data to S3 bucket
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -O "input/yellow_tripdata_2022-0.parquet"
aws s3 cp "input/yellow_tripdata_2022-0.parquet" s3://${S3_BUCKET}/input/yellow_tripdata_2022-0.parquet

pids=()

# Making duplicate copies to increase the size of the data.
max=100
for (( i=1; i <= $max; ++i ))
do
  cp -rf "input/yellow_tripdata_2022-0.parquet" "input/yellow_tripdata_2022-${i}.parquet"
  aws s3 cp s3://${S3_BUCKET}/input/yellow_tripdata_2022-0.parquet s3://${S3_BUCKET}/input/yellow_tripdata_2022-${i}.parquet &
  pids+=($!)
done

aws s3 sync "input/" ${INPUT_DATA_S3_PATH}
for pid in "${pids[@]}"; do
  wait $pid
done

# Delete a local input folder
rm -rf input
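
One design note on the wait loop above: waiting on each collected PID, rather than issuing a bare `wait`, exposes each copy's exit status. A sketch of how that could be used to fail fast; the error handling here is an illustration, not part of the commit:

```bash
for pid in "${pids[@]}"; do
  # `wait <pid>` returns the exit status of that background job
  if ! wait "$pid"; then
    echo "background S3 copy (pid ${pid}) failed" >&2
    exit 1
  fi
done
```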

5 files deleted: the duplicate copies of taxi-trip-execute.sh removed by this commit.

website/docs/blueprints/data-analytics/_taxi_trip_exec.md (4 changes: 2 additions & 2 deletions)
@@ -5,5 +5,5 @@ it in order to increase the size a bit. This will take a bit of time and will
require a relatively fast internet connection.

```bash
-./taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
+${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
```
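
For context, the updated snippet assumes `DOEKS_HOME` points at your clone of the repo and `S3_BUCKET` at the bucket created by the Terraform stack; a hypothetical invocation with placeholder values:

```bash
export DOEKS_HOME="$HOME/data-on-eks"   # assumed path to your clone
export S3_BUCKET="my-example-bucket"    # assumed bucket name
"${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh" "${S3_BUCKET}" us-west-2
```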
@@ -2,6 +2,9 @@
sidebar_position: 3
sidebar_label: Observability Spark on EKS
---

import TaxiTripExec from './_taxi_trip_exec.md';

# Observability Spark on EKS

## Introduction
@@ -15,11 +18,10 @@ We will reuse the previous Spark on Operator example. Please follow [this link](
let's navigate to an example folder under spark-k8s-operator and run the shell script to upload the data and the PySpark script to the S3 bucket created by Terraform above.
```bash
cd data-on-eks/analytics/terraform/spark-k8s-operator/examples/cluster-autoscaler/nvme-ephemeral-storage

# replace \<S3_BUCKET\> with your S3 bucket and \<REGION\> with your region, then run
./taxi-trip-execute.sh
```

<TaxiTripExec />

## Spark Web UI
When you submit a Spark application, a Spark context is created, which gives you the [Spark Web UI](https://sparkbyexamples.com/spark/spark-web-ui-understanding/) to monitor the execution of the application. Monitoring includes the following.
- Spark configurations used
