chore: Taxi trip script improvements (awslabs#500)
askulkarni2 authored Apr 10, 2024 · 2 parents c50c181 + d922ead · commit 13b053d
Showing 8 changed files with 16 additions and 225 deletions.
@@ -11,17 +11,15 @@

# Script usage: ./taxi-trip-execute.sh my-s3-bucket us-west-2

# Validate that the user passes two arguments; if not, print a usage message
if [ $# -ne 2 ]; then
  echo "Usage: $0 <S3_BUCKET> <REGION>"
  exit 1
fi


S3_BUCKET="$1"
REGION="$2"

INPUT_DATA_S3_PATH="s3://${S3_BUCKET}/taxi-trip/input/"
INPUT_DATA_S3_PATH="s3://${S3_BUCKET}/taxi-trip/input"

# Create a local input folder
mkdir input
@@ -31,15 +29,21 @@
aws s3 cp pyspark-taxi-trip.py s3://${S3_BUCKET}/taxi-trip/scripts/ --region ${REGION}

# Copy Test Input data to S3 bucket
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -O "input/yellow_tripdata_2022-0.parquet"
aws s3 cp "input/yellow_tripdata_2022-0.parquet" ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-0.parquet

pids=()

# Making duplicate copies to increase the size of the data.
max=100
max=25
for (( i=1; i <= $max; ++i ))
do
cp -rf "input/yellow_tripdata_2022-0.parquet" "input/yellow_tripdata_2022-${i}.parquet"
aws s3 cp ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-0.parquet ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-${i}.parquet &
pids+=($!)
done

aws s3 sync "input/" ${INPUT_DATA_S3_PATH}
for pid in "${pids[@]}"; do
  wait $pid
done

# Delete a local input folder
rm -rf input
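
The loop above replaces the old local `cp` plus `aws s3 sync` approach: each duplicate is now created with a server-side `aws s3 cp` run as a background job, and the script records each job's PID so it can wait for all uploads to finish before cleaning up. A minimal sketch of that fan-out-and-wait pattern, with a hypothetical `make_copy` function standing in for the real `aws s3 cp` call:

```bash
#!/bin/bash
# Sketch of the fan-out-and-wait pattern used by the script above.
# make_copy is a hypothetical stand-in for `aws s3 cp <src> <dst>`.
make_copy() {
  echo "copying $1"
  sleep 1
}

pids=()
for (( i=1; i <= 5; ++i )); do
  make_copy "copy-${i}" &   # launch each copy as a background job
  pids+=($!)                # $! is the PID of the most recent background job
done

# Block until every background job has exited.
for pid in "${pids[@]}"; do
  wait "$pid"
done
echo "all copies finished"
```

Waiting on each recorded PID, rather than calling a bare `wait`, leaves room to check per-job exit statuses later if upload failures need to be surfaced.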

5 files were deleted in this commit (contents not shown).

2 changes: 1 addition & 1 deletion website/docs/blueprints/data-analytics/_taxi_trip_exec.md
@@ -5,5 +5,5 @@
it in order to increase the size a bit. This will take a bit of time and will
require a relatively fast internet connection.

```bash
./taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
```
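
For example, assuming `DOEKS_HOME` points at a local clone of the data-on-eks repository and that `my-doeks-bucket` and `us-west-2` are placeholders for your own bucket and region:

```bash
export DOEKS_HOME=~/data-on-eks      # hypothetical clone location
export S3_BUCKET=my-doeks-bucket     # placeholder: use your own bucket name
${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} us-west-2
```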
@@ -2,6 +2,9 @@
sidebar_position: 3
sidebar_label: Observability Spark on EKS
---

import TaxiTripExec from './_taxi_trip_exec.md';

# Observability Spark on EKS

## Introduction
@@ -15,11 +18,10 @@
We will reuse the previous Spark on Operator example. Please follow [this link](
let's navigate to one example folder under spark-k8s-operator and run the shell script to upload the data and the PySpark script to the S3 bucket created by Terraform above.
```bash
cd data-on-eks/analytics/terraform/spark-k8s-operator/examples/cluster-autoscaler/nvme-ephemeral-storage

# replace \<S3_BUCKET\> with your S3 bucket and \<REGION\> with your region, then run
./taxi-trip-execute.sh
```

<TaxiTripExec />

## Spark Web UI
When you submit a Spark application, a Spark context is created, which gives you the [Spark Web UI](https://sparkbyexamples.com/spark/spark-web-ui-understanding/) to monitor the execution of the application. Monitoring includes the following:
- Spark configurations used
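
On Kubernetes, the driver pod serves the Spark Web UI on port 4040 by default, so one common way to reach it while an application is running is to port-forward to the driver pod. A sketch, where the pod name `taxi-trip-driver` and namespace `spark-team-a` are hypothetical placeholders:

```bash
# Forward local port 4040 to the Spark driver's default Web UI port.
# Pod name and namespace are placeholders for your own deployment.
kubectl port-forward taxi-trip-driver 4040:4040 -n spark-team-a
# Then browse to http://localhost:4040
```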
