Commit

Take the default duplication down to 25
At some point the duplication was bumped to 100. This takes forever to
complete with 4x4GB executors, which isn't a useful experience for
users.

With duplication at 25, the Spark job takes about 5 minutes to complete,
which is plenty of time for the user to poke around.
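
For scale: the loop in `taxi-trip-execute.sh` copies the seed file `yellow_tripdata_2022-0.parquet` to suffixes 1 through `max`, so the input prefix ends up holding `max + 1` objects. A minimal sketch of that arithmetic follows; the helper name `input_object_count` is made up for illustration and is not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the script): number of parquet objects under
# the input prefix after the duplication loop, i.e. the seed file plus copies.
input_object_count() {
  local max="$1"
  echo $(( max + 1 ))
}

echo "max=100: $(input_object_count 100) objects"
echo "max=25:  $(input_object_count 25) objects"
```

So the change takes the input set from 101 files down to 26.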
raykrueger committed Apr 9, 2024
1 parent df4e9fe commit d922ead
Showing 2 changed files with 6 additions and 6 deletions.
10 changes: 5 additions & 5 deletions analytics/scripts/taxi-trip-execute.sh
@@ -19,7 +19,7 @@ fi
 S3_BUCKET="$1"
 REGION="$2"

-INPUT_DATA_S3_PATH="s3://${S3_BUCKET}/taxi-trip/input/"
+INPUT_DATA_S3_PATH="s3://${S3_BUCKET}/taxi-trip/input"

 # Create a local input folder
 mkdir input
@@ -29,15 +29,15 @@ aws s3 cp pyspark-taxi-trip.py s3://${S3_BUCKET}/taxi-trip/scripts/ --region ${R

 # Copy Test Input data to S3 bucket
 wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -O "input/yellow_tripdata_2022-0.parquet"
-aws s3 cp "input/yellow_tripdata_2022-0.parquet" s3://${S3_BUCKET}/input/yellow_tripdata_2022-0.parquet
+aws s3 cp "input/yellow_tripdata_2022-0.parquet" ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-0.parquet

 pids=()

 # Making duplicate copies to increase the size of the data.
-max=100
+max=25
 for (( i=1; i <= $max; ++i ))
 do
-aws s3 cp s3://${S3_BUCKET}/input/yellow_tripdata_2022-0.parquet s3://${S3_BUCKET}/input/yellow_tripdata_2022-${i}.parquet &
+aws s3 cp ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-0.parquet ${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-${i}.parquet &
 pids+=($!)
 done

@@ -46,4 +46,4 @@ for pid in "${pids[@]}"; do
 done

 # Delete a local input folder
-rm -rf input
+rm -rf input
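
The copy loop in this script starts every `aws s3 cp` in the background, records its PID in `pids`, and then loops over the PIDs. The diff does not show the body of that second loop, so the sketch below assumes a plain `wait "$pid"` and adds a failure counter that the real script may not have. `fake_copy` is a stand-in for `aws s3 cp` so the snippet runs without AWS credentials, and `example-bucket` is a placeholder. The base path has no trailing slash, matching the new `INPUT_DATA_S3_PATH`, so joining with `/` yields a single separator.

```shell
#!/usr/bin/env bash
# Stand-in for `aws s3 cp SRC DST` (assumption: lets the sketch run offline).
fake_copy() {
  echo "copy $1 -> $2"
}

# Placeholder bucket; no trailing slash, so "${base}/file" has a single "/".
INPUT_DATA_S3_PATH="s3://example-bucket/taxi-trip/input"

duplicate_input() {
  local max="$1" failed=0 pids=() i pid
  for (( i = 1; i <= max; ++i )); do
    # Launch each copy in the background and remember its PID.
    fake_copy "${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-0.parquet" \
              "${INPUT_DATA_S3_PATH}/yellow_tripdata_2022-${i}.parquet" &
    pids+=("$!")
  done
  # Wait on each PID individually so a failed copy is counted.
  for pid in "${pids[@]}"; do
    wait "$pid" || failed=$(( failed + 1 ))
  done
  echo "copies failed: ${failed}"
  return "$(( failed > 0 ))"
}

duplicate_input 3
```

Waiting per PID, rather than calling a bare `wait`, is what makes each copy's exit status available; the order of the `copy` lines is not deterministic because the jobs run concurrently.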
2 changes: 1 addition & 1 deletion website/docs/blueprints/data-analytics/_taxi_trip_exec.md
@@ -6,4 +6,4 @@ require a relatively fast internet connection.

 ```bash
 ${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
-```
+```
