Upgrade to Spark 3.2.x #6
Conversation
… the outputs for one job as input for the other
…tect already does it for us
Hi @srikar, it looks great. I have some minor points for discussion.
```diff
@@ -4,14 +4,14 @@ FROM python:$PYTHON_VERSION
 USER root
 WORKDIR /opt
 RUN wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz && \
-    wget https://downloads.lightbend.com/scala/2.13.5/scala-2.13.5.tgz && \
     wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz && \
+    wget https://downloads.lightbend.com/scala/2.12.15/scala-2.12.15.tgz && \
```
Are we downgrading the version of Scala?
There are inconsistent versions of Scala in our codebase; I was just making them consistent. I am happy to bump it up to 2.13.x if we want to do that. @svishal9 @sambhav13
I would vote for keeping the Scala version at 2.13.x, since that's the version used for Spark 3.x.
```diff
     wget https://github.com/sbt/sbt/releases/download/v1.3.0/sbt-1.3.0.tgz
 RUN tar xzf jdk-8u131-linux-x64.tar.gz && \
-    tar xvf scala-2.13.5.tgz && \
     tar xvf spark-3.1.1-bin-hadoop3.2.tgz && \
+    tar xvf scala-2.12.15.tgz && \
```
Are we downgrading the version of Scala?
```diff
     tar xvf sbt-1.3.0.tgz
-ENV PATH="/opt/jdk1.8.0_131/bin:/opt/scala-2.13.5/bin:/opt/spark-3.1.1-bin-hadoop3.2/bin:/opt/sbt/bin:/opt/sbt/bin$PATH"
+ENV PATH="/opt/jdk1.8.0_131/bin:/opt/scala-2.12.15/bin:/opt/spark-3.2.1-bin-hadoop3.2/bin:/opt/sbt/bin:/opt/sbt/bin$PATH"
```
Are we downgrading the version of Scala?
```diff
@@ -69,7 +69,7 @@ A single `*.csv` file containing data similar to:
 #### Run the job
 
 ```bash
-INPUT_FILE_PATH="src/test/resources/data/words.txt" JOB=thoughtworks.wordcount.WordCount ./batect run-job
+INPUT_FILE_PATH="src/test/resources/data/words.txt" OUTPUT_PATH="output" JOB=thoughtworks.wordcount.WordCount ./batect run-job
```
Can we make it implicit?
It was implicit earlier. However, because the output of one stage was the input for the other, the implicit path didn't work, and we had to call it out explicitly.
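As a hedged illustration of that wiring, here is a sketch of the explicit hand-off between the two stages. The job names and the `output_parquet_ingest` directory come from the diffs in this PR; the `mkdir`/`echo` lines are stand-ins for what the ingest job would actually write, since the real `./batect run-job` invocations (shown in comments) can't run outside the repo.

```shell
# Sketch only: the first stage's OUTPUT_PATH becomes the second stage's
# INPUT_FILE_PATH, which is why the path must be spelled out explicitly.
set -e
INGEST_OUTPUT="output_parquet_ingest"   # directory name from the diff

# Stage 1 stand-in for:
#   INPUT_FILE_PATH="src/test/resources/data/citibike.csv" \
#     OUTPUT_PATH="$INGEST_OUTPUT" JOB=thoughtworks.ingest.DailyDriver ./batect run-job
mkdir -p "$INGEST_OUTPUT"
echo "placeholder" > "$INGEST_OUTPUT/part-00000"

# Stage 2 reads exactly the directory stage 1 wrote, stand-in for:
#   INPUT_FILE_PATH="$INGEST_OUTPUT" OUTPUT_PATH="output" \
#     JOB=thoughtworks.citibike.CitibikeTransformer ./batect run-job
INPUT_FILE_PATH="$INGEST_OUTPUT"
test -e "$INPUT_FILE_PATH/part-00000" && echo "stage 2 input ready: $INPUT_FILE_PATH"
```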
```diff
@@ -103,7 +103,7 @@ Historical bike ride `*.csv` file:
 ##### Run the job
 
 ```bash
-INPUT_FILE_PATH="src/test/resources/data/citibike.csv" JOB=thoughtworks.ingest.DailyDriver ./batect run-job
+INPUT_FILE_PATH="src/test/resources/data/citibike.csv" OUTPUT_PATH="output_parquet_ingest" JOB=thoughtworks.ingest.DailyDriver ./batect run-job
```
Can we make it implicit, like a pattern?
```diff
@@ -132,7 +132,7 @@ Historical bike ride `*.parquet` files
 ##### Run the job
 
 ```bash
-INPUT_FILE_PATH=${output_parquest_ingest} JOB=thoughtworks.citibike.CitibikeTransformer ./batect run-job
+INPUT_FILE_PATH="output_parquet_ingest" OUTPUT_PATH="output" JOB=thoughtworks.citibike.CitibikeTransformer ./batect run-job
```
Same as above
```diff
@@ -1,14 +1,14 @@
-scalaVersion := "2.12.4"
+scalaVersion := "2.12.15"
```
This is interesting: so we installed 2.13 in previous commits, but we were actually using 2.12?
We see the inconsistent versions of Scala here. The build file originally contained 2.12.4, while the Docker container was being built with 2.13.5. I just brought them both to the latest 2.12 patch. I chose 2.12 because it had been around longer and is still actively used (unlike Spark 2.x).
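A quick way to catch this kind of version drift is to list every Scala version string across the build files in one pass. This is only a sketch: the `demo_*` files below are stand-ins created for illustration; in the repo you would grep the real `build.sbt` and `Dockerfile`.

```shell
# List every Scala version mentioned across the build files so a
# mismatch like 2.12.4 vs 2.13.5 is obvious at a glance.
set -e
# Stand-in files for illustration only:
printf 'scalaVersion := "2.12.15"\n' > demo_build.sbt
printf 'wget https://downloads.lightbend.com/scala/2.13.5/scala-2.13.5.tgz\n' > demo_Dockerfile
grep -hoE '2\.1[23]\.[0-9]+' demo_build.sbt demo_Dockerfile | sort -u
```

With the stand-in contents above this prints each distinct version once, making a mismatch immediately visible.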
```diff
@@ -2,7 +2,6 @@
 set -e
 
-OUTPUT_PATH="./output"
```
Can we have a pattern instead of removing it?
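One pattern that would keep the default while still allowing the caller to override it is standard shell parameter expansion. This is a sketch, not code from this PR:

```shell
# Fall back to ./output unless the caller set OUTPUT_PATH in the environment.
OUTPUT_PATH="${OUTPUT_PATH:-./output}"
echo "Writing to: $OUTPUT_PATH"
```

With nothing set in the environment this uses `./output`; invoking whatever script hosts it with `OUTPUT_PATH=output_parquet_ingest` would use the caller's value instead, so the explicit per-job override and the old default can coexist.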
```diff
-val sparkVersion = "2.4.0"
+val sparkVersion = "3.2.1"
```
Given that Spark 3.3 is already released and this PR isn't merged, would it be worthwhile to upgrade to Spark 3.3.0?
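If the team did want to take that suggestion, the bump is a one-line change to the pinned version. Below is a sketch against a stand-in file (3.3.0 is the reviewer's suggested target, not what this PR ships, and `demo_sbt_version.sbt` is a placeholder for the real build.sbt):

```shell
set -e
# Stand-in for the real build.sbt:
printf 'val sparkVersion = "3.2.1"\n' > demo_sbt_version.sbt
# Bump the pinned Spark version in place:
sed -i.bak 's/val sparkVersion = "3\.2\.1"/val sparkVersion = "3.3.0"/' demo_sbt_version.sbt
cat demo_sbt_version.sbt
```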
PR Details

Issue

Fixes #5

Contains

- Upgrade to Spark 3.2.1
- Upgrade the Scala version to the latest patch of 2.12.x (2.12.4 -> 2.12.15)
- Fix a typo in the README referencing an incorrect version of sbt

Test with Colima

Environment

Commands run

Status

All commands ran successfully.

Remaining