config executor cores #23
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff            @@
##             main      #23     +/-  ##
=========================================
+ Coverage   76.31%   76.92%   +0.60%
=========================================
  Files           1        1
  Lines          38       39       +1
=========================================
+ Hits           29       30       +1
  Misses          9        9

☔ View full report in Codecov by Sentry.
src/spark/utils.py (Outdated)
@@ -10,6 +10,7 @@
 HADOOP_AWS_VER = os.getenv('HADOOP_AWS_VER')
 DELTA_SPARK_VER = os.getenv('DELTA_SPARK_VER')
 SCALA_VER = os.getenv('SCALA_VER')
+DEFAULT_EXECUTOR_CORES = 1  # the default number of CPU cores that each Spark executor will use
Without this argument, how many cores does each executor use? Can you document that here?
Added comments. I noticed Spark will take all cores on the worker if we don't set this.
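For reference, a minimal sketch of how such a default could be applied in PySpark. DEFAULT_EXECUTOR_CORES matches the constant in the diff above; the helper name and everything else below are illustrative, not the project's actual code:

from pyspark import SparkConf

# Without spark.executor.cores, Spark standalone gives each executor all cores on the worker.
DEFAULT_EXECUTOR_CORES = 1

def _apply_executor_cores(conf: SparkConf, executor_cores: int = DEFAULT_EXECUTOR_CORES) -> SparkConf:
    # Cap the number of CPU cores each executor may use (illustrative helper).
    return conf.set("spark.executor.cores", str(executor_cores))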
src/spark/utils.py (Outdated)
 """
 Helper function to get Delta Lake specific Spark configuration.

 :param jars_str: A comma-separated string of JAR file paths
+:param executor_cores: The number of CPU cores that each Spark executor will use
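For illustration, a hedged sketch of what a Delta Lake config helper taking these two parameters might return. The function name and the exact set of keys are assumptions; only spark.jars and spark.executor.cores are implied by the docstring above, and the two Delta entries are the usual Delta Lake session settings:

def _get_delta_lake_conf_sketch(jars_str: str, executor_cores: int) -> dict:
    # Illustrative only: map Spark config keys to values for a Delta Lake session.
    return {
        "spark.jars": jars_str,                       # comma-separated JAR file paths
        "spark.executor.cores": str(executor_cores),  # CPU cores per executor
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }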
So this only applies if Delta Lake is true and the user doesn't call the get base conf method. It seems like it should always apply.
👍
src/spark/utils.py (Outdated)
@@ -58,32 +63,39 @@ def _stop_spark_session(spark):
     spark.stop()


-def get_base_spark_conf(app_name: str) -> SparkConf:
+def get_base_spark_conf(
Should this be a private method? If not, executor_cores should probably be a default arg.
👍 I think it makes sense to make it a private method.
With these changes, we can control how many cores to use for the Spark session.
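To make the outcome concrete, here is a hedged sketch of the shape discussed above: a private helper with executor_cores as a default argument. The body is illustrative and may not match the final code in src/spark/utils.py:

from pyspark import SparkConf

DEFAULT_EXECUTOR_CORES = 1  # default CPU cores per executor, as in the diff above

def _get_base_spark_conf(app_name: str, executor_cores: int = DEFAULT_EXECUTOR_CORES) -> SparkConf:
    # Build the base configuration and cap cores per executor so a single
    # session cannot claim every core on the worker.
    return (
        SparkConf()
        .setAppName(app_name)
        .set("spark.executor.cores", str(executor_cores))
    )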