Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read extra Spark submit parameters from cluster settings #2219

Conversation

dai-chen
Copy link
Collaborator

@dai-chen dai-chen commented Oct 5, 2023

Description

Add a new field sparkSubmitParameters in plugins.query.executionengine.spark.config setting. This will be treated as extra parameters to append to Spark submit parameters generated in code.

{
  "applicationId": "xxx",
  "executionRoleARN": "yyy",
  "region": "zzz",
  "sparkSubmitParameters": "--conf A=1 --conf B=2 ..."
}

Manual test locally:

# Without sparkSubmitParameters configured:
$ curl --request POST \
  --url localhost:9200/_plugins/_async_query \
  --header 'content-type: application/x-ndjson' \
  --data '{ "datasource": "mys3", "lang": "sql", "query": "select * from mys3.default.httplogs limit 5" } }'

Start job request: StartJobRequest(
query=select * from mys3.default.httplogs limit 5,
...  --conf spark.flint.datasource.name=mys3 ,
tags={cluster=integTest, datasource=mys3}, isStructuredStreaming=false, resultIndex=null)


# Add extra parameters to cluster setting:
$ curl --request PUT \
  --url localhost:9200/_cluster/settings \
  --header 'content-type: application/x-ndjson' \
  --data '{ "transient": { "plugins.query.executionengine.spark.config": "{\"applicationId\":\"XXX\",\"executionRoleARN\":\"YYY\",\"region\":\"ZZZ\", \
\"sparkSubmitParameter\": \"--conf spark.dynamicAllocation.enabled=false\"}" } }'

# SQL direct query
$ curl --request POST \
  --url localhost:9200/_plugins/_async_query \
  --header 'content-type: application/x-ndjson' \
  --data '{ "datasource": "mys3", "lang": "sql", "query": "select * from mys3.default.httplogs limit 5" } }'

Start job request: StartJobRequest(
query=select * from mys3.default.httplogs limit 5, 
jobName=integTest:non-index-query,
...  --conf spark.flint.datasource.name=mys3 --conf spark.dynamicAllocation.enabled=false,
tags={cluster=integTest, datasource=mys3}, isStructuredStreaming=false, resultIndex=null)

# SQL create DDL
$ curl --request POST \
  --url localhost:9200/_plugins/_async_query \
  --header 'content-type: application/x-ndjson' \
  --data '{ "datasource": "mys3", "lang": "sql", "query": "create skipping index on mys3.default.httplogs (status VALUE_SET) with (auto_refresh=true)" } }'

Start job request: StartJobRequest(
query=create skipping index on mys3.default.httplogs (status VALUE_SET) with (auto_refresh=true), 
jobName=integTest:index-query, 
...  --conf spark.flint.datasource.name=mys3  --conf spark.flint.job.type=streaming --conf spark.dynamicAllocation.enabled=false,
tags={schema=default, cluster=integTest, datasource=mys3, index=null, table=httplogs},
isStructuredStreaming=true, resultIndex=null)

# PPL query
$ curl --request POST \
  --url localhost:9200/_plugins/_async_query \
  --header 'content-type: application/x-ndjson' \
  --data '{ "datasource": "mys3", "lang": "ppl", "query": "source = mys3.default.httplogs | head" } }'

Start job request: StartJobRequest(
query=source = mys3.default.httplogs | head,
jobName=integTest:non-index-query,
...  --conf spark.flint.datasource.name=mys3 --conf spark.dynamicAllocation.enabled=false,
tags={cluster=integTest, datasource=mys3}, isStructuredStreaming=false, resultIndex=null)

Some enhancement may be useful in future:

  1. "Merge" rather than "append" so we can override parameter generated in code
  2. Configure different extra parameter in settings for different query
{
  ...
  "sparkSubmitParameters": {
      "all": ...
      "sql create": ...
      "sql select": ...
  }
}

Issues Resolved

#2192

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added the enhancement New feature or request label Oct 5, 2023
@dai-chen dai-chen self-assigned this Oct 5, 2023
@codecov
Copy link

codecov bot commented Oct 5, 2023

Codecov Report

Merging #2219 (1a105a5) into main (cd9d768) will increase coverage by 0.00%.
Report is 1 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head 1a105a5 differs from pull request most recent head 3128499. Consider uploading reports for the commit 3128499 to get more accurate results

@@            Coverage Diff            @@
##               main    #2219   +/-   ##
=========================================
  Coverage     96.34%   96.35%           
- Complexity     4724     4726    +2     
=========================================
  Files           439      439           
  Lines         12684    12689    +5     
  Branches        870      872    +2     
=========================================
+ Hits          12221    12227    +6     
  Misses          454      454           
+ Partials          9        8    -1     
Flag Coverage Δ
sql-engine 96.35% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...park/asyncquery/AsyncQueryExecutorServiceImpl.java 100.00% <100.00%> (ø)
.../spark/asyncquery/model/SparkSubmitParameters.java 100.00% <100.00%> (+1.40%) ⬆️
...h/sql/spark/config/SparkExecutionEngineConfig.java 100.00% <ø> (ø)
...rch/sql/spark/dispatcher/SparkQueryDispatcher.java 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

penghuo
penghuo previously approved these changes Oct 5, 2023
vamsimanohar
vamsimanohar previously approved these changes Oct 5, 2023
@vamsimanohar vamsimanohar dismissed stale reviews from penghuo and themself via 3128499 October 5, 2023 23:45
@vamsimanohar vamsimanohar force-pushed the add-spark-submit-parameters-to-cluster-settings branch from 1a105a5 to 3128499 Compare October 5, 2023 23:45
@penghuo penghuo merged commit 492982c into opensearch-project:main Oct 5, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 5, 2023
* Add default setting for Spark execution engine

Signed-off-by: Chen Dai <[email protected]>

* Pass extra parameters to Spark dispatcher

Signed-off-by: Chen Dai <[email protected]>

* Wrap read default setting file with previlege action

Signed-off-by: Chen Dai <[email protected]>

* Fix spotless format

Signed-off-by: Chen Dai <[email protected]>

* Use input stream to read default config file

Signed-off-by: Chen Dai <[email protected]>

* Add UT for dispatcher

Signed-off-by: Chen Dai <[email protected]>

* Add more UT

Signed-off-by: Chen Dai <[email protected]>

* Remove default config setting

Signed-off-by: Chen Dai <[email protected]>

* Fix spotless check in spark module

Signed-off-by: Chen Dai <[email protected]>

* Refactor test code

Signed-off-by: Chen Dai <[email protected]>

* Add more UT on config class

Signed-off-by: Chen Dai <[email protected]>

---------

Signed-off-by: Chen Dai <[email protected]>
(cherry picked from commit 492982c)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@dai-chen dai-chen deleted the add-spark-submit-parameters-to-cluster-settings branch October 5, 2023 23:58
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 5, 2023
* Add default setting for Spark execution engine

Signed-off-by: Chen Dai <[email protected]>

* Pass extra parameters to Spark dispatcher

Signed-off-by: Chen Dai <[email protected]>

* Wrap read default setting file with previlege action

Signed-off-by: Chen Dai <[email protected]>

* Fix spotless format

Signed-off-by: Chen Dai <[email protected]>

* Use input stream to read default config file

Signed-off-by: Chen Dai <[email protected]>

* Add UT for dispatcher

Signed-off-by: Chen Dai <[email protected]>

* Add more UT

Signed-off-by: Chen Dai <[email protected]>

* Remove default config setting

Signed-off-by: Chen Dai <[email protected]>

* Fix spotless check in spark module

Signed-off-by: Chen Dai <[email protected]>

* Refactor test code

Signed-off-by: Chen Dai <[email protected]>

* Add more UT on config class

Signed-off-by: Chen Dai <[email protected]>

---------

Signed-off-by: Chen Dai <[email protected]>
(cherry picked from commit 492982c)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
penghuo pushed a commit that referenced this pull request Oct 6, 2023
* Add default setting for Spark execution engine



* Pass extra parameters to Spark dispatcher



* Wrap read default setting file with previlege action



* Fix spotless format



* Use input stream to read default config file



* Add UT for dispatcher



* Add more UT



* Remove default config setting



* Fix spotless check in spark module



* Refactor test code



* Add more UT on config class



---------


(cherry picked from commit 492982c)

Signed-off-by: Chen Dai <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
penghuo pushed a commit that referenced this pull request Oct 6, 2023
* Add default setting for Spark execution engine



* Pass extra parameters to Spark dispatcher



* Wrap read default setting file with previlege action



* Fix spotless format



* Use input stream to read default config file



* Add UT for dispatcher



* Add more UT



* Remove default config setting



* Fix spotless check in spark module



* Refactor test code



* Add more UT on config class



---------


(cherry picked from commit 492982c)

Signed-off-by: Chen Dai <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants