[BUG] Spark Connect not detected on Databricks #102
Comments
We could take inspiration from the Spark 4.0 pre-release; it's unfortunate that this util is not available in earlier Spark versions:

```python
def is_remote() -> bool:
    """
    Returns if the current running environment is for Spark Connect.

    .. versionadded:: 4.0.0

    Notes
    -----
    This will only return ``True`` if there is a remote session running.
    Otherwise, it returns ``False``.

    This API is unstable, and for developers.

    Returns
    -------
    bool

    Examples
    --------
    >>> from pyspark.sql import is_remote
    >>> is_remote()
    False
    """
    return ("SPARK_CONNECT_MODE_ENABLED" in os.environ) or is_remote_only()
```
Here is another Spark 4.0 nugget, which shows how things are going to change in the next release:

```python
def is_remote_only() -> bool:
    """
    Returns if the current running environment is only for Spark Connect.
    If users install pyspark-connect alone, RDD API does not exist.

    .. versionadded:: 4.0.0

    Notes
    -----
    This will only return ``True`` if installed PySpark is only for Spark Connect.
    Otherwise, it returns ``False``.

    This API is unstable, and for developers.

    Returns
    -------
    bool

    Examples
    --------
    >>> from pyspark.sql import is_remote
    >>> is_remote()
    False
    """
    global _is_remote_only

    if "SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ:
        return True
    if _is_remote_only is not None:
        return _is_remote_only

    try:
        from pyspark import core  # noqa: F401

        _is_remote_only = False
        return _is_remote_only
    except ImportError:
        _is_remote_only = True
        return _is_remote_only
```

source: https://spark.apache.org/docs/4.0.0-preview1/api/python/_modules/pyspark/util#is_remote_only

Namely, you can see that pyspark-connect is a separate package with no immediate relationship to 'core'. This is an aside, yet related. My point: we should do what works for Spark 3.4 and 3.5, which might mean relying on there being an active SparkSession and determining this from the class instance type of the SparkSession, specifically because of the changes we can see coming in Spark 4 (when it comes out).
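A minimal sketch of the instance-type approach described above, for Spark 3.4/3.5; the function name is illustrative, and it assumes the caller already has a session object in hand:

```python
def is_connect_session(spark) -> bool:
    """Return True if `spark` is a Spark Connect session (Spark 3.4/3.5)."""
    try:
        from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
    except ImportError:
        # pyspark was installed without the connect module, so this cannot be a Connect session
        return False
    return isinstance(spark, ConnectSparkSession)
```

This sidesteps environment variables and configuration entries entirely, which is presumably what makes it workable on Databricks, where those variables are not set.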
Thank you for your reply. I have just checked: this check is present on https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/python/pyspark/sql/utils.py#L156 (3.5 branch).
Neither of these options is set on Databricks Runtime :/ There is also another bug that surfaces after manually setting that option; there are a couple of them, actually:
Given the proliferation of this check in several places, it makes sense to perform it only once at instance initialisation, or to spin up a separate class.
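A sketch of the "check once" idea above, caching the detection result so call sites don't repeat it; the class and method names are illustrative, not koheesio's actual API:

```python
from typing import Optional


class SparkConnectDetector:
    """Evaluate Spark Connect detection a single time and reuse the cached answer."""

    _is_connect: Optional[bool] = None

    @classmethod
    def is_connect(cls, spark) -> bool:
        # The first call performs the detection; later calls return the cached result.
        if cls._is_connect is None:
            cls._is_connect = type(spark).__module__.startswith("pyspark.sql.connect")
        return cls._is_connect
```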
<!--- Provide a general summary of your changes in the Title above --> ## Description <!--- Describe your changes in detail --> Additonal fixes for #101 and #102 ## Related Issue <!--- This project only accepts pull requests related to open issues --> <!--- If suggesting a new feature or change, please discuss it in an issue first --> <!--- If fixing a bug, there should be an issue describing it with steps to reproduce --> <!--- Please link to the issue here: --> #101, #102 ## Motivation and Context <!--- Why is this change required? What problem does it solve? --> ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> ## Screenshots (if appropriate): ## Types of changes <!--- What types of changes does your code introduce? Put an `x` in all the boxes that apply: --> - [x] Bug fix (non-breaking change which fixes an issue) - [ ] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) ## Checklist: <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> <!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> - [x] My code follows the code style of this project. - [ ] My change requires a change to the documentation. - [ ] I have updated the documentation accordingly. - [x] I have read the **CONTRIBUTING** document. - [ ] I have added tests to cover my changes. - [x] All new and existing tests passed. --------- Co-authored-by: Danny Meijer <[email protected]>
Validated on 0.9.0rc6; the Spark Connect session is now correctly detected.
**Describe the bug**

When running on Databricks in Spark Connect mode (for example Shared Isolation mode, Job Cluster, DBR 15.4), Spark Connect isn't detected by koheesio, which leads to exceptions in code that can only be executed on a regular `SparkSession`.

**Steps to Reproduce**

Run code that involves, for example, use of `DeltaTableWriter` with `merge_builder` passed as a `DeltaMergeBuilder`; it will fail on the 0.9.0rc versions.

**Expected behavior**

Spark Connect is detected while using Koheesio 0.9.0 (future release branch).

**Environment**

**Additional context**

The issue is caused by relying on the non-existent configuration parameter `spark.remote`:

koheesio/src/koheesio/spark/utils/connect.py, Line 17 in abb435b
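For illustration only (the actual code at line 17 may differ): a check keyed off the `spark.remote` configuration entry sees nothing on Databricks Spark Connect clusters, whereas inspecting the session object itself still identifies the remote session.

```python
def detected_via_conf(spark) -> bool:
    # Hypothetical version of a conf-based check: on Databricks Spark Connect clusters
    # "spark.remote" is typically not set, so this reports a plain session even though
    # the client is actually remote.
    return spark.conf.get("spark.remote", None) is not None


def detected_via_class(spark) -> bool:
    # The session's class still reveals that it is a Connect session.
    return type(spark).__module__.startswith("pyspark.sql.connect")
```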