
[BUG] Spark Connect not detected on Databricks #102

Closed
YevIgn opened this issue Nov 13, 2024 · 5 comments · Fixed by #130
Labels
bug Something isn't working
Milestone
0.9.0
Comments


YevIgn commented Nov 13, 2024

Describe the bug

When running on Databricks in Spark Connect mode (for example, Shared isolation mode, job cluster, DBR 15.4), Spark Connect isn't detected by koheesio, which leads to exceptions in code that can only be executed on a regular SparkSession.

Steps to Reproduce

Run code that involves, for example, use of DeltaTableWriter with merge_builder passed as a DeltaMergeBuilder; it will fail on the 0.9.0rc versions.
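For illustration, a minimal repro sketch along these lines (the table name and source_df are hypothetical, and the exact DeltaTableWriter import path and signature are assumptions, not verified against koheesio 0.9.0rc):

from delta.tables import DeltaTable
from koheesio.spark.writers.delta import DeltaTableWriter  # import path assumed

# Standard delta-spark merge-builder construction; assumes an active `spark`
# session and a `source_df` DataFrame are already in scope.
target = DeltaTable.forName(spark, "catalog.schema.target")  # hypothetical table
merge_builder = (
    target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
)

# On 0.9.0rc under Spark Connect this path fails, because the session is not
# recognised as a Connect session (writer signature assumed).
DeltaTableWriter(table="catalog.schema.target", merge_builder=merge_builder).write()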

Expected behavior

Spark Connect is detected when using Koheesio 0.9.0 (future release branch).

Environment

  • DBR 15.4
  • Koheesio 0.9.0rc2

Additional context

The issue is caused by relying on the non-existent configuration parameter spark.remote:

result = True if _spark.conf.get("spark.remote", None) else False # type: ignore
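
Spelled out, the failure mode looks like this (assuming, per the above, that Databricks never sets spark.remote in the session conf):

# `spark.remote` is absent from the session conf on Databricks, so the
# default (None) is returned and the check always comes out False, even
# on a genuine Spark Connect session.
remote_conf = _spark.conf.get("spark.remote", None)  # -> None on Databricks
result = True if remote_conf else False  # -> always False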

@YevIgn YevIgn added the bug Something isn't working label Nov 13, 2024
@dannymeijer
Member

We could take inspiration from the Spark 4.0 pre-release; it's unfortunate that this util is not available in earlier Spark versions:
https://spark.apache.org/docs/4.0.0-preview1/api/python/_modules/pyspark/sql/utils.html#is_remote

def is_remote() -> bool:
    """
    Returns if the current running environment is for Spark Connect.

    .. versionadded:: 4.0.0

    Notes
    -----
    This will only return ``True`` if there is a remote session running.
    Otherwise, it returns ``False``.

    This API is unstable, and for developers.

    Returns
    -------
    bool

    Examples
    --------
    >>> from pyspark.sql import is_remote
    >>> is_remote()
    False
    """
    return ("SPARK_CONNECT_MODE_ENABLED" in os.environ) or is_remote_only()

@dannymeijer
Copy link
Member

Here is another Spark 4.0 nugget, which shows how this is going to change in the next release:

def is_remote_only() -> bool:
    """
    Returns if the current running environment is only for Spark Connect.
    If users install pyspark-connect alone, RDD API does not exist.

    .. versionadded:: 4.0.0

    Notes
    -----
    This will only return ``True`` if installed PySpark is only for Spark Connect.
    Otherwise, it returns ``False``.

    This API is unstable, and for developers.

    Returns
    -------
    bool

    Examples
    --------
    >>> from pyspark.sql import is_remote
    >>> is_remote()
    False
    """
    global _is_remote_only

    if "SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ:
        return True

    if _is_remote_only is not None:
        return _is_remote_only
    try:
        from pyspark import core  # noqa: F401

        _is_remote_only = False
        return _is_remote_only
    except ImportError:
        _is_remote_only = True
        return _is_remote_only

source: https://spark.apache.org/docs/4.0.0-preview1/api/python/_modules/pyspark/util#is_remote_only

Namely, you can see that pyspark-connect is a separate package with no immediate relationship to 'core'. This is an aside, yet related.

My point: we should do what works for Spark 3.4 and 3.5, which might mean we have to rely on there being an active SparkSession and determine this from the class instance type of that SparkSession; specifically because of the changes we can see coming in Spark 4 (when it comes out).
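
A minimal sketch of that instance-type approach for Spark 3.4/3.5 (the helper name is hypothetical; pyspark.sql.connect.session is only importable when the Connect extras are installed):

def is_connect_session(spark) -> bool:
    # Detect Spark Connect by the concrete class of the session, rather
    # than by conf values or environment variables.
    try:
        from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
    except ImportError:
        # Connect support is not installed, so this cannot be a Connect session.
        return False
    return isinstance(spark, ConnectSparkSession)

In 3.4/3.5 the classic and Connect SparkSession are distinct classes, so the isinstance check cleanly separates the two.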


YevIgn commented Nov 13, 2024

Thank you for your reply. I have just checked that this check is present in the 3.5 branch: https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/python/pyspark/sql/utils.py#L156
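
For reference, that developer API can be exercised like this on Spark 3.5 (a sketch; per the source linked above, it only consults the SPARK_CONNECT_MODE_ENABLED environment variable):

# Developer/unstable API; lives in pyspark.sql.utils on the 3.5 branch.
from pyspark.sql.utils import is_remote

print(is_remote())  # False on DBR 15.4, even in Spark Connect mode (see below)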


YevIgn commented Nov 15, 2024

Neither of these options is set on Databricks Runtime :/

But there is also another bug to be fixed: after manually setting spark.remote, the code continues to execute but then fails on a non-replaced isinstance(writer, DeltaMergeBuilder) check further on.

There are actually a couple of them. Given the proliferation of this check in several places, it makes sense to perform it only once at instance initialisation, or to spin up a separate class, as sketched below.
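
A sketch of that "only once" idea, reusing an instance-type probe like the one sketched earlier (all names hypothetical, not koheesio's actual API):

def _is_connect(spark) -> bool:
    # Same instance-type probe as sketched earlier in this thread.
    try:
        from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
    except ImportError:
        return False
    return isinstance(spark, ConnectSparkSession)


class SparkConnectAware:
    # Hypothetical base class: evaluate the probe once at initialisation
    # and reuse the cached result at every former call site.
    def __init__(self, spark) -> None:
        self.spark = spark
        self.is_connect = _is_connect(spark)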

@dannymeijer dannymeijer added this to the 0.9.0 milestone Nov 18, 2024
@dannymeijer dannymeijer linked a pull request Nov 26, 2024 that will close this issue
dannymeijer added a commit that referenced this issue Nov 26, 2024
## Description
Additional fixes for #101 and #102

## Related Issue
#101, #102

## Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Checklist:
- [x] My code follows the code style of this project.
- [ ] My change requires a change to the documentation.
- [ ] I have updated the documentation accordingly.
- [x] I have read the **CONTRIBUTING** document.
- [ ] I have added tests to cover my changes.
- [x] All new and existing tests passed.

---------

Co-authored-by: Danny Meijer <[email protected]>

YevIgn commented Nov 26, 2024

Validated on 0.9.0rc6; the Spark Connect session is now correctly detected.

@YevIgn YevIgn closed this as completed Nov 26, 2024