
[BUG] Spark Connect not detected on Databricks #102

Closed
YevIgn opened this issue Nov 13, 2024 · 5 comments · Fixed by #130
Labels
bug Something isn't working
Milestone
0.9.0
Comments


YevIgn commented Nov 13, 2024

Describe the bug

When running on Databricks in Spark Connect mode (for example, Shared isolation mode, job cluster, DBR 15.4), Spark Connect isn't detected by koheesio, which leads to exceptions in code that can only be executed on a regular SparkSession.

Steps to Reproduce

Run code that involves, for example, use of DeltaTableWriter with merge_builder passed as a DeltaMergeBuilder; it will fail on the 0.9.0rc versions.
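For illustration, a minimal repro sketch along these lines (the table name and source_df are hypothetical, and the exact DeltaTableWriter import path and signature are assumptions, not verified against koheesio 0.9.0rc):

from delta.tables import DeltaTable
from koheesio.spark.writers.delta import DeltaTableWriter  # import path assumed

# Standard delta-spark merge-builder construction; assumes an active `spark`
# session and a `source_df` DataFrame are already in scope.
target = DeltaTable.forName(spark, "catalog.schema.target")  # hypothetical table
merge_builder = (
    target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
)

# On 0.9.0rc under Spark Connect this path fails, because the session is not
# recognised as a Connect session (writer signature assumed).
DeltaTableWriter(table="catalog.schema.target", merge_builder=merge_builder).write()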

Expected behavior

Spark Connect is detected when using Koheesio 0.9.0 (future release branch).

Environment

  • DBR 15.4
  • Koheesio 0.9.0rc2

Additional context

The issue is caused by relying on the non-existent configuration parameter spark.remote:

result = True if _spark.conf.get("spark.remote", None) else False # type: ignore
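
Spelled out, the failure mode looks like this (assuming, per the above, that Databricks never sets spark.remote in the session conf):

# `spark.remote` is absent from the session conf on Databricks, so the
# default (None) is returned and the check always comes out False, even
# on a genuine Spark Connect session.
remote_conf = _spark.conf.get("spark.remote", None)  # -> None on Databricks
result = True if remote_conf else False  # -> always False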

@YevIgn YevIgn added the bug Something isn't working label Nov 13, 2024
@dannymeijer
Member

We could take inspiration from the Spark 4.0 pre-release; it's unfortunate that this util is not available in earlier Spark versions:
https://spark.apache.org/docs/4.0.0-preview1/api/python/_modules/pyspark/sql/utils.html#is_remote

def is_remote() -> bool:
    """
    Returns if the current running environment is for Spark Connect.

    .. versionadded:: 4.0.0

    Notes
    -----
    This will only return ``True`` if there is a remote session running.
    Otherwise, it returns ``False``.

    This API is unstable, and for developers.

    Returns
    -------
    bool

    Examples
    --------
    >>> from pyspark.sql import is_remote
    >>> is_remote()
    False
    """
    return ("SPARK_CONNECT_MODE_ENABLED" in os.environ) or is_remote_only()

@dannymeijer
Copy link
Member

Here is another Spark 4.0 nugget, which shows how this is going to change in the next release:

def is_remote_only() -> bool:
    """
    Returns if the current running environment is only for Spark Connect.
    If users install pyspark-connect alone, RDD API does not exist.

    .. versionadded:: 4.0.0

    Notes
    -----
    This will only return ``True`` if installed PySpark is only for Spark Connect.
    Otherwise, it returns ``False``.

    This API is unstable, and for developers.

    Returns
    -------
    bool

    Examples
    --------
    >>> from pyspark.sql import is_remote
    >>> is_remote()
    False
    """
    global _is_remote_only

    if "SPARK_SKIP_CONNECT_COMPAT_TESTS" in os.environ:
        return True

    if _is_remote_only is not None:
        return _is_remote_only
    try:
        from pyspark import core  # noqa: F401

        _is_remote_only = False
        return _is_remote_only
    except ImportError:
        _is_remote_only = True
        return _is_remote_only

source: https://spark.apache.org/docs/4.0.0-preview1/api/python/_modules/pyspark/util#is_remote_only

Namely, you can see that pyspark-connect is a separate package with no immediate relationship to 'core'. This is an aside, yet related.

My point: we should do what works for Spark 3.4 and 3.5, which might mean we have to rely on there being an active SparkSession and determine this from the class instance type of that SparkSession; specifically because of the changes we can see coming in Spark 4 (when it comes out).
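
A minimal sketch of that instance-type approach for Spark 3.4/3.5 (the helper name is hypothetical; pyspark.sql.connect.session is only importable when the Connect extras are installed):

def is_connect_session(spark) -> bool:
    # Detect Spark Connect by the concrete class of the session, rather
    # than by conf values or environment variables.
    try:
        from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
    except ImportError:
        # Connect support is not installed, so this cannot be a Connect session.
        return False
    return isinstance(spark, ConnectSparkSession)

In 3.4/3.5 the classic and Connect SparkSession are distinct classes, so the isinstance check cleanly separates the two.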


YevIgn commented Nov 13, 2024

Thank you for your reply. I have just checked that this check is present in the 3.5 branch: https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/python/pyspark/sql/utils.py#L156
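
For reference, that developer API can be exercised like this on Spark 3.5 (a sketch; per the source linked above, it only consults the SPARK_CONNECT_MODE_ENABLED environment variable):

# Developer/unstable API; lives in pyspark.sql.utils on the 3.5 branch.
from pyspark.sql.utils import is_remote

print(is_remote())  # False on DBR 15.4, even in Spark Connect mode (see below)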


YevIgn commented Nov 15, 2024

Neither of these options is set on Databricks Runtime :/

But there is also another bug to be fixed: after manually setting spark.remote, the code continues to execute but then fails on a non-replaced isinstance(writer, DeltaMergeBuilder) check further on.

There are actually a couple of them. Given the proliferation of this check in several places, it makes sense to perform it only once at instance initialisation, or to spin up a separate class, as sketched below.
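
A sketch of that "only once" idea, reusing an instance-type probe like the one sketched earlier (all names hypothetical, not koheesio's actual API):

def _is_connect(spark) -> bool:
    # Same instance-type probe as sketched earlier in this thread.
    try:
        from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
    except ImportError:
        return False
    return isinstance(spark, ConnectSparkSession)


class SparkConnectAware:
    # Hypothetical base class: evaluate the probe once at initialisation
    # and reuse the cached result at every former call site.
    def __init__(self, spark) -> None:
        self.spark = spark
        self.is_connect = _is_connect(spark)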

@dannymeijer dannymeijer added this to the 0.9.0 milestone Nov 18, 2024
@dannymeijer dannymeijer linked a pull request Nov 26, 2024 that will close this issue
dannymeijer added a commit that referenced this issue Nov 26, 2024
## Description
Additional fixes for #101 and #102

## Related Issue
#101, #102

## Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Checklist:
- [x] My code follows the code style of this project.
- [ ] My change requires a change to the documentation.
- [ ] I have updated the documentation accordingly.
- [x] I have read the **CONTRIBUTING** document.
- [ ] I have added tests to cover my changes.
- [x] All new and existing tests passed.

---------

Co-authored-by: Danny Meijer <[email protected]>

YevIgn commented Nov 26, 2024

Validated on 0.9.0rc6; the Spark Connect session is now correctly detected.

@YevIgn YevIgn closed this as completed Nov 26, 2024