Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle exception for invalid column names containing . #102

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sidharthbolar
Copy link

@sidharthbolar sidharthbolar commented Aug 29, 2022

Issue #, if available:
#99
Description of changes:
Columns with dot in their name are throwing an Analysis Exception
Based on analysis done , the exception is being thrown by the core scala deeque library
Have proposed to handle this exception to notify the caller that a column name with "." should not be passed
the exception handled is added to all check methods and implemented via a static method

Local Tests were passing and have added an additional UT test_invalidColumnException to verify the behaviour

*version = "1.0.1"
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

"""
:param column: column to which constraint is to be applied
"""
if "." in column:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is quite strict, perhaps simply enclosing the field with ticks will do, like f"`column.name`"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@herminio-iovio passing through ticks also does not work and returns the same error

Copy link
Contributor

@chenliu0831 chenliu0831 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use dot with column name in Spark/SQL is quite legitimate, basically to fetch values from fields that are nested. Deequ/PyDeequ as a library does not seem to be the right place to handle this since it's hard to guess customer's intent. For example, this might break some existing use-cases where people actually want to profile/check nested column types.

I think it's up to the caller or other higher level library / service to make the call (e.g. escape with "`" or rename)

@sidharthbolar
Copy link
Author

hi @chenliu0831 thanks for the review
In that case this PR handles this exception and notifies the caller to not pass a column with a "." so that he can handle it outside this library

@chenliu0831
Copy link
Contributor

hi @chenliu0831 thanks for the review In that case this PR handles this exception and notifies the caller to not pass a column with a "." so that he can handle it outside this library

@sidharthbolar right I think I understand the purpose, but column with dot is a valid usage for structured column (JSON type). Do you mind add a test to verify if that works?

Looks like you actually want a better Pythonic error (than the vanilla Spark error) when such column with dot does not exist? If so, how about add a helper to check column existence instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants