Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test_unsupported_fallback_regexp_replace #10093

Merged

Conversation

thirtiseven
Copy link
Collaborator

Fix #10064

Data that make this test fail is ('ôtª+\x1e\x02$&\x80v', '[a-z]+'), there is a $ in string a, which is a special character in regex replace. Java doc:

Note that backslashes () and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll. Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.

This PR simply avoids generating \ and $ in regexp_test.py to fix the test.

@thirtiseven thirtiseven self-assigned this Dec 25, 2023
@thirtiseven
Copy link
Collaborator Author

build

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

351 PR check failure is unrelated #10094

integration_tests/src/main/python/regexp_test.py Outdated Show resolved Hide resolved
@sameerz sameerz requested a review from NVnavkumar December 26, 2023 00:28
@sameerz sameerz added the test Only impacts tests label Dec 26, 2023
Signed-off-by: Haoyang Li <[email protected]>
@revans2
Copy link
Collaborator

revans2 commented Dec 26, 2023

build

@@ -30,7 +30,7 @@
_regexp_conf = { 'spark.rapids.sql.regexp.enabled': True }

def mk_str_gen(pattern):
return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')
return StringGen(pattern).with_special_case('').with_special_pattern(r'[^\\$]{0,10}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So backslashes and dollar signs can cause issues for the replacement string in regexp_replace on the CPU, but we fall back to the CPU properly. So the problem is really just with the data that we generated for this one test. Why then are we getting rid of backslashes and dollar signs for mk_str_gen by default instead of targeting it specifically to just the replacement string for regexp_replace?

i.e. on line 942 we have the following instead.

gen = StringGen('[abcdef]{0,2}').with_special_case('').withSpecialPattern(r'[^\\$]{0,10}')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When mk_str_gen is used to generate regular expressions, the .with_special_pattern('.{0,10}') in it sometimes generates valid or invalid regexps and fails some tests.

For this issue, if adding some special cases (which can be generated by '.{0,10}') in it:

def mk_str_gen(pattern):
    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}').with_special_case(('ab$cd', 100)).with_special_case(('ab\$cd', 100))

and run:

./integration_tests/run_pyspark_from_build.sh -k regexp

we will get:

===== 45 failed, 25 passed, 9 warnings in 67.17s (0:01:07) ======

So I think we can leave it by default in regexp_test for now, to at least avoid some random IT failures.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think mk_str_gen is not very ideal when generating regex data, we might need a better way to generate random literal strings / valid regexps/ invalid regexps as test data.
filed #10098.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @thirtiseven

There could be a confusion here. Bobby's comment refers to the local var definition inside the test function test_unsupported_fallback_regexp_replace

the diff should be something like

diff --git a/integration_tests/src/main/python/regexp_test.py b/integration_tests/src/main/python/regexp_test.py
index 018736f9c9..f015d05502 100644
--- a/integration_tests/src/main/python/regexp_test.py
+++ b/integration_tests/src/main/python/regexp_test.py
@@ -939,7 +939,7 @@ def test_unsupported_fallback_regexp_extract_all():
 
 @allow_non_gpu('ProjectExec', 'RegExpReplace')
 def test_unsupported_fallback_regexp_replace():
-    gen = mk_str_gen('[abcdef]{0,2}')
+    gen = StringGen('[abcdef]{0,2}').with_special_case('').with_special_pattern(r'[^\\$]{0,10}')
     regex_gen = StringGen(r'\[a-z\]\+')
     def assert_gpu_did_fallback(sql_text):
         assert_gpu_fallback_collect(lambda spark:

and I see 0 failures with this change for -k regexp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gerashegalov

Making the datagen inside the test function makes sense to me and will fix the issue with no failures. I'm trying to explain why I think it might be a general issue when generating regexps with mk_str_gen.

and I see 0 failures with this change for -k regexp.

My step is modifying the mk_str_gen inside the regexp_test.py file to add more special cases and run the -k regexp.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't generate regexps with mk_str_gen. Looking through regexp_test.py, every regexp I see is hard coded, except for test_unsupported_fallback_regexp_extract, test_unsupported_fallback_regexp_extract_all, and test_unsupported_fallback_regexp_replace which do not use mk_str_gen for the regexp. I don't see any gen_scalar* in there that would allow us to make a random scalar regular expression either. I think we should probably update those StringGens used to create the regular expressions so that they don't have any special cases in them. We don't need them to test what we are testing.

Copy link
Collaborator Author

@thirtiseven thirtiseven Dec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my bad, You are right, we don't generate regexps with mk_str_gen.

I tested with with_special_pattern in a wrong way.

Closed the issue and updated this fix inside the case.

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

build

@@ -30,7 +30,7 @@
_regexp_conf = { 'spark.rapids.sql.regexp.enabled': True }

def mk_str_gen(pattern):
return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')
return StringGen(pattern).with_special_case('')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change still here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Haoyang Li <[email protected]>
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thirtiseven
Copy link
Collaborator Author

build

@revans2 revans2 merged commit 9877a3f into NVIDIA:branch-24.02 Dec 29, 2023
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] test_unsupported_fallback_regexp_replace failed with DATAGEN_SEED=1702662063
4 participants