Fix test_unsupported_fallback_regexp_replace #10093

thirtiseven · 2023-12-25T10:29:25Z

Data that make this test fail is ('ôtª+\x1e\x02$&\x80v', '[a-z]+'), there is a $ in string a, which is a special character in regex replace. Java doc:

Note that backslashes () and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll. Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.

This PR simply avoids generating \ and $ in regexp_test.py to fix the test.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-12-25T10:30:56Z

build

gerashegalov

351 PR check failure is unrelated #10094

integration_tests/src/main/python/regexp_test.py

Signed-off-by: Haoyang Li <[email protected]>

revans2 · 2023-12-26T14:53:49Z

build

revans2 · 2023-12-26T15:16:22Z

integration_tests/src/main/python/regexp_test.py

@@ -30,7 +30,7 @@
 _regexp_conf = { 'spark.rapids.sql.regexp.enabled': True }

 def mk_str_gen(pattern):
-    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')
+    return StringGen(pattern).with_special_case('').with_special_pattern(r'[^\\$]{0,10}')


So backslashes and dollar signs can cause issues for the replacement string in regexp_replace on the CPU, but we fall back to the CPU properly. So the problem is really just with the data that we generated for this one test. Why then are we getting rid of backslashes and dollar signs for mk_str_gen by default instead of targeting it specifically to just the replacement string for regexp_replace?

i.e. on line 942 we have the following instead.

gen = StringGen('[abcdef]{0,2}').with_special_case('').withSpecialPattern(r'[^\\$]{0,10}')

When mk_str_gen is used to generate regular expressions, the .with_special_pattern('.{0,10}') in it sometimes generates valid or invalid regexps and fails some tests.

For this issue, if adding some special cases (which can be generated by '.{0,10}') in it:

def mk_str_gen(pattern): return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}').with_special_case(('ab$cd', 100)).with_special_case(('ab\$cd', 100))

and run:

./integration_tests/run_pyspark_from_build.sh -k regexp

we will get:

===== 45 failed, 25 passed, 9 warnings in 67.17s (0:01:07) ======

So I think we can leave it by default in regexp_test for now, to at least avoid some random IT failures.

And I think mk_str_gen is not very ideal when generating regex data, we might need a better way to generate random literal strings / valid regexps/ invalid regexps as test data.
filed #10098.

Hi @thirtiseven

There could be a confusion here. Bobby's comment refers to the local var definition inside the test function test_unsupported_fallback_regexp_replace

the diff should be something like

diff --git a/integration_tests/src/main/python/regexp_test.py b/integration_tests/src/main/python/regexp_test.py index 018736f9c9..f015d05502 100644 --- a/integration_tests/src/main/python/regexp_test.py +++ b/integration_tests/src/main/python/regexp_test.py @@ -939,7 +939,7 @@ def test_unsupported_fallback_regexp_extract_all(): @allow_non_gpu('ProjectExec', 'RegExpReplace') def test_unsupported_fallback_regexp_replace(): - gen = mk_str_gen('[abcdef]{0,2}') + gen = StringGen('[abcdef]{0,2}').with_special_case('').with_special_pattern(r'[^\\$]{0,10}') regex_gen = StringGen(r'\[a-z\]\+') def assert_gpu_did_fallback(sql_text): assert_gpu_fallback_collect(lambda spark:

and I see 0 failures with this change for -k regexp

Thanks @gerashegalov

Making the datagen inside the test function makes sense to me and will fix the issue with no failures. I'm trying to explain why I think it might be a general issue when generating regexps with mk_str_gen.

and I see 0 failures with this change for -k regexp.

My step is modifying the mk_str_gen inside the regexp_test.py file to add more special cases and run the -k regexp.

But we don't generate regexps with mk_str_gen. Looking through regexp_test.py, every regexp I see is hard coded, except for test_unsupported_fallback_regexp_extract, test_unsupported_fallback_regexp_extract_all, and test_unsupported_fallback_regexp_replace which do not use mk_str_gen for the regexp. I don't see any gen_scalar* in there that would allow us to make a random scalar regular expression either. I think we should probably update those StringGens used to create the regular expressions so that they don't have any special cases in them. We don't need them to test what we are testing.

Sorry my bad, You are right, we don't generate regexps with mk_str_gen.

I tested with with_special_pattern in a wrong way.

Closed the issue and updated this fix inside the case.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-12-28T11:08:12Z

build

revans2 · 2023-12-28T16:45:21Z

integration_tests/src/main/python/regexp_test.py

@@ -30,7 +30,7 @@
 _regexp_conf = { 'spark.rapids.sql.regexp.enabled': True }

 def mk_str_gen(pattern):
-    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')
+    return StringGen(pattern).with_special_case('')


Why is this change still here?

Signed-off-by: Haoyang Li <[email protected]>

gerashegalov

LGTM

thirtiseven · 2023-12-29T00:08:09Z

build

Fix test_unsupported_fallback_regexp_replace

0c08996

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven self-assigned this Dec 25, 2023

gerashegalov reviewed Dec 25, 2023

View reviewed changes

integration_tests/src/main/python/regexp_test.py Outdated Show resolved Hide resolved

sameerz requested a review from NVnavkumar December 26, 2023 00:28

sameerz added the test Only impacts tests label Dec 26, 2023

Use raw string

155a8d7

Signed-off-by: Haoyang Li <[email protected]>

revans2 requested changes Dec 26, 2023

View reviewed changes

sameerz mentioned this pull request Dec 27, 2023

mk_str_gen in Integration test is misused to generate regular expressions #10098

Closed

address comment

93fe145

Signed-off-by: Haoyang Li <[email protected]>

revans2 reviewed Dec 28, 2023

View reviewed changes

Recover mk_str_gen

9fab29e

Signed-off-by: Haoyang Li <[email protected]>

gerashegalov approved these changes Dec 29, 2023

View reviewed changes

revans2 approved these changes Dec 29, 2023

View reviewed changes

revans2 merged commit 9877a3f into NVIDIA:branch-24.02 Dec 29, 2023
39 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix test_unsupported_fallback_regexp_replace #10093

Fix test_unsupported_fallback_regexp_replace #10093

thirtiseven commented Dec 25, 2023

thirtiseven commented Dec 25, 2023

gerashegalov left a comment •

edited

Loading

revans2 commented Dec 26, 2023

revans2 Dec 26, 2023

thirtiseven Dec 27, 2023

thirtiseven Dec 27, 2023

gerashegalov Dec 27, 2023

thirtiseven Dec 27, 2023

revans2 Dec 27, 2023

thirtiseven Dec 28, 2023 •

edited

Loading

thirtiseven commented Dec 28, 2023

revans2 Dec 28, 2023

thirtiseven Dec 28, 2023

gerashegalov left a comment

thirtiseven commented Dec 29, 2023

Fix test_unsupported_fallback_regexp_replace #10093

Fix test_unsupported_fallback_regexp_replace #10093

Conversation

thirtiseven commented Dec 25, 2023

thirtiseven commented Dec 25, 2023

gerashegalov left a comment • edited Loading

Choose a reason for hiding this comment

revans2 commented Dec 26, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thirtiseven Dec 28, 2023 • edited Loading

Choose a reason for hiding this comment

thirtiseven commented Dec 28, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gerashegalov left a comment

Choose a reason for hiding this comment

thirtiseven commented Dec 29, 2023

gerashegalov left a comment •

edited

Loading

thirtiseven Dec 28, 2023 •

edited

Loading