-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix test_unsupported_fallback_regexp_replace #10093
Fix test_unsupported_fallback_regexp_replace #10093
Conversation
Signed-off-by: Haoyang Li <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
351 PR check failure is unrelated #10094
Signed-off-by: Haoyang Li <[email protected]>
build |
@@ -30,7 +30,7 @@ | |||
_regexp_conf = { 'spark.rapids.sql.regexp.enabled': True } | |||
|
|||
def mk_str_gen(pattern): | |||
return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}') | |||
return StringGen(pattern).with_special_case('').with_special_pattern(r'[^\\$]{0,10}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So backslashes and dollar signs can cause issues for the replacement string in regexp_replace on the CPU, but we fall back to the CPU properly. So the problem is really just with the data that we generated for this one test. Why then are we getting rid of backslashes and dollar signs for mk_str_gen
by default instead of targeting it specifically to just the replacement string for regexp_replace
?
i.e. on line 942 we have the following instead.
gen = StringGen('[abcdef]{0,2}').with_special_case('').withSpecialPattern(r'[^\\$]{0,10}')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When mk_str_gen
is used to generate regular expressions, the .with_special_pattern('.{0,10}')
in it sometimes generates valid or invalid regexps and fails some tests.
For this issue, if adding some special cases (which can be generated by '.{0,10}'
) in it:
def mk_str_gen(pattern):
return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}').with_special_case(('ab$cd', 100)).with_special_case(('ab\$cd', 100))
and run:
./integration_tests/run_pyspark_from_build.sh -k regexp
we will get:
===== 45 failed, 25 passed, 9 warnings in 67.17s (0:01:07) ======
So I think we can leave it by default in regexp_test for now, to at least avoid some random IT failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think mk_str_gen is not very ideal when generating regex data, we might need a better way to generate random literal strings / valid regexps/ invalid regexps as test data.
filed #10098.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @thirtiseven
There could be a confusion here. Bobby's comment refers to the local var definition inside the test function test_unsupported_fallback_regexp_replace
the diff should be something like
diff --git a/integration_tests/src/main/python/regexp_test.py b/integration_tests/src/main/python/regexp_test.py
index 018736f9c9..f015d05502 100644
--- a/integration_tests/src/main/python/regexp_test.py
+++ b/integration_tests/src/main/python/regexp_test.py
@@ -939,7 +939,7 @@ def test_unsupported_fallback_regexp_extract_all():
@allow_non_gpu('ProjectExec', 'RegExpReplace')
def test_unsupported_fallback_regexp_replace():
- gen = mk_str_gen('[abcdef]{0,2}')
+ gen = StringGen('[abcdef]{0,2}').with_special_case('').with_special_pattern(r'[^\\$]{0,10}')
regex_gen = StringGen(r'\[a-z\]\+')
def assert_gpu_did_fallback(sql_text):
assert_gpu_fallback_collect(lambda spark:
and I see 0 failures with this change for -k regexp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gerashegalov
Making the datagen inside the test function makes sense to me and will fix the issue with no failures. I'm trying to explain why I think it might be a general issue when generating regexps with mk_str_gen
.
and I see 0 failures with this change for
-k regexp
.
My step is modifying the mk_str_gen
inside the regexp_test.py
file to add more special cases and run the -k regexp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we don't generate regexps with mk_str_gen
. Looking through regexp_test.py, every regexp I see is hard coded, except for test_unsupported_fallback_regexp_extract
, test_unsupported_fallback_regexp_extract_all
, and test_unsupported_fallback_regexp_replace
which do not use mk_str_gen
for the regexp. I don't see any gen_scalar* in there that would allow us to make a random scalar regular expression either. I think we should probably update those StringGens used to create the regular expressions so that they don't have any special cases in them. We don't need them to test what we are testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry my bad, You are right, we don't generate regexps with mk_str_gen.
I tested with with_special_pattern
in a wrong way.
Closed the issue and updated this fix inside the case.
Signed-off-by: Haoyang Li <[email protected]>
build |
@@ -30,7 +30,7 @@ | |||
_regexp_conf = { 'spark.rapids.sql.regexp.enabled': True } | |||
|
|||
def mk_str_gen(pattern): | |||
return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}') | |||
return StringGen(pattern).with_special_case('') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change still here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Haoyang Li <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
build |
Fix #10064
Data that make this test fail is
('ôtª+\x1e\x02$&\x80v', '[a-z]+')
, there is a$
in stringa
, which is a special character in regex replace. Java doc:This PR simply avoids generating
\
and$
in regexp_test.py to fix the test.