[BUG] 0 range regexp appear to be broken #17065

Closed
revans2 opened this issue Oct 11, 2024 · 0 comments · Fixed by #17067
Labels: bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code.)

revans2 commented Oct 11, 2024

Describe the bug
I was recently testing some Spark cases and ran into failures related to 0 range patterns, that is, range quantifiers with a lower bound of 0.

  • A{0,} for replaceRegexp NON_CAPTURE
  • A{0,5} for replaceRegexp NON_CAPTURE
  • [a0-9]{0,2} for replaceRegexp NON_CAPTURE
  • (?:ab){0,3} for containsRe NON_CAPTURE

These were for the Java APIs, but it should apply to Python too. The patch #16798 appears to have caused this somehow.

The differences in replace appear to show that it no longer honors the 0 in the range some of the time. For example, replacing the pattern A{0,} with PROD for an input of 'TEST A' now produces 'TEST PROD'. Before, it would match everywhere and produce 'PRODTPRODEPRODSPRODTPROD PROD PRODPROD'. I think that is an issue for Python too, since Python's re matches at every position:

>>> import re
>>> re.sub("A{0,}","PROD","TEST A")
'PRODTPRODEPRODSPRODTPROD PRODPROD'
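
As a reference point for the replace cases listed above, here is a minimal sketch that uses only Python's standard re module, not cudf or the Java API. The helper name reference_replace and the reuse of the 'TEST A' input for all three patterns are assumptions made for illustration; only the A{0,} result comes from this report.

```python
import re

# Patterns from the replaceRegexp cases in this report.
REPLACE_PATTERNS = ["A{0,}", "A{0,5}", "[a0-9]{0,2}"]


def reference_replace(pattern, replacement, text):
    """Hypothetical helper: CPU reference result via Python's re.sub."""
    return re.sub(pattern, replacement, text)


if __name__ == "__main__":
    # Print the CPU result for each pattern so it can be compared by eye
    # against whatever the GPU path returns for the same input.
    for pattern in REPLACE_PATTERNS:
        result = reference_replace(pattern, "PROD", "TEST A")
        print(pattern, "->", repr(result))
```

On Python 3.7 or later, a lower bound of 0 lets each of these patterns match the empty string at every position, so the replacement is inserted between every character instead of only where the character class matches.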

Steps/Code to reproduce bug
The tests failing in Spark are:

FAILED ../../src/main/python/regexp_test.py::test_regexp_replace_digit[DATAGEN_SEED=1728593263, TZ=UTC] - AssertionError: GPU and CPU string values are different at [0, 'regexp_repl...
FAILED ../../src/main/python/regexp_test.py::test_re_replace_repetition[DATAGEN_SEED=1728593263, TZ=UTC] - AssertionError: GPU and CPU string values are different at [0, 'regexp_repl...
FAILED ../../src/main/python/regexp_test.py::test_regexp_memory_ok[DATAGEN_SEED=1728593263, TZ=UTC, INJECT_OOM] - AssertionError: GPU and CPU boolean values are different at [0, 'RLIKE(a, (...

The examples above are cleaned-up versions of these tests.

Expected behavior
It should behave like Python or Java regular expressions.
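
For the containsRe case, the expected behavior can be sketched the same way with Python's re.search. The helper name reference_contains and the sample inputs are hypothetical and are not taken from the failing Spark tests; the point is only that (?:ab){0,3} can match the empty string, so a contains check should succeed for any input.

```python
import re


def reference_contains(pattern, text):
    """Hypothetical helper: True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    # Made-up sample inputs, including strings without any "ab" at all.
    for text in ["", "xyz", "ab", "ababab", "TEST A"]:
        print(repr(text), reference_contains("(?:ab){0,3}", text))
```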

revans2 added the bug label on Oct 11, 2024
davidwendt self-assigned this on Oct 11, 2024
davidwendt added the libcudf label on Oct 11, 2024
rapids-bot closed this as completed in 7bcfc87 on Oct 15, 2024