Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mk_str_gen in Integration test is misused to generate regular expressions #10098

Closed
thirtiseven opened this issue Dec 27, 2023 · 3 comments
Closed
Labels
invalid This doesn't seem right test Only impacts tests

Comments

@thirtiseven
Copy link
Collaborator

In Integration tests, we often use mk_str_gen to generate test strings:

def mk_str_gen(pattern):
    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')

However, it has also been widely used to generate regular expressions, which will lead to some failures after random_seed is introduced in IT, like #10064. It is because it generated some valid/invalid regular expressions, and caused unexpected results.

I wish we can have a better function or even DataGen class to generate random regexp data, which could include:

  • random literal strings
  • some random valid regular expressions
  • some random invalid regular expressions

So the test data will not randomly break regex IT and will be stronger than letting it generate regexps by chance.

Some tests:

If adding some special cases (which can be generated by '.{0,10}') in mk_str_gen:

def mk_str_gen(pattern):
    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}').with_special_case(('ab$cd', 100)).with_special_case(('ab\$cd', 100))

and run:

./integration_tests/run_pyspark_from_build.sh -k regexp

we will get:

===== 45 failed, 25 passed, 9 warnings in 67.17s (0:01:07) ======
@thirtiseven thirtiseven added ? - Needs Triage Need team to review and classify test Only impacts tests labels Dec 27, 2023
@thirtiseven
Copy link
Collaborator Author

I'm also not sure why we need this special case '.{0,10}' when generating regexps, if we don't need it, an easy way is to just remove it and use stringGen or literal directly.

@sameerz
Copy link
Collaborator

sameerz commented Dec 27, 2023

Given @revans2 comment in #10093 (comment) , it seems like we are not using mk_str_gen to generate regular expressions. Can you point out where we might actually be generating a regular expression?

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 27, 2023
@thirtiseven thirtiseven added the invalid This doesn't seem right label Dec 28, 2023
@thirtiseven
Copy link
Collaborator Author

Yes sorry, Bobby is correct. Closing it as invalid

@thirtiseven thirtiseven closed this as not planned Won't fix, can't repro, duplicate, stale Dec 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invalid This doesn't seem right test Only impacts tests
Projects
None yet
Development

No branches or pull requests

2 participants