
[FEA] Better scaling for simple regular expressions on long strings #14087

Open

revans2 opened this issue Sep 12, 2023 · 2 comments

Labels: 0 - Backlog, feature request, Performance, Spark, strings

Comments

revans2 (Contributor) commented Sep 12, 2023

Is your feature request related to a problem? Please describe.
In Spark we have had multiple customers who try to process really long strings with simple regular expressions. In most cases they don't actually need a regular expression, but the API/expression Spark exposes takes one, so they use it. An example of this is string split, where the separator is a regular expression. Customers will split on things like a comma `,`, or try to parse JSON-like formatted strings by splitting on `}` or `}}` sequences, but they do this on very large strings, some over 120KiB in size. When this happens we see really bad performance on the GPU, even worse than single-threaded CPU performance on the same data.

Here is an example where we are essentially doing an `explode(split(column_name, "}}"))` on 500 rows. It is 500 rows because the length of the strings involved makes that about one row group in Parquet, so this is the data that a single Spark task sees.

| String Length | GPU Median Time | CPU Median Time | Hacked GPU Median Time |
| ------------: | --------------: | --------------: | ---------------------: |
| 10            | 316             | 249             | 395                    |
| 100           | 321             | 298             | 371                    |
| 1,000         | 333             | 753             | 377                    |
| 10,000        | 1,352           | 5,401           | 407                    |
| 20,000        | 4,914           | 10,630          | 459                    |
| 40,000        | 15,810          | 21,261          | 564                    |
| 80,000        | 66,409          | 43,487          | 773                    |
| 100,000       | 111,781         | 54,989          | 902                    |
| 120,000       | 134,409         | 66,066          | 1,035                  |
| 140,000       | 212,333         | 76,900          | 1,232                  |

[chart: median time vs. string length for the GPU, CPU, and hacked-GPU columns above]

In this table the "Hacked GPU Median Time" column shows what happened when I hacked the Spark plugin to ignore the regular expression and instead use the non-regular-expression CUDF API to split the string.
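The "hack" amounts to replacing a regex split with a plain substring split, which is valid because `}}` contains no regex metacharacters that change its meaning. A minimal sketch of the equivalence, using Python's `re` module and a hypothetical sample row as stand-ins for the CUDF kernels:

```python
import re

# A row like the JSON-ish strings described above (hypothetical sample data).
row = '{"a": 1}}{"b": 2}}{"c": 3}}'

# What Spark's split(column, "}}") asks for: a regex split on the pattern.
regex_parts = re.split(r"\}\}", row)

# What the hacked path does instead: a plain substring split on the
# literal "}}", which needs no regex engine at all.
literal_parts = row.split("}}")

# For a pure-literal pattern the two produce identical results.
assert regex_parts == literal_parts
print(literal_parts)  # ['{"a": 1', '{"b": 2', '{"c": 3', '']
```

The answer is the same either way; only the execution strategy differs, which is why the caller never needs to know which path ran.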

Describe the solution you'd like
In the RAPIDS plugin we have put in place a number of optimizations where we parse the regular expression and, if possible, transpile it to a plain string that we can do a non-regex split on. We think it is worth pushing this type of optimization into CUDF itself, and not just for splits. It would be really nice if CUDF could look for alternative ways to execute a regular expression, especially for really long strings, that don't need a single thread per string to work properly.

Examples include:

  • Seeing if a regular expression can be transpiled to a static string and using an alternative string operation instead. This could apply to split, matches, etc.
  • When just checking if a regular expression matches a string, we could recognize patterns like `FOO.*` and convert them into a starts-with operation instead. The same idea applies to contains or ends-with.
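Both bullets can be sketched as a tiny pattern classifier. This is a hypothetical illustration, not CUDF's transpiler: it treats lone braces as literals (as the `}}` example requires, and as most engines do), and a real implementation would also have to handle escapes, character classes, and quantifier syntax like `a{2,3}`.

```python
# Metacharacters whose presence means the pattern is not a plain literal.
# Braces are deliberately excluded here; see the caveat above.
_META = set(".^$*+?[]|()\\")

def as_literal(pattern: str):
    """Return the literal string the regex matches, or None if the
    pattern uses any regex metacharacter (hypothetical helper)."""
    if any(ch in _META for ch in pattern):
        return None
    return pattern

def as_prefix(pattern: str):
    """Recognize patterns of the form LITERAL.* and return the literal
    prefix, so a full match can become a starts-with test."""
    if pattern.endswith(".*"):
        return as_literal(pattern[:-2])
    return None

print(as_literal("}}"))     # '}}'   -> use a plain substring split
print(as_prefix("FOO.*"))   # 'FOO'  -> use a starts-with check
print(as_literal("a|b"))    # None   -> fall back to the regex engine
```

When either helper returns a string, the caller can dispatch to a cheap string operation; when both return None, it falls back to the regex engine, so the optimization is purely internal.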

I would like this in CUDF because I think it would benefit everyone, not just the Spark plugin. I also think that in many cases the RAPIDS team could do a better job of finding these optimizations than we are doing.

Describe alternatives you've considered
Update our own regular expression checker code to start doing more of these optimizations.

@revans2 added the feature request, Needs Triage, Performance, Spark, and strings labels on Sep 12, 2023
GregoryKimball (Contributor) commented Sep 18, 2023

Thank you @revans2 for raising this issue. It's true that the fastest regex accelerator is often the one that doesn't use regex at all.

> I would like it in CUDF because I think it would benefit everyone, not just the Spark plugin, but also I think the RAPIDS team could do a better job in many cases of finding these optimizations than we are doing.

We'll have to think more about the design of this. It doesn't seem like a great solution for libcudf to expose an option in the regex engine that tries to avoid the regex engine. The best idea I have for now is that the regex-to-strings-API transformation could happen in a JIT submodule.

revans2 (Contributor, Author) commented Sep 18, 2023

> It doesn't seem like a great solution for libcudf to expose an option in the regex engine to try and avoid the regex engine.

I don't see this as an option; I see it as an optimization. Why would an end user ever know or care that, when they asked to split a string on the regular expression `"}}"`, CUDF called one kernel instead of another and got the answer back much faster? We already do this kind of thing for long strings, where we have separate kernels and look at the average length of the strings to decide which one to call. How is this any different?
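The dispatch pattern described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not libcudf's actual logic; the threshold value, function names, and kernel labels are all made up for the sketch, and the literal check reuses the same crude metacharacter test as before.

```python
LONG_STRING_THRESHOLD = 256  # hypothetical cutoff, not libcudf's real number

def _is_pure_literal(pattern: str) -> bool:
    # Crude check: no regex metacharacters (braces treated as literal, as
    # in the "}}" example; a real check must handle quantifiers/escapes).
    return not any(ch in set(".^$*+?[]|()\\") for ch in pattern)

def choose_split_kernel(strings, pattern):
    """Pick an execution strategy from the inputs alone; the caller never
    sees which strategy ran, only the (identical) answer."""
    if _is_pure_literal(pattern):
        return "literal-split"              # skip the regex engine entirely
    avg_len = sum(len(s) for s in strings) / max(len(strings), 1)
    if avg_len > LONG_STRING_THRESHOLD:
        return "regex-long-strings-kernel"  # one kernel for long strings
    return "regex-kernel"                   # default regex kernel

print(choose_split_kernel(["a}}b"] * 4, "}}"))   # literal-split
print(choose_split_kernel(["x" * 1000], "a|b"))  # regex-long-strings-kernel
```

The point of the sketch is that the selection is driven entirely by the inputs, just like the existing average-length kernel selection, so from the API's perspective nothing new is exposed.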

@GregoryKimball added the 0 - Backlog label and removed the Needs Triage label on Sep 27, 2023