Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use parse_url kernel for QUERY literal and column key #10211

Merged
merged 5 commits into from
Jan 26, 2024

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Jan 18, 2024

This PR adds plugin support for QUERY parsing in parse_url (both literal key and column key).

Contributes to #8963

Depends on: NVIDIA/spark-rapids-jni#1704

Depends on: NVIDIA/spark-rapids-jni#1719

Test results is updating in jni pr, perf test results tbd.

@sameerz sameerz added the feature request New feature or request label Jan 22, 2024
url_gen = StringGen(url_pattern_with_key)
assert_gpu_fallback_collect(
lambda spark: unary_op_df(spark, url_gen)
.selectExpr("a", "parse_url(a, 'QUERY', 'a?c')", "parse_url(a, 'QUERY', '*')"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It would probably be best to separate these out into different test runs, as the project will fall back to the CPU if any of them say they are not supported instead of if all of them do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

extractLit(a.children(2)).foreach { key =>
if (key.value != null) {
val keyStr = key.value.asInstanceOf[UTF8String].toString
val specialCharacters = List("\\", "[", "]", "{", "}", "^", "-", "$",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need ( and ) in here too for the check.

@NVnavkumar and/or @andygrove is this a good enough check for a possible regexp problem? Be aware that Spark is creating the regexp with a prefix and postfix

https://github.com/apache/spark/blob/fa60a7e216e63b1edb199b1610b26197815c656b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala#L132-L133

private val REGEXPREFIX = "(&|^)"
private val REGEXSUBFIX = "=([^&]*)"

And it is trying to capture the second capture group. So it is kind of a mess.

Copy link
Collaborator

@NVnavkumar NVnavkumar Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a list of regex meta characters here:

private val regexMetaChars = ".$^[]\\|?*+(){}"

I think we should pull this out of a private val (maybe as a constant in GpuOverrides) and use it in both places.

Plus I think - could be part of a valid key to extract so there is no need to put it in a regex check here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@thirtiseven thirtiseven changed the title Use parse_url kernel for QUERY literal Use parse_url kernel for QUERY literal and column key Jan 24, 2024
@hyperbolic2346
Copy link
Collaborator

JNI dependencies are merged now

@@ -747,7 +747,7 @@ class CudfRegexTranspiler(mode: RegexMode) {
e match {
case RegexEscaped(ch) if escapeChars.contains(ch) => Some(escapeChars(ch).toString)
case RegexEscaped(ch) if regexPunct.contains(ch) => Some(ch.toString)
case RegexChar(ch) if !regexMetaChars.contains(ch) => Some(ch.toString)
case RegexChar(ch) if regexMetaChars.contains(ch) => Some(ch.toString)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the negative condition removed here? This should be negated in this part.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done.

Signed-off-by: Haoyang Li <[email protected]>
@NVnavkumar
Copy link
Collaborator

build

1 similar comment
@NVnavkumar
Copy link
Collaborator

build

@thirtiseven thirtiseven marked this pull request as ready for review January 25, 2024 20:58
@thirtiseven
Copy link
Collaborator Author

Found some more mismatched cases with df.selectExpr("*", "parse_url(url, 'QUERY', 'query') as pr") yesterday:

|bab.bg/en_page.php?id=33 |null |cpu     |
|bab.bg/en_page.php?id=33 |33   |gpu     |

Seems cpu will always return null for urls without protocol. I think it’s a small bug and we probably could let this pr merge in first and fix it next week as a bug.

@NVnavkumar help review this again thanks

@thirtiseven
Copy link
Collaborator Author

follow-up issue: NVIDIA/spark-rapids-jni#1735

@thirtiseven
Copy link
Collaborator Author

NVIDIA/spark-rapids-jni#1740 fixes the bug above, better to merge this after jni merged.

@thirtiseven
Copy link
Collaborator Author

JNI fix merged, it was not a break change and parse_url kernel is correct enough now. Merging this…

@thirtiseven thirtiseven merged commit 24001fa into NVIDIA:branch-24.02 Jan 26, 2024
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants