-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use parse_url kernel for QUERY literal and column key #10211
Use parse_url kernel for QUERY literal and column key #10211
Conversation
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
url_gen = StringGen(url_pattern_with_key) | ||
assert_gpu_fallback_collect( | ||
lambda spark: unary_op_df(spark, url_gen) | ||
.selectExpr("a", "parse_url(a, 'QUERY', 'a?c')", "parse_url(a, 'QUERY', '*')"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It would probably be best to separate these out into different test runs, as the project will fall back to the CPU if any of them say they are not supported instead of if all of them do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
extractLit(a.children(2)).foreach { key => | ||
if (key.value != null) { | ||
val keyStr = key.value.asInstanceOf[UTF8String].toString | ||
val specialCharacters = List("\\", "[", "]", "{", "}", "^", "-", "$", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need (
and )
in here too for the check.
@NVnavkumar and/or @andygrove is this a good enough check for a possible regexp problem? Be aware that Spark is creating the regexp with a prefix and postfix
private val REGEXPREFIX = "(&|^)"
private val REGEXSUBFIX = "=([^&]*)"
And it is trying to capture the second capture group. So it is kind of a mess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a list of regex meta characters here:
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
Line 687 in 1acc8e0
private val regexMetaChars = ".$^[]\\|?*+(){}" |
I think we should pull this out of a private val (maybe as a constant in GpuOverrides) and use it in both places.
Plus I think -
could be part of a valid key to extract so there is no need to put it in a regex check here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Signed-off-by: Haoyang Li <[email protected]>
JNI dependencies are merged now |
@@ -747,7 +747,7 @@ class CudfRegexTranspiler(mode: RegexMode) { | |||
e match { | |||
case RegexEscaped(ch) if escapeChars.contains(ch) => Some(escapeChars(ch).toString) | |||
case RegexEscaped(ch) if regexPunct.contains(ch) => Some(ch.toString) | |||
case RegexChar(ch) if !regexMetaChars.contains(ch) => Some(ch.toString) | |||
case RegexChar(ch) if regexMetaChars.contains(ch) => Some(ch.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the negative condition removed here? This should be negated in this part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done.
Signed-off-by: Haoyang Li <[email protected]>
build |
1 similar comment
build |
Found some more mismatched cases with
Seems cpu will always return null for urls without protocol. I think it’s a small bug and we probably could let this pr merge in first and fix it next week as a bug. @NVnavkumar help review this again thanks |
follow-up issue: NVIDIA/spark-rapids-jni#1735 |
NVIDIA/spark-rapids-jni#1740 fixes the bug above, better to merge this after jni merged. |
JNI fix merged, it was not a break change and parse_url kernel is correct enough now. Merging this… |
This PR adds plugin support for QUERY parsing in parse_url (both literal key and column key).
Contributes to #8963
Depends on: NVIDIA/spark-rapids-jni#1704
Depends on: NVIDIA/spark-rapids-jni#1719
Test results is updating in jni pr, perf test results tbd.