Use parse_url kernel for QUERY parsing #10061

Merged
merged 2 commits into from
Dec 27, 2023

@thirtiseven thirtiseven commented Dec 15, 2023

This PR adds plugin support for QUERY parsing in parse_url (returning the entire query).
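For readers unfamiliar with the two forms of `parse_url`, here is a minimal CPU-side sketch of the difference between the two-argument QUERY form (the whole query string, which is what this PR targets) and the three-argument form (one key's value). It is an illustration built on `java.net.URI`, not the plugin's kernel or Spark's exact code; the object and method names are hypothetical, and the key-matching regex is an assumption about how the lookup behaves.

```scala
import java.net.{URI, URISyntaxException}
import java.util.regex.Pattern

// Hypothetical helper, for illustration only.
object ParseUrlQuerySketch {
  // Two-argument form: the entire raw query string, or None when the URL
  // has no query component or cannot be parsed.
  def query(url: String): Option[String] =
    try Option(new URI(url).getRawQuery)
    catch { case _: URISyntaxException => None }

  // Three-argument form: the value of a single key inside the query.
  // Assumption: the key is matched at the start of the query or after '&'.
  def queryValue(url: String, key: String): Option[String] =
    query(url).flatMap { q =>
      val m = Pattern.compile("(&|^)" + Pattern.quote(key) + "=([^&]*)").matcher(q)
      if (m.find()) Some(m.group(2)) else None
    }
}
```

Under these assumptions, `ParseUrlQuerySketch.query("http://a.com?foo=1&bar=2")` yields `Some("foo=1&bar=2")`, while `ParseUrlQuerySketch.queryValue("http://a.com?foo=1&bar=2", "bar")` yields `Some("2")`.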

Contributes to #8963

Depends on: NVIDIA/spark-rapids-jni#1652

All cases from the Kaggle dataset passed.

Perf test results

Dataset: the Kaggle dataset, repeated 10 times.

spark.time(df.selectExpr("COUNT(parse_url(url, 'QUERY')) as pr1", "COUNT(parse_url(url, 'QUERY')) as pr2", "COUNT(parse_url(url, 'QUERY')) as pr3", "COUNT(parse_url(url, 'QUERY')) as pr4", "COUNT(parse_url(url, 'QUERY')) as pr5", "COUNT(parse_url(url, 'QUERY')) as pr6", "COUNT(parse_url(url, 'QUERY')) as pr7", "COUNT(parse_url(url, 'QUERY')) as pr8", "COUNT(parse_url(url, 'QUERY')) as pr9", "COUNT(parse_url(url, 'QUERY')) as pr0").show())
| GPU Time (ms) | CPU Time (ms) | Speed up |
| --- | --- | --- |
| 966 | 14,583 | 15.09x |

@sameerz sameerz added the feature request New feature or request label Dec 18, 2023
@thirtiseven thirtiseven self-assigned this Dec 19, 2023
@thirtiseven thirtiseven marked this pull request as ready for review December 19, 2023 05:15
@thirtiseven

build


@revans2 revans2 left a comment

Sorry, I did some manual testing, and this falls over badly if I include the third parameter for QUERY.

spark.time(spark.range(0, 300000000L, 1, 24).selectExpr("CONCAT('http://a.com?foo=', CAST(id AS STRING)) as h").selectExpr("SUM(length(parse_url(h, 'QUERY', 'foo'))) as q").show())
Caused by: java.lang.UnsupportedOperationException: parse_url(input[0, string, false](h#340), QUERY, foo) is not supported partToExtract=QUERY. Only PROTOCOL and HOST are supported
  at org.apache.spark.sql.rapids.GpuParseUrl.doColumnar(GpuParseUrl.scala:85)
  at org.apache.spark.sql.rapids.GpuParseUrl.$anonfun$columnarEval$5(GpuParseUrl.scala:111)
  at com.nvidia.spark.rapids.Arm$.withResourceIfAllowed(Arm.scala:74)
  at org.apache.spark.sql.rapids.GpuParseUrl.$anonfun$columnarEval$4(GpuParseUrl.scala:108)
  at com.nvidia.spark.rapids.Arm$.withResourceIfAllowed(Arm.scala:74)
  at org.apache.spark.sql.rapids.GpuParseUrl.$anonfun$columnarEval$3(GpuParseUrl.scala:107)
  at com.nvidia.spark.rapids.Arm$.withResourceIfAllowed(Arm.scala:74)

Can we please add tests that verify that we fall back to the CPU if we see the extra parameter for 'QUERY'?
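To make the requested behavior concrete, here is a minimal sketch of the kind of plan-time check that would route the three-argument call to the CPU instead of failing inside `doColumnar`. It is not the plugin's actual tagging code: the object name, method name, and reason strings are hypothetical, and the set of GPU-handled parts is an assumption taken from the error message above plus the QUERY support this PR adds.

```scala
// Hypothetical stand-in for a plan-time support check; not the plugin's API.
object ParseUrlFallbackSketch {
  // Assumption: PROTOCOL and HOST (named in the error message) plus QUERY.
  private val gpuParts: Set[String] = Set("PROTOCOL", "HOST", "QUERY")

  // Returns None when the call could stay on the GPU, or Some(reason)
  // when it should fall back to the CPU.
  def fallbackReason(part: String, key: Option[String]): Option[String] =
    if (key.isDefined) {
      // The third (key) parameter is treated as unsupported in this sketch.
      Some(s"parse_url($part) with a key argument is not handled on the GPU")
    } else if (!gpuParts.contains(part)) {
      Some(s"partToExtract=$part is not handled on the GPU")
    } else {
      None
    }
}
```

A fallback test in this spirit would assert that `fallbackReason("QUERY", Some("foo"))` is defined while `fallbackReason("QUERY", None)` is empty, so the unsupported case is caught before execution rather than surfacing as an `UnsupportedOperationException` at run time.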

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven

Done, thanks!

@revans2

revans2 commented Dec 27, 2023

build

@thirtiseven thirtiseven merged commit a9c1d68 into NVIDIA:branch-24.02 Dec 27, 2023
39 checks passed
@thirtiseven thirtiseven deleted the parse_url_query branch December 27, 2023 23:14