
Support parse_url #8761

Closed · wants to merge 25 commits

Conversation

@thirtiseven thirtiseven commented Jul 20, 2023

Closes #6969
Closes #8512

This PR adds partial support for parse_url.

It would be difficult to follow what Java's URI library (which Spark's implementation uses) does from the plugin side, because the Java code is very long and procedural, with heavy use of mutable variables and if/else branching. So this PR tries to use regex to cover most cases instead.

It has passed all the cases I know of from the internet, Spark, and the Java URI library. Note that the implementation is test-driven, so it is quite possible that more cases produce different results between CPU and GPU, but I believe it covers most cases.

Fully matching all of Spark's behaviors would make the regex very long (see this post for the many proposed ways to validate a URL). Here is a list of things I plan to add to this PR:

  • Parsing UTF-8 special characters
  • IPv4 validation
  • IPv6 validation
  • Hostname validation
  • Better URL validation before parsing
  • Clean up test cases
  • Separate regexes for different parts
  • SPARK-44500 handling and compatibility update
  • Overall parse_url compatibility doc update

Some performance test results:

Test code:

spark-shell --master local[*]   --jars ${SPARK_RAPIDS_PLUGIN_JAR}   --conf spark.plugins=com.nvidia.spark.SQLPlugin   --conf spark.rapids.sql.enabled=true   --conf spark.rapids.sql.explain=ALL --conf spark.executor.memory=10g --conf spark.driver.memory=18g
scala> val df = spark.read.parquet("repeated_data.parquet")
scala> spark.time(df.selectExpr("url","parse_url(url, 'HOST', '')","parse_url(url, 'PATH', '')","parse_url(url, 'REF', '')","parse_url(url, 'PROTOCOL', '')","parse_url(url, 'FILE', '')","parse_url(url, 'AUTHORITY', '')","parse_url(url, 'USERINFO', '')").foreach(_ => ()))

Dataset: 651,191 URLs from Kaggle:

GPU: 314ms (5 times average)
CPU: 1147ms (5 times average)

Dataset: the same kaggle dataset, repeated 100 times:

GPU: 17402ms (5 times average)
CPU: 97978ms (5 times average)


@revans2 revans2 left a comment


Looks like a really good start.

// Should really be an OptionalParam
Some(RepeatingParamCheck("key", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING))),
(a, conf, p, r) => new ExprMeta[ParseUrl](a, conf, p, r) {
val failOnError = SQLConf.get.ansiEnabled
Collaborator:


This should be a.failOnError, because if someone somehow set just this expression to ANSI mode (or not), we don't want to change it.
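
A minimal sketch of the suggested change (mirroring the snippet above; it assumes ParseUrl exposes a failOnError field, as Spark's expression does):

```scala
// Sketch only: read the ANSI flag from the expression being replaced rather
// than from the global session config, so a per-expression setting is honored.
(a, conf, p, r) => new ExprMeta[ParseUrl](a, conf, p, r) {
  val failOnError = a.failOnError // instead of SQLConf.get.ansiEnabled
}
```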

Collaborator (Author):


Updated. I also marked failOnError as not working on the GPU for now, because it would be difficult to check whether a regex failed or simply returned null, and to throw error messages that match the CPU's.

private val FILE = "FILE"
private val AUTHORITY = "AUTHORITY"
private val USERINFO = "USERINFO"
private val REGEXPREFIX = """(&|^|\?)"""
Collaborator:


Could you add a comment about why this pattern is different from the one that Spark uses?

private val REGEXPREFIX = "(&|^)"

Because if it is there to remove the ? at the beginning of a key, wouldn't it mess up something like "http://foo/bar?a=b&?c=d"? I think this code will return a result for key c when the CPU does not.
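
A small demonstration of the concern, using java.util.regex as a stand-in for cuDF and assuming the key is spliced between the prefix and a "=([^&]*)" suffix:

```scala
import java.util.regex.Pattern

// The query string of "http://foo/bar?a=b&?c=d" as Spark extracts it:
val query = "a=b&?c=d"
val sparkPrefix  = Pattern.compile("""(&|^)c=([^&]*)""").matcher(query)
val pluginPrefix = Pattern.compile("""(&|^|\?)c=([^&]*)""").matcher(query)
println(sparkPrefix.find())  // false: the CPU returns null for key "c"
println(pluginPrefix.find()) // true: this prefix would wrongly return "d"
```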

Collaborator (Author):


Fixed.

}

private def reValid(url: ColumnVector): ColumnVector = {
// TODO: Validate the url
Collaborator:


This is a really big TODO. We should probably fall back to the CPU by default on this, at least until it is done, but it does look like a lot of validation effectively happens when we parse with the regular expression.

Collaborator (Author):


Yes, almost all validation is done when extracting.

}

private def getPattern(key: UTF8String): RegexProgram = {
val regex = REGEXPREFIX + key.toString + REGEXSUBFIX
Collaborator:


This brings up an interesting corner case. Technically the key is a regular expression: I could mess things up by putting a .* in the name of the key, and it would match. Our regular expressions are not bit-for-bit compatible with Spark's. I am not sure whether it is better to escape the regular expression and make sure we treat it as a literal value, to transpile the resulting regular expression, or to just document that this might be different.

I personally think that this is a bug, so I filed https://issues.apache.org/jira/browse/SPARK-44500 for it. I would prefer that we fix the problem in our code and document that it is different.
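
To illustrate the corner case with a hypothetical query (the "(&|^)" + key + "=([^&]*)" shape mirrors the REGEXPREFIX/REGEXSUBFIX construction above):

```scala
import java.util.regex.Pattern

val query = "a=1&b=2"
// The key ".*" is spliced in unescaped, so it behaves as a regex, not a literal:
val m = Pattern.compile("""(&|^)""" + ".*" + """=([^&]*)""").matcher(query)
println(if (m.find()) m.group(2) else null) // prints "2" even though no key is ".*"
```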

Collaborator (Author):


cuDF regex does not support \Q and \E, so we cannot use Pattern.quote directly here.

We also can't just add \ to escape all special characters, because the literal character sets in Characters (any character except [\^$.|?*+() ) and Character Classes (any character except \^-] ) are different.

I suggest creating a follow-up issue on this if you think it is ok.
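
A rough sketch of what such a follow-up might do (this is not the PR's code, and the exact metacharacter list that can be escaped safely in cuDF is an assumption): escape the key character by character instead of wrapping it in \Q...\E:

```scala
// Hypothetical helper: escape each regex metacharacter individually, since
// cuDF supports neither \Q...\E nor one uniform escape for every character.
def escapeKeyForCudf(key: String): String =
  key.flatMap { c =>
    if ("\\^$.|?*+()[]{}".indexOf(c) >= 0) "\\" + c else c.toString
  }

// escapeKeyForCudf("a.b") == "a\\.b", so the key matches only the literal "a.b"
```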

Collaborator:


#8963 should be a good enough follow-on. For now I think this is okay.


private def reValid(url: ColumnVector): ColumnVector = {
// TODO: Validate the url
val regex = """^[^ ]*$"""
Collaborator:


If all we are doing is checking whether the URL has whitespace in it, there are faster ways to do that. stringContains is the most obvious example I can think of.

Collaborator (Author):


After I fixed all the cases I know of, the only check I still want before extracting is whether the URL has whitespace in it. Updated to stringContains.
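
A minimal sketch of the stringContains version (assuming the plugin's usual withResource helper and cuDF's ColumnView.stringContains; as in the ^[^ ]*$ regex above, a URL counts as valid when it contains no space):

```scala
import ai.rapids.cudf.{ColumnVector, Scalar}

private def reValid(url: ColumnVector): ColumnVector = {
  withResource(Scalar.fromString(" ")) { blank =>
    withResource(url.stringContains(blank)) { hasBlank =>
      hasBlank.not() // valid when the URL contains no space
    }
  }
}
```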

}

private def reMatch(url: ColumnVector, partToExtract: String): ColumnVector = {
val regex = """^(([^:/?#]+):)(//((([^:]*:?[^\@]*)\@)?(\[[0-9A-Za-z%.:]*\]|[^/?#:]*)""" +
Collaborator:


From a performance standpoint, capturing all of the groups to turn them into a table and then selecting only one of them is a real performance problem. You don't have to do it now, but we should at least file a follow-on issue to build up the regular expression so that we capture only the group that matters. It would also allow us to document the different parts of the pattern and explain how they are used.

Collaborator (Author):


Done.
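
For reference, the single-capture approach looks roughly like this (a sketch with an illustrative HOST pattern, not the PR's actual regex): cuDF's extractRe returns one column per capture group, so making every other group non-capturing with (?:...) yields a single-column table.

```scala
import ai.rapids.cudf.{ColumnVector, RegexProgram}

private def extractHostOnly(url: ColumnVector): ColumnVector = {
  // Only the host portion is a capturing group; everything else is (?:...).
  val hostRegex = """^(?:[^:/?#]+)://(?:(?:[^:@]*(?::[^@]*)?)@)?([^/?#:]*).*$"""
  withResource(url.extractRe(new RegexProgram(hostRegex))) { tbl =>
    tbl.getColumn(0).incRefCount()
  }
}
```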

integration_tests/src/main/python/url_test.py (thread resolved)
@sameerz added the feature request label on Aug 9, 2023
// 2001:db8:3:4::192.0.2.33 64:ff9b::192.0.2.33 (IPv4-Embedded IPv6 Address)
val ipv6Regex13 = """(0:0:0:0:0:(0|FFFF|ffff):((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))"""
// 0:0:0:0:0:0:13.1.68.3
// scalastyle:on

@thirtiseven thirtiseven Aug 9, 2023


The regex part contains many workarounds for cuDF compatibility issues and (maybe) bugs. I'll try to collect these cases later to see whether they are bugs.

Collaborator:


I filed #8963 instead. Really, this regexp is so long that there is no way the cuDF regexp parser is going to run it well. We really should just have a custom kernel for all of this.

@thirtiseven thirtiseven self-assigned this Aug 9, 2023
@thirtiseven thirtiseven marked this pull request as ready for review August 9, 2023 11:37
@thirtiseven changed the title from "WIP: Support parse_url" to "Support parse_url" on Aug 9, 2023

@revans2 revans2 left a comment


My main concerns with the current implementation are around scale and performance. The testing appears to be fairly complete, although it might be nice to have a fuzz tester to help validate it more.



@revans2 revans2 left a comment


Sorry, I did some of my own testing (I am not done with it yet), but I don't think this is doing the right thing in enough cases for me to approve it at this time. I still need to run more performance tests too.


revans2 commented Aug 9, 2023

The following are 749 URLs that I found that do not parse correctly. They are ones where the CPU returned a non-null value for at least one of the URL fields but the GPU got it wrong. bad_urls.csv

But be careful: these came from some malicious-URL datasets, so don't click on any of them!

For example, with grbj.lyjq.org:8989/server.exe the GPU thinks that the path is 8989/server.exe, but the CPU sees it as null.

Or the second one, http://3bromley.boys-brigade.org.uk/index.php?action=viewday&time=1172577600&module=calendarmodule&src=@random43f8a0594ead8: the GPU thinks that the path is an empty string, but the CPU sees it as /index.php. I don't know if we can ship with this, especially not enabled by default.

There are also a lot of problems with performance. The dataset I used, of about 600,000 URLs, took twice as long on my GPU as on a 12-core CPU that is getting rather old now: 4 seconds vs 2 seconds. I would be willing to look past that if we got the correct answers, but we do not, so I don't feel good about checking this in as is.

@thirtiseven thirtiseven changed the base branch from branch-23.08 to branch-23.10 August 11, 2023 01:26
// scalastyle:off line.size.limit
import session.sqlContext.implicits._
Seq[String](
"grbj.lyjq.org:8989/server.exe",
Collaborator:


Because many of these URLs are malicious, I am not 100% sure we want to check them in. @sameerz, are you okay with these being checked in? I could try to go back and keep only the non-malicious URLs in the datasets I found and see if that still reproduces the problems. Another option would be to check these in as a Parquet file so at least they are not in plain text.

Collaborator (Author):


OK, I've figured out why these cases fail. I can generate some random URLs that reproduce the problems for the final version.

Collaborator:


@sameerz are you okay with these being checked in?

No, do not check in the malicious urls.

@thirtiseven thirtiseven marked this pull request as draft August 11, 2023 14:24

sameerz commented Aug 13, 2023

Can we use a standards document like https://datatracker.ietf.org/doc/html/rfc3986#page-49 to help us create test cases for valid or invalid URLs?

@thirtiseven

@sameerz Thanks! It's very helpful for generating valid test cases and checking whether we've considered everything. But there are many possible invalid URLs and edge cases, so I think we still need to rely on testing against many real-world URLs.


revans2 commented Aug 14, 2023

I did another look for more URL datasets, this time looking for a crawler dataset. I found https://commoncrawl.org/the-data/, so that might be an interesting alternative source of URLs. Another alternative could be to look for unit/integration tests from other URL/URI parsers with a permissive license that we could adopt.

@thirtiseven

@revans2 Thank you for the testing and data. These failed test cases fall into two types:

  • Cases containing @ in some part other than userinfo. Fixed.
  • Cases with a port but no protocol. These are actually invalid URLs, but Spark parses them anyway and treats them as opaque URIs (see the sketch below). I have mostly matched this behavior and noted the compatibility differences in the doc.
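
A quick java.net.URI illustration of that second bullet, using the first failing URL from the list above:

```scala
import java.net.URI

val u = new URI("grbj.lyjq.org:8989/server.exe")
println(u.isOpaque)              // true: "grbj.lyjq.org" parses as the scheme
println(u.getHost)               // null, which is why the CPU returns null
println(u.getSchemeSpecificPart) // "8989/server.exe"
```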

I also ran some performance tests; the results are added to the PR description. It seems to use a lot of memory, and I hit some OOMs during testing. But after increasing memory, the GPU version runs much faster than the CPU on my dev machine.


revans2 commented Aug 14, 2023

I also did some performance tests, results added in PR description. It seems it really cost much memory, I met some OOM during testing. But after increasing memory, GPU version runs much faster than CPU on my dev machine.

I would love to see the test code that you used, because I see the opposite. The GPU is faster than the CPU for a single core, but not for multiple cores.


thirtiseven commented Aug 14, 2023

I also did some performance tests, results added in PR description. It seems it really cost much memory, I met some OOM during testing. But after increasing memory, GPU version runs much faster than CPU on my dev machine.

I would love to see the test code that you used, because I see the opposite. The GPU is faster than the CPU for a single core, but not for multiple cores.

It's in the PR description. I think the difference is that I increased some memory configs. I can also reproduce the result that the CPU is faster than the GPU with the default memory config.


revans2 commented Aug 14, 2023

@thirtiseven could you please send me the memory configs that you changed? I am still not able to reproduce your results, at least not fully.

I ran the query in the PR description, but there were several problems with it.

First, all of the parse_url calls included the third parameter. This resulted in a null being returned for all of the URLs, which bypassed the regular expressions and made it much faster than not including the third parameter. Second, the data was skewed (a single row group), so all of the data was processed by a single task instead of being spread among all the tasks. Third, the foreach call causes the data to be copied to the host, which has some overhead, though it turns out not that much. Finally, the numbers you have for 100x vs 1x look really off: how is it that 100x the data takes 300 ms, but 1x the data takes 17,500 ms?

I fixed some of this by using 100x the data spread evenly across 12 files, not including the third parameter, and doing a COUNT on the results instead of a foreach.

spark.time(df.selectExpr("COUNT(parse_url(url, 'HOST')) as h","COUNT(parse_url(url, 'PATH')) as p","COUNT(parse_url(url, 'REF')) as r","COUNT(parse_url(url, 'PROTOCOL')) as pr","COUNT(parse_url(url, 'FILE')) as f","COUNT(parse_url(url, 'AUTHORITY')) as a","COUNT(parse_url(url, 'USERINFO')) as u").show())
| op | CPU (ms) | GPU (ms) | GPU fallback (ms) | notes |
| --- | --- | --- | --- | --- |
| 1x parquet foreach => nothing | 2,180 | 369 | 2,055 | data skew |
| 1x parquet + count | 1,774 | 207 | 1,731 | data skew |
| 1x parquet foreach => nothing, no third param | 4,543 | 1,909 | 4,480 | data skew |
| 1x parquet + count, no third param | 4,211 | 1,453 | 4,213 | data skew |
| 100x parquet foreach => nothing, no third param | 62,555 | 101,175 | 62,435 | |
| 100x parquet + count, no third param | 59,644 | 90,152 | 58,570 | |

For these I calculated the median of 3 to 5 runs, depending on how long they took to run. The GPU runs for the 100x data had a lot of variability between runs because my GPU would overheat.

What is more, I see results that differ between the CPU and the GPU. I don't know whether these correspond to the documented changes or not; I have not dug into them yet. But on the surface, the fact that the userinfo count is different for at least one URL in the examples provided is not great. (This is the 100x data case for the count, so divide by 100 to get the regular-data numbers.)

CPU:
+--------+--------+-----+--------+--------+--------+---+                        
|       h|       p|    r|      pr|       f|       a|  u|
+--------+--------+-----+--------+--------+--------+---+
|18694100|64613600|43300|18712200|64613600|18695700|300|
+--------+--------+-----+--------+--------+--------+---+

GPU:
+--------+--------+-----+--------+--------+--------+---+                        
|       h|       p|    r|      pr|       f|       a|  u|
+--------+--------+-----+--------+--------+--------+---+
|18793900|65091000|44200|18814700|65091000|18795600|400|
+--------+--------+-----+--------+--------+--------+---+


thirtiseven commented Aug 15, 2023

@thirtiseven could you please send me the memory configs that you changed, because I am still not able to reproduce your results. At least not fully.

I used --conf spark.executor.memory=10g --conf spark.driver.memory=18g when starting the spark-shell.

Finally the numbers you have for 100x vs 1x look really off. How is it that 100x the data is 300ms, but 17,500 ms for 1x the data?

Sorry, the results are reversed.

Thanks a lot for the testing; I'm checking the issues you mentioned.

Update: I can reproduce your results now; the GPU really is slower on the big dataset. I also found more failing cases and am working on solving them.

+--------+--------+-----+--------+--------+--------+---+
|       h|       p|    r|      pr|       f|       a|  u|
+--------+--------+-----+--------+--------+--------+---+
|18750600|64690300|44100|18770700|64690300|18752300|300|
+--------+--------+-----+--------+--------+--------+---+

@thirtiseven

I have matched almost all cases (12 of 651,191 left) in the Kaggle dataset and recorded the unmatched ones in the compatibility doc.

I think it is possible to get good enough correctness here, although a regex solution won't be faster than the CPU. We would need to rewrite it in JNI to get good performance.

Will we have a chance to merge this PR if it covers enough cases but is slower than the CPU? @revans2

}

private def reValid(url: ColumnVector): ColumnVector = {
val regex = """([^\s]*\s|([^[]*|[^[]*\[.*].*)%([^0-9a-fA-F]|[0-9a-fA-F][^0-9a-fA-F]|$))"""
Collaborator (Author):


It is used to check that each % escape is followed by two valid hex digits, but doing another regex pass takes much more time. We may want to consider not supporting this check, to improve performance.
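
To make the check concrete, the %-escape branch alone behaves like this (demonstrated with java.util.regex as a stand-in for cuDF; the whitespace branch and the bracketed-IPv6 carve-out from the full pattern above are omitted):

```scala
import java.util.regex.Pattern

// Flags a '%' that is not followed by two hex digits, or that ends the string.
val badEscape = Pattern.compile("""%([^0-9a-fA-F]|[0-9a-fA-F][^0-9a-fA-F]|$)""")

println(badEscape.matcher("http://a/%zz").find()) // true: invalid escape
println(badEscape.matcher("http://a/%").find())   // true: dangling '%'
println(badEscape.matcher("http://a/%20").find()) // false: %20 is valid
```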


revans2 commented Aug 16, 2023

Will we have chance to merge this pr in if it can cover enough cases but slower than cpu?

I have a really hard time merging this in if it is slower than the CPU. The differences are documented, and the code works a lot better than I expected a large regular expression to work. But something that is 2x slower than the CPU is not worth it, in my opinion. If there were a clear use case where having this wins, I would reconsider. Instead, I think we should put this PR on hold until we get a spark-rapids-jni custom kernel for it.
