
Support parse_url #8761

Closed · wants to merge 25 commits

Conversation

@thirtiseven thirtiseven commented Jul 20, 2023

Closes #6969
Closes #8512

This PR adds partial support for parse_url.

It would be difficult to follow what Java's URI library (which Spark's implementation uses) does from the plugin side, because the Java code is very long and procedural, with heavy use of mutable variables and if/else branching. So this PR tries to use regex to cover most cases instead.

It has passed all the cases I know of from the internet, Spark, and the Java URI library. Note that the implementation is test-driven, so it is quite possible that more cases produce different results between CPU and GPU, but I believe it covers most cases.

Fully matching all of Spark's behaviors would make the regex very long (see this post for the many proposed ways to validate a URL). Here is a list of things I plan to add to this PR:

  • Parsing UTF-8 special characters
  • IPv4 validation
  • IPv6 validation
  • Hostname validation
  • Better URL validation before parsing
  • Clean up test cases
  • Separate regexes for different parts
  • SPARK-44500 handling and compatibility update
  • Overall parse_url compatibility doc update

Some performance test results:

Test code:

spark-shell --master local[*]   --jars ${SPARK_RAPIDS_PLUGIN_JAR}   --conf spark.plugins=com.nvidia.spark.SQLPlugin   --conf spark.rapids.sql.enabled=true   --conf spark.rapids.sql.explain=ALL --conf spark.executor.memory=10g --conf spark.driver.memory=18g
scala> val df = spark.read.parquet("repeated_data.parquet")
scala> spark.time(df.selectExpr("url","parse_url(url, 'HOST', '')","parse_url(url, 'PATH', '')","parse_url(url, 'REF', '')","parse_url(url, 'PROTOCOL', '')","parse_url(url, 'FILE', '')","parse_url(url, 'AUTHORITY', '')","parse_url(url, 'USERINFO', '')").foreach(_ => ()))

Dataset: 651,191 URLs from Kaggle:

GPU: 314ms (5 times average)
CPU: 1147ms (5 times average)

Dataset: the same kaggle dataset, repeated 100 times:

GPU: 17402ms (5 times average)
CPU: 97978ms (5 times average)


@revans2 revans2 left a comment


Looks like a really good start.

// Should really be an OptionalParam
Some(RepeatingParamCheck("key", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING))),
(a, conf, p, r) => new ExprMeta[ParseUrl](a, conf, p, r) {
val failOnError = SQLConf.get.ansiEnabled
Collaborator:


This should be a.failOnError, because if someone somehow set just this expression to ANSI mode (or not), we don't want to change it.
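
A minimal sketch of the suggested change (mirroring the snippet above; it assumes ParseUrl exposes a failOnError field, as Spark's expression does):

```scala
// Sketch only: read the ANSI flag from the expression being replaced rather
// than from the global session config, so a per-expression setting is honored.
(a, conf, p, r) => new ExprMeta[ParseUrl](a, conf, p, r) {
  val failOnError = a.failOnError // instead of SQLConf.get.ansiEnabled
}
```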

Collaborator (Author):


Updated. I also marked failOnError as not working on the GPU for now, because it would be difficult to check whether a regex failed or simply returned null, and to throw error messages that match the CPU's.

private val FILE = "FILE"
private val AUTHORITY = "AUTHORITY"
private val USERINFO = "USERINFO"
private val REGEXPREFIX = """(&|^|\?)"""
Collaborator:


Could you add a comment about why this pattern is different from the one that Spark uses?

private val REGEXPREFIX = "(&|^)"

Because if it is there to remove the ? at the beginning of a key, wouldn't it mess up something like "http://foo/bar?a=b&?c=d"? I think this code will return a result for key c when the CPU does not.
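
A small demonstration of the concern, using java.util.regex as a stand-in for cuDF and assuming the key is spliced between the prefix and a "=([^&]*)" suffix:

```scala
import java.util.regex.Pattern

// The query string of "http://foo/bar?a=b&?c=d" as Spark extracts it:
val query = "a=b&?c=d"
val sparkPrefix  = Pattern.compile("""(&|^)c=([^&]*)""").matcher(query)
val pluginPrefix = Pattern.compile("""(&|^|\?)c=([^&]*)""").matcher(query)
println(sparkPrefix.find())  // false: the CPU returns null for key "c"
println(pluginPrefix.find()) // true: this prefix would wrongly return "d"
```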

Collaborator (Author):


Fixed.

}

private def reValid(url: ColumnVector): ColumnVector = {
// TODO: Validate the url
Collaborator:


This is a really big TODO. We should probably fall back to the CPU by default on this, at least until it is done, but it does look like a lot of validation effectively happens when we parse with the regular expression.

Collaborator (Author):


Yes, almost all validation is done when extracting.

}

private def getPattern(key: UTF8String): RegexProgram = {
val regex = REGEXPREFIX + key.toString + REGEXSUBFIX
Collaborator:


This brings up an interesting corner case. Technically the key is a regular expression: I could mess things up by putting a .* in the name of the key, and it would match. Our regular expressions are not bit-for-bit compatible with Spark's. I am not sure whether it is better to escape the regular expression and make sure we treat it as a literal value, to transpile the resulting regular expression, or to just document that this might be different.

I personally think that this is a bug, so I filed https://issues.apache.org/jira/browse/SPARK-44500 for it. I would prefer that we fix the problem in our code and document that it is different.
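
To illustrate the corner case with a hypothetical query (the "(&|^)" + key + "=([^&]*)" shape mirrors the REGEXPREFIX/REGEXSUBFIX construction above):

```scala
import java.util.regex.Pattern

val query = "a=1&b=2"
// The key ".*" is spliced in unescaped, so it behaves as a regex, not a literal:
val m = Pattern.compile("""(&|^)""" + ".*" + """=([^&]*)""").matcher(query)
println(if (m.find()) m.group(2) else null) // prints "2" even though no key is ".*"
```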

Collaborator (Author):


cuDF regex does not support \Q and \E, so we cannot use Pattern.quote directly here.

We also can't just add \ to escape all special characters, because the literal character sets in Characters (any character except [\^$.|?*+() ) and Character Classes (any character except \^-] ) are different.

I suggest creating a follow-up issue on this if you think it is ok.
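
A rough sketch of what such a follow-up might do (this is not the PR's code, and the exact metacharacter list that can be escaped safely in cuDF is an assumption): escape the key character by character instead of wrapping it in \Q...\E:

```scala
// Hypothetical helper: escape each regex metacharacter individually, since
// cuDF supports neither \Q...\E nor one uniform escape for every character.
def escapeKeyForCudf(key: String): String =
  key.flatMap { c =>
    if ("\\^$.|?*+()[]{}".indexOf(c) >= 0) "\\" + c else c.toString
  }

// escapeKeyForCudf("a.b") == "a\\.b", so the key matches only the literal "a.b"
```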

Collaborator:


#8963 should be a good enough follow-on. For now I think this is okay.


private def reValid(url: ColumnVector): ColumnVector = {
// TODO: Validate the url
val regex = """^[^ ]*$"""
Collaborator:


If all we are doing is checking whether the URL has whitespace in it, there are faster ways to do that. stringContains is the most obvious example I can think of.

Collaborator (Author):


After I fixed all the cases I know of, the only check I still want before extracting is whether the URL has whitespace in it. Updated to stringContains.
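
A minimal sketch of the stringContains version (assuming the plugin's usual withResource helper and cuDF's ColumnView.stringContains; as in the ^[^ ]*$ regex above, a URL counts as valid when it contains no space):

```scala
import ai.rapids.cudf.{ColumnVector, Scalar}

private def reValid(url: ColumnVector): ColumnVector = {
  withResource(Scalar.fromString(" ")) { blank =>
    withResource(url.stringContains(blank)) { hasBlank =>
      hasBlank.not() // valid when the URL contains no space
    }
  }
}
```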

}

private def reMatch(url: ColumnVector, partToExtract: String): ColumnVector = {
val regex = """^(([^:/?#]+):)(//((([^:]*:?[^\@]*)\@)?(\[[0-9A-Za-z%.:]*\]|[^/?#:]*)""" +
Collaborator:


From a performance standpoint, capturing all of the groups to turn them into a table and then selecting only one of them is a real performance problem. You don't have to do it now, but we should at least file a follow-on issue to build up the regular expression so that we capture only the group that matters. It would also allow us to document the different parts of the pattern and explain how they are used.

Collaborator (Author):


Done.
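
For reference, the single-capture approach looks roughly like this (a sketch with an illustrative HOST pattern, not the PR's actual regex): cuDF's extractRe returns one column per capture group, so making every other group non-capturing with (?:...) yields a single-column table.

```scala
import ai.rapids.cudf.{ColumnVector, RegexProgram}

private def extractHostOnly(url: ColumnVector): ColumnVector = {
  // Only the host portion is a capturing group; everything else is (?:...).
  val hostRegex = """^(?:[^:/?#]+)://(?:(?:[^:@]*(?::[^@]*)?)@)?([^/?#:]*).*$"""
  withResource(url.extractRe(new RegexProgram(hostRegex))) { tbl =>
    tbl.getColumn(0).incRefCount()
  }
}
```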

integration_tests/src/main/python/url_test.py (thread resolved)
@sameerz added the feature request label on Aug 9, 2023
// 2001:db8:3:4::192.0.2.33 64:ff9b::192.0.2.33 (IPv4-Embedded IPv6 Address)
val ipv6Regex13 = """(0:0:0:0:0:(0|FFFF|ffff):((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))"""
// 0:0:0:0:0:0:13.1.68.3
// scalastyle:on

@thirtiseven thirtiseven Aug 9, 2023


The regex part contains many workarounds for cuDF compatibility issues and (maybe) bugs. I'll try to collect these cases later to see whether they are bugs.

Collaborator:


I filed #8963 instead. Really, this regexp is so long that there is no way the cuDF regexp parser is going to run it well. We really should just have a custom kernel for all of this.

@thirtiseven thirtiseven self-assigned this Aug 9, 2023
@thirtiseven thirtiseven marked this pull request as ready for review August 9, 2023 11:37
@thirtiseven changed the title from "WIP: Support parse_url" to "Support parse_url" on Aug 9, 2023

@revans2 revans2 left a comment


My main concerns with the current implementation are around scale and performance. The testing appears to be fairly complete, although it might be nice to have a fuzz tester to help validate it more.



@revans2 revans2 left a comment


Sorry, I did some of my own testing (I am not done with it yet), but I don't think this is doing the right thing in enough cases for me to approve it at this time. I still need to run more performance tests too.


revans2 commented Aug 9, 2023

The following are 749 URLs that I found that do not parse correctly. They are ones where the CPU returned a non-null value for at least one of the URL fields but the GPU got it wrong. bad_urls.csv

But be careful: these came from some malicious-URL datasets, so don't click on any of them!

For example, with grbj.lyjq.org:8989/server.exe the GPU thinks that the path is 8989/server.exe, but the CPU sees it as null.

Or the second one, http://3bromley.boys-brigade.org.uk/index.php?action=viewday&time=1172577600&module=calendarmodule&src=@random43f8a0594ead8: the GPU thinks that the path is an empty string, but the CPU sees it as /index.php. I don't know if we can ship with this, especially not enabled by default.

There are also a lot of problems with performance. The dataset I used, of about 600,000 URLs, took twice as long on my GPU as on a 12-core CPU that is getting rather old now: 4 seconds vs 2 seconds. I would be willing to look past that if we got the correct answers, but we do not, so I don't feel good about checking this in as is.

@thirtiseven thirtiseven changed the base branch from branch-23.08 to branch-23.10 August 11, 2023 01:26
// scalastyle:off line.size.limit
import session.sqlContext.implicits._
Seq[String](
"grbj.lyjq.org:8989/server.exe",
Collaborator:


Because many of these URLs are malicious, I am not 100% sure we want to check them in. @sameerz, are you okay with these being checked in? I could try to go back and keep only the non-malicious URLs in the datasets I found and see if that still reproduces the problems. Another option would be to check these in as a Parquet file so at least they are not in plain text.

Collaborator (Author):


OK, I've figured out why these cases fail. I can generate some random URLs that reproduce the problems for the final version.

Collaborator:


@sameerz are you okay with these being checked in?

No, do not check in the malicious urls.

@thirtiseven thirtiseven marked this pull request as draft August 11, 2023 14:24

sameerz commented Aug 13, 2023

Can we use a standards document like https://datatracker.ietf.org/doc/html/rfc3986#page-49 to help us create test cases for valid or invalid URLs?

@thirtiseven

@sameerz Thanks! It's very helpful for generating valid test cases and checking whether we've considered everything. But there are many possible invalid URLs and edge cases, so I think we still need to rely on testing against many real-world URLs.


revans2 commented Aug 14, 2023

I did another look for more URL datasets, this time looking for a crawler dataset. I found https://commoncrawl.org/the-data/, so that might be an interesting alternative source of URLs. Another alternative could be to look for unit/integration tests from other URL/URI parsers with a permissive license that we could adopt.

@thirtiseven

@revans2 Thank you for the testing and data. These failed test cases fall into two types:

  • Cases containing @ in some part other than userinfo. Fixed.
  • Cases with a port but no protocol. These are actually invalid URLs, but Spark parses them anyway and treats them as opaque URIs (see the sketch below). I have mostly matched this behavior and noted the compatibility differences in the doc.
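
A quick java.net.URI illustration of that second bullet, using the first failing URL from the list above:

```scala
import java.net.URI

val u = new URI("grbj.lyjq.org:8989/server.exe")
println(u.isOpaque)              // true: "grbj.lyjq.org" parses as the scheme
println(u.getHost)               // null, which is why the CPU returns null
println(u.getSchemeSpecificPart) // "8989/server.exe"
```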

I also ran some performance tests; the results are added to the PR description. It seems to use a lot of memory, and I hit some OOMs during testing. But after increasing memory, the GPU version runs much faster than the CPU on my dev machine.


revans2 commented Aug 14, 2023

I also did some performance tests, results added in PR description. It seems it really cost much memory, I met some OOM during testing. But after increasing memory, GPU version runs much faster than CPU on my dev machine.

I would love to see the test code that you used, because I see the opposite. The GPU is faster than the CPU for a single core, but not for multiple cores.


thirtiseven commented Aug 14, 2023

I also did some performance tests, results added in PR description. It seems it really cost much memory, I met some OOM during testing. But after increasing memory, GPU version runs much faster than CPU on my dev machine.

I would love to see the test code that you used, because I see the opposite. The GPU is faster than the CPU for a single core, but not for multiple cores.

It's in the PR description. I think the difference is that I increased some memory configs. I can also reproduce the result that the CPU is faster than the GPU with the default memory config.


revans2 commented Aug 14, 2023

@thirtiseven could you please send me the memory configs that you changed? I am still not able to reproduce your results, at least not fully.

I ran the query in the PR description, but there were several problems with it.

First, all of the parse_url calls included the third parameter. This resulted in a null being returned for all of the URLs, which bypassed the regular expressions and made it much faster than not including the third parameter. Second, the data was skewed (a single row group), so all of the data was processed by a single task instead of being spread among all the tasks. Third, the foreach call causes the data to be copied to the host, which has some overhead, though it turns out not that much. Finally, the numbers you have for 100x vs 1x look really off: how is it that 100x the data takes 300 ms, but 1x the data takes 17,500 ms?

I fixed some of this by using 100x the data spread evenly across 12 files, not including the third parameter, and doing a COUNT on the results instead of a foreach.

spark.time(df.selectExpr("COUNT(parse_url(url, 'HOST')) as h","COUNT(parse_url(url, 'PATH')) as p","COUNT(parse_url(url, 'REF')) as r","COUNT(parse_url(url, 'PROTOCOL')) as pr","COUNT(parse_url(url, 'FILE')) as f","COUNT(parse_url(url, 'AUTHORITY')) as a","COUNT(parse_url(url, 'USERINFO')) as u").show())
| op | CPU (ms) | GPU (ms) | GPU fallback (ms) | notes |
| --- | --- | --- | --- | --- |
| 1x parquet foreach => nothing | 2,180 | 369 | 2,055 | data skew |
| 1x parquet + count | 1,774 | 207 | 1,731 | data skew |
| 1x parquet foreach => nothing, no third param | 4,543 | 1,909 | 4,480 | data skew |
| 1x parquet + count, no third param | 4,211 | 1,453 | 4,213 | data skew |
| 100x parquet foreach => nothing, no third param | 62,555 | 101,175 | 62,435 | |
| 100x parquet + count, no third param | 59,644 | 90,152 | 58,570 | |

For these I calculated the median of 3 to 5 runs, depending on how long they took to run. The GPU runs for the 100x data had a lot of variability between runs because my GPU would overheat.

What is more, I see results that differ between the CPU and the GPU. I don't know whether these correspond to the documented changes or not; I have not dug into them yet. But on the surface, the fact that the userinfo count is different for at least one URL in the examples provided is not great. (This is the 100x data case for the count, so divide by 100 to get the regular-data numbers.)

CPU:
+--------+--------+-----+--------+--------+--------+---+                        
|       h|       p|    r|      pr|       f|       a|  u|
+--------+--------+-----+--------+--------+--------+---+
|18694100|64613600|43300|18712200|64613600|18695700|300|
+--------+--------+-----+--------+--------+--------+---+

GPU:
+--------+--------+-----+--------+--------+--------+---+                        
|       h|       p|    r|      pr|       f|       a|  u|
+--------+--------+-----+--------+--------+--------+---+
|18793900|65091000|44200|18814700|65091000|18795600|400|
+--------+--------+-----+--------+--------+--------+---+


thirtiseven commented Aug 15, 2023

@thirtiseven could you please send me the memory configs that you changed, because I am still not able to reproduce your results. At least not fully.

I used --conf spark.executor.memory=10g --conf spark.driver.memory=18g when starting the spark-shell.

Finally the numbers you have for 100x vs 1x look really off. How is it that 100x the data is 300ms, but 17,500 ms for 1x the data?

Sorry, the results are reversed.

Thanks a lot for the testing; I'm checking the issues you mentioned.

Update: I can reproduce your results now; the GPU really is slower on the big dataset. I also found more failing cases and am working on solving them.

+--------+--------+-----+--------+--------+--------+---+
|       h|       p|    r|      pr|       f|       a|  u|
+--------+--------+-----+--------+--------+--------+---+
|18750600|64690300|44100|18770700|64690300|18752300|300|
+--------+--------+-----+--------+--------+--------+---+

@thirtiseven

I have matched almost all cases (12 of 651,191 left) in the Kaggle dataset and recorded the unmatched ones in the compatibility doc.

I think it is possible to get good enough correctness here, although a regex solution won't be faster than the CPU. We would need to rewrite it in JNI to get good performance.

Will we have a chance to merge this PR if it covers enough cases but is slower than the CPU? @revans2

}

private def reValid(url: ColumnVector): ColumnVector = {
val regex = """([^\s]*\s|([^[]*|[^[]*\[.*].*)%([^0-9a-fA-F]|[0-9a-fA-F][^0-9a-fA-F]|$))"""
Collaborator (Author):


It is used to check that each % escape is followed by two valid hex digits, but doing another regex pass takes much more time. We may want to consider not supporting this check, to improve performance.
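
To make the check concrete, the %-escape branch alone behaves like this (demonstrated with java.util.regex as a stand-in for cuDF; the whitespace branch and the bracketed-IPv6 carve-out from the full pattern above are omitted):

```scala
import java.util.regex.Pattern

// Flags a '%' that is not followed by two hex digits, or that ends the string.
val badEscape = Pattern.compile("""%([^0-9a-fA-F]|[0-9a-fA-F][^0-9a-fA-F]|$)""")

println(badEscape.matcher("http://a/%zz").find()) // true: invalid escape
println(badEscape.matcher("http://a/%").find())   // true: dangling '%'
println(badEscape.matcher("http://a/%20").find()) // false: %20 is valid
```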


revans2 commented Aug 16, 2023

Will we have chance to merge this pr in if it can cover enough cases but slower than cpu?

I have a really hard time merging this in if it is slower than the CPU. The differences are documented, and the code works a lot better than I expected a large regular expression to work. But something that is 2x slower than the CPU is not worth it, in my opinion. If there were a clear use case where having this wins, I would reconsider. Instead, I think we should put this PR on hold until we get a spark-rapids-jni custom kernel for it.
