Support parse_url #8761
Conversation
Signed-off-by: Haoyang Li <[email protected]>
Looks like a really good start.
// Should really be an OptionalParam
Some(RepeatingParamCheck("key", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING))),
(a, conf, p, r) => new ExprMeta[ParseUrl](a, conf, p, r) {
  val failOnError = SQLConf.get.ansiEnabled
This should be a.failOnError, because if someone somehow set ANSI mode on just this expression (or turned it off), we don't want to override that.
Updated. I also marked failOnError as not working on the GPU for now, because it would be difficult to tell whether a regex failed or legitimately returned null, and to throw error messages that match the CPU's.
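A minimal sketch of what that tagging might look like (the override body and message wording are assumptions; tagExprForGpu and willNotWorkOnGpu are the plugin's standard hooks):

(a, conf, p, r) => new ExprMeta[ParseUrl](a, conf, p, r) {
  override def tagExprForGpu(): Unit = {
    // Read the flag from the expression itself rather than SQLConf.get,
    // so a per-expression ANSI setting is respected.
    if (a.failOnError) {
      willNotWorkOnGpu("ANSI mode (failOnError) is not supported: the GPU " +
        "cannot easily tell a failed parse from a legitimate null, nor " +
        "match the CPU's error messages")
    }
  }
}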
private val FILE = "FILE"
private val AUTHORITY = "AUTHORITY"
private val USERINFO = "USERINFO"
private val REGEXPREFIX = """(&|^|\?)"""
Could you add a comment about why this pattern is different from the one that Spark uses?
private val REGEXPREFIX = "(&|^)"
If the extra \? is there to strip a ? at the beginning of a key, wouldn't it mess up something like "http://foo/bar?a=b&?c=d"? I think this code will return a result for c when the CPU does not.
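For illustration, the difference can be reproduced on the CPU with plain Java regexes (a sketch; Spark's suffix pattern =([^&]*) is assumed from its source):

import java.util.regex.Pattern

val query = "a=b&?c=d" // the query part of http://foo/bar?a=b&?c=d

// Spark's prefix: a key only matches at the start or right after '&'.
val cpu = Pattern.compile("(&|^)" + "c" + "=([^&]*)").matcher(query)
println(cpu.find()) // false -- the key after '&' is "?c", not "c"

// With the extra \? alternative the prefix also consumes a stray '?',
// so the key "c" wrongly matches and returns "d".
val gpu = Pattern.compile("(&|^|\\?)" + "c" + "=([^&]*)").matcher(query)
println(gpu.find()) // true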
Fixed.
}
private def reValid(url: ColumnVector): ColumnVector = {
  // TODO: Validate the url
This is a really big TODO. We probably should fall back to the CPU by default on this, at least until this is done, but it does look like a lot of validation is done when we try to parse the regular expression.
Yes, almost all validation is done when extracting.
}
private def getPattern(key: UTF8String): RegexProgram = {
  val regex = REGEXPREFIX + key.toString + REGEXSUBFIX
This brings up an interesting corner case. Technically the key is a regular expression: I could mess things up by putting a .* in the name of the key and it would match. Our regular expressions are not bit-for-bit compatible with Spark's. I am not sure if it is better for us to escape the regular expression and be sure that we treat it as a literal value, to try to transpile the resulting regular expression, or to just document that this might be different.
I personally think that this is a bug, so I filed https://issues.apache.org/jira/browse/SPARK-44500 for it. I would prefer that we fix the problem in our code and document that it is different.
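A quick CPU-side sketch of the corner case (the key "a.c" is a hypothetical example; Spark's prefix/suffix patterns are assumed):

import java.util.regex.Pattern

val query = "abc=1&key=2"

// The key is spliced into the pattern unescaped, so regex
// metacharacters in it are interpreted rather than matched literally.
val m = Pattern.compile("(&|^)" + "a.c" + "=([^&]*)").matcher(query)
if (m.find()) println(m.group(2)) // prints "1" -- the key "a.c" matched "abc"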
cuDF regex does not support \Q and \E, so we cannot use Pattern.quote directly here.
We also can't just add a \ to escape every special character, because the set of literal characters differs by context: in ordinary pattern text any character except [\^$.|?*+() is literal, while inside a character class any character except \^-] is literal.
I suggest creating a follow-up issue on this if you think it is ok.
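For reference, the Pattern.quote limitation is standard JDK behavior and easy to see directly:

import java.util.regex.Pattern

// Pattern.quote wraps the input in \Q...\E quoting markers,
// which the cuDF regex engine does not understand.
println(Pattern.quote("a.*b")) // prints \Qa.*b\E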
#8963 should be a good enough follow-on. For now I think this is okay.
private def reValid(url: ColumnVector): ColumnVector = {
  // TODO: Validate the url
  val regex = """^[^ ]*$"""
If all we are doing is checking to see if the URL has white space in it, there are faster ways to do that. stringContains is the most obvious example I can think of.
After I fixed all the cases I know of, the only thing left to check before extracting is still whether the URL has white space in it. Updated to stringContains.
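A minimal sketch of that check with the cuDF Java bindings (assuming the plugin's usual withResource helper; a complete check might also need tabs and newlines):

import ai.rapids.cudf.{ColumnVector, Scalar}
import com.nvidia.spark.rapids.Arm.withResource

// Boolean column: true where the URL contains a space.
// Validity is the inverse of this mask.
def containsWhitespace(url: ColumnVector): ColumnVector = {
  withResource(Scalar.fromString(" ")) { space =>
    url.stringContains(space)
  }
}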
}
private def reMatch(url: ColumnVector, partToExtract: String): ColumnVector = {
  val regex = """^(([^:/?#]+):)(//((([^:]*:?[^\@]*)\@)?(\[[0-9A-Za-z%.:]*\]|[^/?#:]*)""" +
From a performance standpoint, capturing all of the groups to turn them into a table and then selecting only one of them is a real problem. You don't have to do it now, but we should at least file a follow-on issue to build up the regular expression so we only capture the group that matters. That would also let us document the different parts of the pattern and explain how they are used.
Done.
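One possible shape for that follow-on, sketched under assumptions (the patterns here are simplified stand-ins, and the RegexProgram-based extractRe overload is assumed): build the pattern per part so only the requested part is a capturing group.

import ai.rapids.cudf.{ColumnVector, RegexProgram, Table}

// Simplified, illustrative patterns: one capturing group each,
// everything else non-capturing.
def patternFor(partToExtract: String): String = partToExtract match {
  case "PROTOCOL" => """^([^:/?#]+):"""
  case "HOST"     => """^(?:[^:/?#]+)://(?:[^/?#]*@)?([^/?#:]*)"""
  case "QUERY"    => """\?([^#]*)"""
  case other      => throw new IllegalArgumentException(other)
}

def extractPart(url: ColumnVector, partToExtract: String): ColumnVector = {
  val table: Table = url.extractRe(new RegexProgram(patternFor(partToExtract)))
  try {
    table.getColumn(0).incRefCount() // single-column table now
  } finally {
    table.close()
  }
}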
// 2001:db8:3:4::192.0.2.33 64:ff9b::192.0.2.33 (IPv4-Embedded IPv6 Address)
val ipv6Regex13 = """(0:0:0:0:0:(0|FFFF|ffff):((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))"""
// 0:0:0:0:0:0:13.1.68.3
// scalastyle:on
The regex contains many workarounds for cuDF compatibility issues and (maybe) bugs. I'll try to collect these cases later to see if they really are bugs.
I filed #8963 instead. Really, this regexp is so long that there is no way the cuDF regexp parser is going to run it well. We really should just have a custom kernel for all of this.
My main concerns with the current implementation are around scale and performance. The testing appears to be fairly complete, although it might be nice to have a fuzz tester to help validate it more.
Sorry, I did some of my own testing and I am not done with it, but I don't think this is doing the right thing in enough cases for me to approve it at this time. I still need to run more performance tests too.
The following are 749 URLs that I found that do not parse correctly. They are ones where the CPU returned a non-null value for at least one of the URL fields but the GPU got it wrong: bad_urls.csv. But be careful: these came from some malicious URL datasets, so don't click on any of them!!!

There are also a lot of problems with performance. The data set I used, of about 600,000 URLs, took twice as long on my GPU as on a 12-core CPU that is getting rather old now: 4 seconds vs 2 seconds. I am willing to look past that if we were getting correct answers, but we are not, so I don't feel good about checking this in as is.
// scalastyle:off line.size.limit
import session.sqlContext.implicits._
Seq[String](
  "grbj.lyjq.org:8989/server.exe",
Because many of these URLs are malicious, I am not 100% sure that we want to check them in. @sameerz are you okay with these being checked in? I could try to go back and keep just the non-malicious URLs in the data sets I found and see if that still repros the problems. Another option would be to check these in as a parquet file so at least they are not in plain text.
Ok, I've figured out why these cases fail. I can generate some random urls that reproduce the problems for the final version.
@sameerz are you okay with these being checked in?

No, do not check in the malicious urls.
Can we use a standards document like https://datatracker.ietf.org/doc/html/rfc3986#page-49 to help us create test cases for valid or invalid URLs?
@sameerz Thanks! It's very helpful for generating valid test cases and checking that we've considered everything. There are many possibilities for invalid urls and edge cases, so I think it still relies on testing against many real-world urls.
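A hedged sketch of what such a generator could look like (the component sets are simplified stand-ins for the RFC 3986 grammar):

import scala.util.Random

// Very simplified RFC 3986 building blocks; real fuzzing would also mix
// in percent-escapes, userinfo, unusual ports, and invalid bytes.
val schemes = Seq("http", "https", "ftp")
val hosts   = Seq("example.com", "127.0.0.1", "[2001:db8::1]")
val paths   = Seq("", "/", "/a/b", "/a%20b")
val queries = Seq("", "?a=1", "?a=1&b=2", "?a=b&?c=d")
val frags   = Seq("", "#top", "#a/b?c")

def randomUrl(r: Random): String = {
  def pick[T](xs: Seq[T]) = xs(r.nextInt(xs.length))
  s"${pick(schemes)}://${pick(hosts)}${pick(paths)}${pick(queries)}${pick(frags)}"
}

val rng = new Random(42)
val urls = Seq.fill(1000)(randomUrl(rng)) // feed CPU and GPU, diff the results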
I did another look for more URL datasets, but this time I was looking for a crawler dataset. I found https://commoncrawl.org/the-data/ so that might be an interesting alternative to get some URLs from. Another alternative could be to look for unit/integration tests for other URL/URI parsers with a permissive license that we could adopt.
@revans2 Thank you for your testing and data. These failed test cases contain two types:
I also did some performance tests; results are added to the PR description. It really does cost a lot of memory, and I hit some OOMs during testing. But after increasing memory, the GPU version runs much faster than the CPU on my dev machine.
I would love to see the test code that you used, because I see the opposite. The GPU is faster than the CPU for a single core, but not for multiple cores.
It's in the PR description. I think the difference is that I increased some memory configs. I can also reproduce the result that the CPU is faster than the GPU with the default memory config.
@thirtiseven could you please send me the memory configs that you changed, because I am still not able to reproduce your results, at least not fully. I ran the query in the PR description, but there were several problems with it. I fixed some of this by using 100x the data spread evenly between 12 files, not including the third parameter, and doing a COUNT on the results instead of a foreach.
For these I calculated the median of 3 to 5 runs, depending on how long they took. The GPU runs for the 100x data had a lot of variability between runs because my GPU would overheat. What is more, I see results that differ between the CPU and the GPU. I don't know yet whether these correspond to the documented differences; I have not dug into them. But on the surface, the fact that the user info differs for at least one url in the examples provided is not great. (This is for the 100x data case with the count, so just divide by 100 to get the regular data example.)
I used --conf spark.executor.memory=10g --conf spark.driver.memory=18g when starting the spark-shell.
Sorry, the results were reversed. Thanks a lot for the testing; I'm checking the issues you mentioned.
Update: I can reproduce your results now; the GPU really is slower on the big dataset. I also found more failing cases and am working on solving them.
I have matched almost all cases (12/651191 left) in the kaggle dataset and recorded the unmatched ones in the compatibility doc. I think it is possible to get enough correctness here, although a regex solution won't be able to get faster than the cpu; we would need to rewrite it in jni to get good performance. Will we have a chance to merge this pr if it covers enough cases but is slower than the cpu? @revans2
}
private def reValid(url: ColumnVector): ColumnVector = {
  val regex = """([^\s]*\s|([^[]*|[^[]*\[.*].*)%([^0-9a-fA-F]|[0-9a-fA-F][^0-9a-fA-F]|$))"""
It is used to check that each %?? sequence is a valid escape, but running another regex takes considerably more time. We may want to consider dropping this check to improve performance.
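For reference, a simplified CPU-side version of that escape check (illustration only: the negative lookahead used here is not supported by cuDF regex, which is part of why the GPU pattern is more convoluted):

import java.util.regex.Pattern

// A '%' that is not followed by two hex digits is an invalid escape.
val invalidEscape = Pattern.compile("%(?![0-9a-fA-F]{2})")

println(invalidEscape.matcher("http://h/%41").find()) // false: %41 is valid
println(invalidEscape.matcher("http://h/%4g").find()) // true: bad hex digit
println(invalidEscape.matcher("http://h/%").find())   // true: dangling %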
I have a really hard time merging this in if it is slower than the CPU. The differences are documented, and the code works a lot better than I expected a large regular expression to work. But something that is 2x slower than the CPU is not worth it in my opinion. If there were a clear use case where having this in wins, I would reconsider. Instead I think we should put this PR on hold until we get a spark-rapids-jni custom kernel for it.
Closes #6969
Closes #8512
This PR partly supports parse_url. It would be difficult to follow, from the plugin side, what Java's URI library (which Spark's implementation uses) does, because that Java code is long and procedural, with heavy use of mutable variables and if/else. So this PR uses regex to cover most cases.
It has passed all the cases I know of from the internet, Spark, and the Java URI lib. Note that the implementation is test-driven, so it is quite possible that more cases produce different results between cpu and gpu, but I believe it covers most cases.
Fully matching all of Spark's behaviors would make the regex very long (see this post for the many proposed solutions to validating a url). Here is a list of things I plan to add to this PR:
Some performance test results:
Test code:
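A hedged reconstruction of the kind of query used (the file path and url column name are assumptions; the original snippet may differ):

import org.apache.spark.sql.functions.expr

// spark-shell sketch; assumes a CSV with a single `url` column.
val df = spark.read.option("header", "true").csv("/data/urls.csv")

val parts = Seq("HOST", "PATH", "QUERY", "PROTOCOL")
val parsed = parts.foldLeft(df) { (d, p) =>
  d.withColumn(p, expr(s"parse_url(url, '$p')"))
}

// COUNT forces evaluation without driver-side per-row work.
spark.time(parsed.selectExpr(parts: _*).count())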
Dataset: 651191 urls from kaggle:
GPU: 314 ms (average of 5 runs)
CPU: 1147 ms (average of 5 runs)
Dataset: the same kaggle dataset, repeated 100 times:
GPU: 17402 ms (average of 5 runs)
CPU: 97978 ms (average of 5 runs)