Use parse_url kernel for PROTOCOL parsing #9481

Merged
merged 35 commits into from
Dec 12, 2023
8235c95
WIP: Support parse_url
thirtiseven Jul 20, 2023
9f17539
Merge branch 'NVIDIA:branch-23.08' into prase_url
thirtiseven Jul 20, 2023
729fe35
fix build failures
thirtiseven Jul 20, 2023
85c3284
regex refactor
thirtiseven Aug 3, 2023
6819214
Merge branch 'NVIDIA:branch-23.08' into prase_url
thirtiseven Aug 3, 2023
4166362
Separate regexes and UTF-8 special characters support
thirtiseven Aug 3, 2023
43acceb
hostname validation
thirtiseven Aug 3, 2023
64d8373
hostname validation
thirtiseven Aug 3, 2023
e6a45d3
ipv4 validation
thirtiseven Aug 4, 2023
8c4dc7a
verify
thirtiseven Aug 4, 2023
fee5a3d
wip ipv6 and SPARK-44500
thirtiseven Aug 4, 2023
e81d8a3
optional protocol and ref validation
thirtiseven Aug 7, 2023
93a9342
IPV6 VALIDATION
thirtiseven Aug 8, 2023
1ad665f
clean up
thirtiseven Aug 8, 2023
3edb929
Fix ipv6 validation, it is still wip
thirtiseven Aug 9, 2023
daa61ea
Fix ipv6 validation and some clean up
thirtiseven Aug 9, 2023
70a5d88
Merge branch 'prase_url' into parse_url_protocol
thirtiseven Oct 19, 2023
b3abaf6
Use parse_url kernel for PROTOCOL parsing
thirtiseven Oct 19, 2023
592c642
verify
thirtiseven Oct 19, 2023
9db1b2a
edit compatibility and update IT
thirtiseven Oct 19, 2023
d09f06d
update integration tests
thirtiseven Oct 20, 2023
3b71c4d
address comments
thirtiseven Oct 24, 2023
46527f3
remove unnecessary error handling
thirtiseven Oct 24, 2023
6161fa4
clean up
thirtiseven Oct 24, 2023
e16fe1e
Merge branch 'parse_url_protocol' of https://github.com/thirtiseven/s…
thirtiseven Nov 16, 2023
8e7ed44
Merge branch 'thirtiseven-parse_url_protocol' into parse_url_protocol
thirtiseven Nov 16, 2023
f93b944
Merge branch 'NVIDIA:branch-23.12' into parse_url_protocol
thirtiseven Nov 16, 2023
8f4990c
Revert scala tests temporarily for easier testing
thirtiseven Nov 16, 2023
3376376
Fix two nits
thirtiseven Nov 16, 2023
4e98888
Updated results
thirtiseven Nov 22, 2023
6d916c4
clean up
thirtiseven Nov 22, 2023
1b36090
rename urlFunctions to GpuParseUrl
thirtiseven Nov 28, 2023
7eca922
Merge branch 'branch-23.12' into parse_url_protocol
thirtiseven Dec 1, 2023
e4fdf13
Merge branch 'NVIDIA:branch-24.02' into parse_url_protocol
thirtiseven Dec 4, 2023
3ace124
verify
thirtiseven Dec 6, 2023
Fix ipv6 validation, it is still wip
thirtiseven committed Aug 9, 2023
commit 3edb9298fd392f0e12104c3422101a7bbed9667c
10 changes: 10 additions & 0 deletions docs/compatibility.md
Original file line number Diff line number Diff line change
@@ -456,6 +456,16 @@ Spark stores timestamps internally relative to the JVM time zone. Converting an
between time zones is not currently supported on the GPU. Therefore operations involving timestamps
will only be GPU-accelerated if the time zone used by the JVM is UTC.

## URL parsing

In Spark, `parse_url` is based on Java's URI library, while the implementation in the RAPIDS Accelerator is based on regex extraction. Therefore, the results may differ in some edge cases.

These are the known cases where running on the GPU will produce different results from the CPU:

- Spark allows an empty authority component only when it is followed by a non-empty path,
query component, or fragment component. In the plugin, `parse_url` allows an empty
authority component without checking whether anything follows it. So `parse_url('http://', 'HOST')` returns `null` in Spark but `""` in the plugin.
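
Review note: the empty-authority difference described above can be reproduced on the CPU alone. The sketch below is a minimal illustration, not the plugin's code. `EmptyAuthorityDemo`, `hostViaJavaUri`, `hostViaRegex`, and the shortened pattern are all hypothetical names introduced here, and `java.util.regex` stands in for the cudf regex engine.

```scala
import java.net.{URI, URISyntaxException}
import java.util.regex.Pattern

object EmptyAuthorityDemo {
  // CPU-style lookup: java.net.URI rejects "http://" with a URISyntaxException,
  // so no host is produced (modelled here as None).
  def hostViaJavaUri(url: String): Option[String] =
    try Option(new URI(url).getHost)
    catch { case _: URISyntaxException => None }

  // Simplified stand-in for a regex-based extractor. This is NOT the plugin's
  // HOST_REGEX; it only keeps the parts needed to show the empty-host capture.
  private val simplifiedHostPattern = Pattern.compile(
    """^(?:[^:/?#]+:)?(?://(?:[^@/?#]*@)?(\[[0-9A-Za-z%.:]+\]|[^/#:?]*)(?::[0-9]+)?)?[^?#]*(?:\?[^#]*)?(?:#.*)?$""")

  // Regex-style lookup: the host group happily captures an empty string.
  def hostViaRegex(url: String): Option[String] = {
    val m = simplifiedHostPattern.matcher(url)
    if (m.matches()) Option(m.group(1)) else None
  }
}
```

With this sketch, `hostViaJavaUri("http://")` yields no host while `hostViaRegex("http://")` yields an empty string, which is the same shape of mismatch the bullet documents.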

## Windowing

### Window Functions
Original file line number Diff line number Diff line change
@@ -42,9 +42,6 @@ object GpuParseUrl {
private val REGEXPREFIX = """(&|^)("""
private val REGEXSUBFIX = "=)([^&]*)"
// scalastyle:off line.size.limit
// val regex = """^(?:(?:([^:/?#]+):)?(?://((?:(?:([^:]*:?[^\@]*)@)?(\[[0-9A-Za-z%.:]*\]|[^/#:?]*))(?::[0-9]+)?))?([^?#]*)(\?[^#]*)?(#.*)?)$"""
// a 0 0 b 1 c d 2 2 d 3 3ce e 1b 4 45 5 6 6 a
// val regex = """^(?:(?:(?:[^:/?#]+):)?(?://(?:(?:(?:(?:[^:]*:?[^\@]*)@)?(?:\[[0-9A-Za-z%.:]+\]|[^/#:?]*))(?::[0-9]+)?))?(?:[^?#]*)(?:\?[^#]*)?(?:#.*)?)$"""
private val HOST_REGEX = """^(?:(?:(?:[^:/?#]+):)?(?://(?:(?:(?:(?:[^:]*:?[^\@]*)@)?(\[[0-9A-Za-z%.:]+\]|[^/#:?]*))(?::[0-9]+)?))?(?:[^?#]*)(?:\?[^#]*)?(?:#[a-zA-Z0-9\-_.!~*'();/?:@&=+$,[\]%]*)?)$"""
private val PATH_REGEX = """^(?:(?:(?:[^:/?#]+):)?(?://(?:(?:(?:(?:[^:]*:?[^\@]*)@)?(?:\[[0-9A-Za-z%.:]+\]|[^/#:?]*))(?::[0-9]+)?))?([^?#]*)(?:\?[^#]*)?(?:#[a-zA-Z0-9\-_.!~*'();/?:@&=+$,[\]%]*)?)$"""
private val QUERY_REGEX = """^(?:(?:(?:[^:/?#]+):)?(?://(?:(?:(?:(?:[^:]*:?[^\@]*)@)?(?:\[[0-9A-Za-z%.:]+\]|[^/#:?]*))(?::[0-9]+)?))?(?:[^?#]*)(\?[^#]*)?(?:#[a-zA-Z0-9\-_.!~*'();/?:@&=+$,[\]%]*)?)$"""
@@ -79,17 +76,8 @@ case class GpuParseUrl(children: Seq[Expression],
super[ExpectsInputTypes].checkInputDataTypes()
}

private def escapeRegex(str: String): String = {
// Escape all regex special characters in \^$.|?*+(){}-[]
// This is a workaround for \Q and \E not working
// in cudf regex; Pattern.quote(str) can be used instead once they are supported.
str.replaceAll("""[\^$.|?*+()\[\]-]""", "\\$0")
}

private def getPattern(key: UTF8String): RegexProgram = {
// SPARK-44500: in spark, the key is treated as a regex.
// In plugin we quote the key to be sure that we treat it as a literal value.
val regex = REGEXPREFIX + escapeRegex(key.toString) + REGEXSUBFIX
val regex = REGEXPREFIX + key.toString + REGEXSUBFIX
new RegexProgram(regex)
}
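
Review note: the SPARK-44500 comment in this hunk is about whether the query key is interpolated into the pattern as a regex or as a literal. The sketch below shows the difference on the CPU, assuming `java.util.regex` as a stand-in for the cudf engine. `QueryKeyDemo` and `lookup` are hypothetical names; only the prefix and suffix strings are taken from the diff.

```scala
import java.util.regex.Pattern

object QueryKeyDemo {
  // Same prefix/suffix strings as REGEXPREFIX and REGEXSUBFIX in the diff.
  private val Prefix = """(&|^)("""
  private val Suffix = "=)([^&]*)"

  // Returns the first value captured for `key` in `query`.
  // quote = false interpolates the key as a regex; quote = true treats it literally.
  def lookup(query: String, key: String, quote: Boolean): Option[String] = {
    val keyPart = if (quote) Pattern.quote(key) else key
    val m = Pattern.compile(Prefix + keyPart + Suffix).matcher(query)
    if (m.find()) Some(m.group(3)) else None
  }
}
```

For the query `abc=BAD&a.c=GOOD` and key `a.c`, the unquoted form lets `.` match the `b` in `abc`, so the first hit is `BAD`; the quoted form only matches the literal key and returns `GOOD`.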

@@ -136,14 +124,9 @@ case class GpuParseUrl(children: Seq[Expression],
// hostname = domainlabel [ "." ] | 1*( domainlabel "." ) toplabel [ "." ]
// domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
// toplabel = alpha | alpha *( alphanum | "-" ) alphanum

// Note: Spark allows an empty authority component only when it is followed by a non-empty path,
// query component, or fragment component. In the plugin, parse_url allows an empty
// authority component without checking whether anything follows it.
val hostnameRegex = """((([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])|(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)+([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z]))\.?)"""
// ipv4_regex
val ipv4Regex = """(((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))"""
val simpleIpv6Regex = """(\[[0-9A-Za-z%.:]*\])"""
val simpleIpv6Regex = """(\[[0-9A-Za-z%.:]+])"""
// scalastyle:on
val regex = "^(" + hostnameRegex + "|" + ipv4Regex + "|" + simpleIpv6Regex + ")$"
val prog = new RegexProgram(regex)
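
Review note: the IPv4 alternative in this hunk can be checked in isolation. The sketch below copies the `ipv4Regex` text from the diff and runs it through `java.util.regex`, which is an assumption made for illustration; behaviour under the cudf engine is not verified here. `Ipv4RegexDemo` and `isValidIpv4` are hypothetical names.

```scala
import java.util.regex.Pattern

object Ipv4RegexDemo {
  // Pattern text copied from the ipv4Regex value in the hunk above.
  private val ipv4Regex =
    """(((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))"""

  // Anchored so that the whole host string must be a dotted quad.
  private val pattern = Pattern.compile("^" + ipv4Regex + "$")

  // True only when every octet is in 0..255 and there are exactly four of them.
  def isValidIpv4(host: String): Boolean = pattern.matcher(host).matches()
}
```

The octet alternatives are ordered from widest to narrowest, but backtracking makes the order irrelevant for correctness in this engine: `256.1.1.1` and `1.2.3` are rejected, `192.168.0.1` is accepted.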
@@ -172,33 +155,51 @@ case class GpuParseUrl(children: Seq[Expression],
// scalastyle:off line.size.limit
// regex basically copied from https://stackoverflow.com/questions/53497/regular-expression-that-matches-valid-ipv6-addresses
// split the ipv6 regex into 8 parts to avoid the regex size limit
val ipv6Regex1 = """(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4})""" // 1:2:3:4:5:6:7:8
val ipv6Regex2 = """(([0-9a-fA-F]{1,4}:){1,7}:)""" // 1:: 1:2:3:4:5:6:7::
val ipv6Regex3 = """(([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4})""" // 1::8 1:2:3:4:5:6::8 1:2:3:4:5:6::8
val ipv6Regex4 = """(([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2})""" // 1::7:8 1:2:3:4:5::7:8 1:2:3:4:5::8
val ipv6Regex5 = """(([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3})""" // 1::6:7:8 1:2:3:4::6:7:8 1:2:3:4::8
val ipv6Regex6 = """(([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4})""" // 1::5:6:7:8 1:2:3::5:6:7:8 1:2:3::8
val ipv6Regex7 = """(([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5})""" // 1::4:5:6:7:8 1:2::4:5:6:7:8 1:2::8
val ipv6Regex8 = """([0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6}))""" // 1::3:4:5:6:7:8 1::3:4:5:6:7:8 1::8
val ipv6Regex9 = """(:((:[0-9a-fA-F]{1,4}){1,7}|:))""" // ::2:3:4:5:6:7:8 ::2:3:4:5:6:7:8 ::8 ::
val ipv6Regex10 = """fe80:((:([0-9a-fA-F]{1,4})?){1,4})?%[0-9a-zA-Z]+|""" // fe80::7:8%eth0 fe80::7:8%1 (link-local IPv6 addresses with zone index)
val ipv6Regex11 = """::(ffff(:0{1,4})?:)?((25[0-5]|(2[0-4]|1?[0-9])?[0-9])\.){3}(25[0-5]|(2[0-4]|1?[0-9])?[0-9])"""
val ipv6Regex1 = """([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?"""
// 1:2:3:4:5:6:7:8
val ipv6Regex2 = """([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?:"""
// 1:: 1:2:3:4:5:6:7::
val ipv6Regex3 = """(([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)"""
// 1::8 1:2:3:4:5:6::8 1:2:3:4:5:6::8
val ipv6Regex4 = """(([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?)"""
// 1::7:8 1:2:3:4:5::7:8 1:2:3:4:5::8
val ipv6Regex5 = """(([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?)"""
// 1::6:7:8 1:2:3:4::6:7:8 1:2:3:4::8
val ipv6Regex6 = """(([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?)"""
// 1::5:6:7:8 1:2:3::5:6:7:8 1:2:3::8
val ipv6Regex7 = """(([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?)"""
// 1::4:5:6:7:8 1:2::4:5:6:7:8 1:2::8
val ipv6Regex8 = """([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:((:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?))"""
// 1::3:4:5:6:7:8 1::3:4:5:6:7:8 1::8
val ipv6Regex9 = """(:((:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?(:[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?|:))"""
// ::2:3:4:5:6:7:8 ::2:3:4:5:6:7:8 ::8 ::
val ipv6Regex10 = """(fe80:((:([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?)(:([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?)?(:([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?)?(:([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)?)?)?%[0-9a-zA-Z]+)"""
// fe80::7:8%eth0 fe80::7:8%1 (link-local IPv6 addresses with zone index)
val ipv6Regex11 = """(::(ffff(:00?0?0?)?:)?((25[0-5]|(2[0-4]|1?[0-9])?[0-9])\.)((25[0-5]|(2[0-4]|1?[0-9])?[0-9])\.)?((25[0-5]|(2[0-4]|1?[0-9])?[0-9])\.)?(25[0-5]|(2[0-4]|1?[0-9])?[0-9]))"""
// ::255.255.255.255 ::ffff:255.255.255.255 ::ffff:0:255.255.255.255 (IPv4-mapped IPv6 addresses and IPv4-translated addresses)
val ipv6Regex12 = """([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1?[0-9])?[0-9])\.){3}(25[0-5]|(2[0-4]|1?[0-9])?[0-9])"""
val ipv6Regex12 = """([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:)?:((25[0-5]|(2[0-4]|1?[0-9])?[0-9])\.)((25[0-5]|(2[0-4]|1?[0-9])?[0-9])\.)((25[0-5]|(2[0-4]|1?[0-9])?[0-9])\.)(25[0-5]|(2[0-4]|1?[0-9])?[0-9])"""
// 2001:db8:3:4::192.0.2.33 64:ff9b::192.0.2.33 (IPv4-Embedded IPv6 Address)
// scalastyle:on
val regex = "^" + ipv6Regex1 + "|" + ipv6Regex2 + "|" + ipv6Regex3 + "|" + ipv6Regex4 + "|" +
ipv6Regex5 + ipv6Regex6 + "|" + ipv6Regex7 + "|" + ipv6Regex8 + "|" + ipv6Regex9 +
ipv6Regex10 + ipv6Regex11 + ipv6Regex12 + "$"

val regex = """^\[(""" + ipv6Regex1 + "|" + ipv6Regex2 + "|" + ipv6Regex3 + "|" + ipv6Regex4 + "|" +
ipv6Regex5 + "|" + ipv6Regex6 + "|" + ipv6Regex7 + "|" + ipv6Regex8 + "|" + ipv6Regex9 + "|" +
ipv6Regex10 + "|" + ipv6Regex11 + "|" + ipv6Regex12 + """)]$"""
// ^\[((([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:){7,7}[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?)|(([0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?:){1,7}:))]$
val prog = new RegexProgram(regex)

GpuColumnVector.debug("cv", cv)
GpuColumnVector.debug("simpleMatched", simpleMatched)

val invalidIpv6 = withResource(cv.matchesRe(prog)) { matched =>
matched.not()
withResource(matched.not()) { invalid =>
GpuColumnVector.debug("invalid", invalid)
simpleMatched.and(invalid)
}
}
withResource(invalidIpv6) { _ =>
withResource(Scalar.fromNull(DType.STRING)) { nullScalar =>
invalidIpv6.ifElse(cv, nullScalar)
val x = invalidIpv6.ifElse(nullScalar, cv)
GpuColumnVector.debug("unsetinvalidIpv6", x)
x
}
}
}
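
Review note: two things change in how the IPv6 alternatives are combined in this hunk. The earlier concatenation drops some `"|"` separators, and its `^` and `$` are not wrapped around a group. The sketch below illustrates why the grouping matters, using two toy alternatives and `java.util.regex` with `find()` semantics as assumptions. `AlternationAnchorDemo`, `loose`, `strict`, and `matchesAnywhere` are hypothetical names.

```scala
import java.util.regex.Pattern

object AlternationAnchorDemo {
  // Two toy alternatives standing in for the much longer ipv6RegexN fragments.
  private val alternatives = Seq("a+", "b+")

  // Anchors bind tighter than '|', so this pattern reads as (^a+)|(b+$).
  val loose: Pattern = Pattern.compile("^" + alternatives.mkString("|") + "$")

  // Wrapping the alternation in a group applies both anchors to every alternative.
  // Joining with mkString also makes it impossible to forget a separator.
  val strict: Pattern = Pattern.compile("^(" + alternatives.mkString("|") + ")$")

  def matchesAnywhere(p: Pattern, s: String): Boolean = p.matcher(s).find()
}
```

With the ungrouped form, `aaXYZ` is accepted because only the first alternative carries `^`; the grouped form rejects it and still accepts `bbb`.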
@@ -210,8 +211,8 @@ case class GpuParseUrl(children: Seq[Expression],
}

def doColumnar(url: GpuColumnVector, partToExtract: GpuScalar): ColumnVector = {
val valid = reValid(url.getBase)
val part = partToExtract.getValue.asInstanceOf[UTF8String].toString
val valid = reValid(url.getBase)
val matched = withResource(valid) { _ =>
reMatch(valid, part)
}
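
Review note: `doColumnar` relies on `withResource` to release the intermediate `valid` column once `reMatch` has consumed it. The sketch below is a hypothetical re-implementation of that loan pattern for readers unfamiliar with it; the plugin ships its own helper, and `LoanPatternDemo`, `Tracked`, and `describe` exist only in this example.

```scala
object LoanPatternDemo {
  // Hypothetical stand-in for the plugin's helper: run the block, then always close.
  def withResource[T <: AutoCloseable, V](resource: T)(block: T => V): V =
    try block(resource) finally resource.close()

  // Toy resource that records whether close() was called.
  final class Tracked(val name: String) extends AutoCloseable {
    var closed = false
    override def close(): Unit = { closed = true }
  }

  // Mirrors the shape of `withResource(valid) { _ => reMatch(valid, part) }`.
  def describe(t: Tracked): String = withResource(t) { r => s"used ${r.name}" }
}
```

The result of the block is returned to the caller, and the resource is closed even if the block throws.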
Original file line number Diff line number Diff line change
@@ -156,17 +156,10 @@ class UrlFunctionsSuite extends SparkQueryCompareTestSuite {
).toDF("urls")
}

def urlWithRegexLikeQuery(session: SparkSession): DataFrame = {
import session.sqlContext.implicits._
Seq[String](
"http://foo/bar?abc=BAD&a.c=GOOD",
"http://foo/bar?a.c=GOOD&abc=BAD"
).toDF("urls")
}

def urlIpv6Host(session: SparkSession): DataFrame = {
import session.sqlContext.implicits._
Seq[String](
"http://[1:2:3:4:5:6:7:8:9:10]",
"http://[1:2:3:4:5:6:7:8]",
"http://[1::]",
"http://[1:2:3:4:5:6:7::]",
@@ -202,16 +195,16 @@ class UrlFunctionsSuite extends SparkQueryCompareTestSuite {
).toDF("urls")
}

// def unsupportedUrlCases(session: SparkSession): DataFrame = {
// // Spark allows an empty authority component only when it is followed by a non-empty path,
// // query component, or fragment component. In the plugin, parse_url allows an
// // empty authority component without checking whether anything follows it.
// import session.sqlContext.implicits._
// Seq[String](
// "http://",
// "//"
// ).toDF("urls")
// }
def unsupportedUrlCases(session: SparkSession): DataFrame = {
// Spark allows an empty authority component only when it is followed by a non-empty path,
// query component, or fragment component. In the plugin, parse_url allows an
// empty authority component without checking whether anything follows it.
import session.sqlContext.implicits._
Seq[String](
"http://",
"//"
).toDF("urls")
}

def parseUrls(frame: DataFrame): DataFrame = {
frame.selectExpr(
@@ -246,29 +239,14 @@ class UrlFunctionsSuite extends SparkQueryCompareTestSuite {
parseUrls
}

// testSparkResultsAreEqual("Test parse_url unsupport cases", unsupportedUrlCases) {
// parseUrls
// }
testSparkResultsAreEqual("Test parse_url unsupport cases", unsupportedUrlCases) {
parseUrls
}

testSparkResultsAreEqual("Test parse_url with query and key", urlWithQueryKey) {
frame => frame.selectExpr(
"urls",
"parse_url(urls, 'QUERY', 'foo') as QUERY",
"parse_url(urls, 'QUERY', 'baz') as QUERY")
}

test("Test parse_url with regex like query") {
withGpuSparkSession(spark => {
val frame = urlWithRegexLikeQuery(spark)
val result = frame.selectExpr(
"urls",
"parse_url(urls, 'QUERY', 'a.c') as QUERY")
import spark.implicits._
val expected = Seq(
("http://foo/bar?abc=BAD&a.c=GOOD", "GOOD"),
("http://foo/bar?a.c=GOOD&abc=BAD", "GOOD")
).toDF("urls", "QUERY")
assert(result.collect().deep == expected.collect().deep)
})
}
}