Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use parse_url kernel for PROTOCOL parsing #9481

Merged
merged 35 commits into from
Dec 12, 2023

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Oct 19, 2023

Contributes to #8963

This PR add plugin support for PROTOCOL parsing in parse_url.

Related JNI PR: NVIDIA/spark-rapids-jni#1502

Most of the code is reused from the old PR #8761

Performance test:

spark.time(df.selectExpr("COUNT(parse_url(url, 'PROTOCOL')) as pr1", "COUNT(parse_url(url, 'PROTOCOL')) as pr2", "COUNT(parse_url(url, 'PROTOCOL')) as pr3", "COUNT(parse_url(url, 'PROTOCOL')) as pr4", "COUNT(parse_url(url, 'PROTOCOL')) as pr5", "COUNT(parse_url(url, 'PROTOCOL')) as pr6", "COUNT(parse_url(url, 'PROTOCOL')) as pr7", "COUNT(parse_url(url, 'PROTOCOL')) as pr8", "COUNT(parse_url(url, 'PROTOCOL')) as pr9", "COUNT(parse_url(url, 'PROTOCOL')) as pr0").show())

against 651191 urls' dataset, repeated 100 times.

CPU: 10752 ms
GPU: 1032 ms

It's about 10x faster

docs/compatibility.md Outdated Show resolved Hide resolved
Signed-off-by: Haoyang Li <[email protected]>
@sameerz sameerz added the feature request New feature or request label Oct 23, 2023
),
'ParseUrl')

def test_parse_url_too_many_args():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a test of Spark itself rather than the plugin? IIUC this error is going to be thrown not because the plugin throws it but during the query analysis which occurs before the plugin runs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right! I removed this case and related unnecessary error handling code.

Comment on lines 149 to 161
def test_parse_url_protocol():
assert_gpu_and_cpu_are_equal_collect(
lambda spark : unary_op_df(spark, url_gen).selectExpr(
"a",
"parse_url(a, 'PROTOCOL')"
))

def test_parse_url_protocol_edge_cases():
assert_gpu_and_cpu_are_equal_collect(
lambda spark : unary_op_df(spark, edge_cases_gen).selectExpr(
"a",
"parse_url(a, 'PROTOCOL')"
))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Parameterize this test on the data generation rather than duplicate it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 163 to 197
@allow_non_gpu('ProjectExec', 'ParseUrl')
def test_parse_url_host_fallback():
assert_gpu_fallback_collect(
lambda spark : unary_op_df(spark, url_gen).selectExpr(
"a",
"parse_url(a, 'HOST')"
),
'ParseUrl')

@allow_non_gpu('ProjectExec', 'ParseUrl')
def test_parse_url_path_fallback():
assert_gpu_fallback_collect(
lambda spark : unary_op_df(spark, url_gen).selectExpr(
"a",
"parse_url(a, 'PATH')"
),
'ParseUrl')

@allow_non_gpu('ProjectExec', 'ParseUrl')
def test_parse_url_query_fallback():
assert_gpu_fallback_collect(
lambda spark : unary_op_df(spark, url_gen).selectExpr(
"a",
"parse_url(a, 'QUERY')"
),
'ParseUrl')

@allow_non_gpu('ProjectExec', 'ParseUrl')
def test_parse_url_ref_fallback():
assert_gpu_fallback_collect(
lambda spark : unary_op_df(spark, url_gen).selectExpr(
"a",
"parse_url(a, 'REF')"
),
'ParseUrl')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these should be refactored to a single test parameterized on the literal rather than duplicated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 3234 to 3235
ParamCheck("partToExtract", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING)),
// Should really be an OptionalParam
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have a withPsNote denoting that it only supports PROTOCOL so that limitation shows up in the documentation. Otherwise the documentation implies we fully support this as long as a literal is supplied.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

"http://foo.com/blah_blah_(wikipedia)_(again)",
"http://www.example.com/wpstyle/?p=364",
"https://www.example.com/foo/?bar=baz&inga=42&quux",
# "http://✪df.ws/123",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failing cases need to be uncommented and passing before this should be merged.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@thirtiseven
Copy link
Collaborator Author

Depends on NVIDIA/spark-rapids-jni#1554

Signed-off-by: Haoyang Li <[email protected]>
Copy link
Collaborator

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests can be enabled now. Question about URL vs URI, but I'm fine either way.

docs/compatibility.md Outdated Show resolved Hide resolved
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven self-assigned this Nov 22, 2023
@thirtiseven thirtiseven marked this pull request as ready for review November 22, 2023 07:20
@thirtiseven
Copy link
Collaborator Author

Thanks all for the review! I think all comments have been addressed and it's ready to merge once jni PR is merged.

@sameerz
Copy link
Collaborator

sameerz commented Dec 1, 2023

Please retarget to 24.02.

@thirtiseven thirtiseven changed the base branch from branch-23.12 to branch-24.02 December 1, 2023 16:49
@thirtiseven
Copy link
Collaborator Author

Verified locally that all protocol tests passed after NVIDIA/spark-rapids-jni#1554 merged, both for IT and the kaggle dataset.

@thirtiseven
Copy link
Collaborator Author

build

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven
Copy link
Collaborator Author

Hi @revans2 @jlowe Could you please take another look at this?

@revans2
Copy link
Collaborator

revans2 commented Dec 12, 2023

@hyperbolic2346 cold you also take a look as your change request is blocking this from going in.

@thirtiseven I think @jlowe is out on vacation so you probably should not wait for him to come back.

On a side note, this is crazy fast!

I ran

spark.time(spark.range(0, 1000000000L, 1, 48).selectExpr("CONCAT('htpp://', CAST(id as STRING)) as u").selectExpr("parse_url(u, 'PROTOCOL') as p").selectExpr("MAX(p)", "MIN(p)").show())

both on the GPU and the CPU and the GPU ended up being 452x faster than 12 CPU cores. This is a crappy test that is not real world at all, but still. That is really fast.

Copy link
Collaborator

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this, but I am not a java/scala reviewer.

@revans2
Copy link
Collaborator

revans2 commented Dec 12, 2023

I'm fine with this, but I am not a java/scala reviewer.

I get that, but I just don't like to use my admin merge powers to push a red merge button unless I have to.

@revans2 revans2 merged commit c6dee9b into NVIDIA:branch-24.02 Dec 12, 2023
38 checks passed
@thirtiseven thirtiseven deleted the parse_url_protocol branch December 19, 2023 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants