-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Async write support for ORC #11865
base: branch-25.02
Are you sure you want to change the base?
Async write support for ORC #11865
Conversation
8d4a241
to
3b01026
Compare
build |
build |
Signed-off-by: Jihoon Son <[email protected]>
99d0c24
to
a4b1f65
Compare
build |
Any indication that this is a performance improvement for Orc? I assume so, but asking if you have data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but want to hear about the coalesce question from @abellina
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, barring the accidental inclusion of operatorsScore.csv
and supportedExprs.csv
.
@abellina I haven't run any performance test yet, but agree it will be useful. Unfortunately, I'm quite stuck in other work right now. Do you mind if I run some benchmark later and post the results here? |
build |
gen_list = [('_c' + str(i), gen) for i, gen in enumerate(orc_gen)] | ||
assert_gpu_and_cpu_writes_are_equal_collect( | ||
lambda spark, path: gen_df(spark, gen_list, length=num_rows).write.orc(path), | ||
lambda spark, path: spark.read.orc(path).orderBy([('_c' + str(i)) for i in range(num_cols)]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the GPU and CPU orc readers can return the same rows in different orders. Added an orderBy
to make the test deterministic.
build |
The test failure is filed in #11897. |
build |
Follow up to #11730.
This PR adds the async write support for the ORC format as well as an integration test for it. This PR is marked as a draft since it depends on #11855.