
[BUG] GPU physical plan missing ReusedExchange execs #10005

Open
rongou opened this issue Dec 8, 2023 · 6 comments
Labels: bug (Something isn't working), performance (A performance related task/issue)

Comments

@rongou
Collaborator

rongou commented Dec 8, 2023

Describe the bug
In an effort to improve the accuracy of the qualification tool, I'm trying to match up CPU and GPU execs in the physical plans from event logs for NDS queries, but they don't line up exactly. One cause seems to be that the GPU plan for some queries has fewer ReusedExchange nodes. This causes SparkPlanGraph to interpret the nodes and edges slightly differently between CPU and GPU, leading to a mismatch.

Steps/Code to reproduce bug
Run NDS queries with the Spark RAPIDS plugin disabled and enabled, capture the event logs, and run something like these commands over them:

grep "isFinalPlan=true" cpu_logs | grep -o -w '"ReusedExchange"' | wc
grep "isFinalPlan=true" gpu_logs | grep -o -w '"ReusedExchange"' | wc

They should show different counts.
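The comparison above can be wrapped into a self-contained sketch. The `cpu_logs` / `gpu_logs` contents below are fabricated stand-ins (real event logs contain full SparkListener JSON events); in practice, point the greps at the captured event log files instead:

```shell
#!/bin/sh
# Sketch: count ReusedExchange nodes appearing in final AQE plan lines.
# The two sample "logs" written here are fabricated for illustration only.
printf '%s\n' \
  'plan isFinalPlan=true {"nodeName":"ReusedExchange"} {"nodeName":"ReusedExchange"}' \
  > cpu_logs
printf '%s\n' \
  'plan isFinalPlan=true {"nodeName":"ReusedExchange"}' \
  > gpu_logs

# Same pipeline as above, with wc -l to print just the match count.
cpu_count=$(grep "isFinalPlan=true" cpu_logs | grep -o -w '"ReusedExchange"' | wc -l)
gpu_count=$(grep "isFinalPlan=true" gpu_logs | grep -o -w '"ReusedExchange"' | wc -l)
echo "cpu=$cpu_count gpu=$gpu_count"
```

With the sample data this prints `cpu=2 gpu=1`; on real logs, differing counts across many queries would confirm the mismatch.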

Expected behavior
The GPU plan should have the same number of reused exchange execs.

Environment details (please complete the following information)

  • Environment location: Both Standalone and Dataproc
  • Spark configuration settings related to the issue: seems to happen with various configurations

Additional context
Here are the final plans for query 24a (I removed the textual plan info to make it slightly easier to read).
cpu.txt
gpu.txt

@rongou rongou added bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 8, 2023
@sameerz sameerz added the performance A performance related task/issue label Dec 11, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Dec 14, 2023
@rongou
Collaborator Author

rongou commented Dec 14, 2023

A possibly related issue: even when a ReusedExchange is present, it sometimes has a different id and a different set of accumulators, causing SparkPlanGraph to treat it as distinct from the original Exchange, which again causes a discrepancy with the corresponding CPU plan.

Here is an example from query 56. The subquery contains the original GpuBroadcastExchange; the ReusedExchange should be reusing it, but instead it seems to create a new GpuBroadcastExchange.

q56_subquery.json
q56_reused_exchange.json

@winningsix
Collaborator

This bug may be fixed by https://github.com/firestarman/spark-rapids/pull/16/files.

@abellina
Collaborator

@winningsix I am confused about the PR from @firestarman. Was it created to address this issue (#10004)? It was opened and merged against @firestarman's private repo; is it going to be opened against our regular repo?

@firestarman
Collaborator

firestarman commented Dec 25, 2023

That PR tried to address a customer issue on reused exchanges; we wanted to verify it quickly, so it was merged to my own repo.
But according to the customer's feedback it does not work. I cannot reproduce this missing ReusedExchange issue locally by running query 56, so I do not know whether it is a fix.

@firestarman
Collaborator

firestarman commented Jan 2, 2024

This may be related to #10136, but I am not 100% sure.

@rongou Since I cannot repro this locally, you can do a quick verification by setting spark.rapids.sql.fileScanPrunePartition.enabled to false.
If the issue no longer reproduces, then #10136 should be the root cause.
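One way to apply that setting for a single run (a sketch: the application script name and the query-runner invocation are illustrative placeholders, not from this thread; `spark.rapids.sql.fileScanPrunePartition.enabled` is the config key named above):

```shell
# Sketch: rerun an NDS query with file-scan partition pruning disabled,
# to test whether the missing ReusedExchange nodes still appear.
# nds_query_runner.py is a hypothetical placeholder for the actual job.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.rapids.sql.fileScanPrunePartition.enabled=false \
  nds_query_runner.py
```

If the GPU event log captured from this run shows the same ReusedExchange count as the CPU run, that would point at #10136 as the root cause.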

@abellina
Collaborator

@rongou do you still see the issue after #10136 was merged as @firestarman suggests?
