[BUG] unsupportedoperators.csv shows stageID=-1 for certain unsupported operator #1156

Closed
viadea opened this issue Jul 1, 2024 · 3 comments · Fixed by #1437
Assignees
Labels
bug Something isn't working core_tools Scope the core module (scala)

Comments

@viadea
Collaborator

viadea commented Jul 1, 2024

Describe the bug
unsupportedoperators.csv shows stageID=-1 for certain unsupported operators.

Does this mean the Qual tool could not determine which stage is associated with those unsupported operators?
If so, the Qual tool reports a very low % of unsupported duration, which could be wrong.

@viadea viadea added bug Something isn't working ? - Needs Triage labels Jul 1, 2024
@amahussein amahussein added the core_tools Scope the core module (scala) label Jul 3, 2024
@amahussein amahussein self-assigned this Jul 3, 2024
@amahussein
Collaborator

After taking a look at the eventlog: get_json_object appears as an expression of a Project node.
Currently, we can link a Project to a stageID iff it is contained inside a WholeStageCodegen node, because the latter has metrics that can be linked to a stageID.
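
To illustrate the metric-based linkage described above, here is a minimal Python sketch. It is not the tools' actual Scala code. It assumes that each plan node exposes the accumulator IDs of its metrics and that each stage exposes the accumulator IDs it updated; the function name, the dictionaries, and the sample IDs are all hypothetical.

```python
from typing import Dict, Set


def link_nodes_to_stages(
    node_accums: Dict[int, Set[int]],
    stage_accums: Dict[int, Set[int]],
) -> Dict[int, Set[int]]:
    """Map each node ID to the stage IDs that share an accumulator ID with it."""
    # Invert the stage view: accumulator ID -> stages that updated it.
    accum_to_stages: Dict[int, Set[int]] = {}
    for stage_id, accums in stage_accums.items():
        for accum_id in accums:
            accum_to_stages.setdefault(accum_id, set()).add(stage_id)

    links: Dict[int, Set[int]] = {}
    for node_id, accums in node_accums.items():
        stages: Set[int] = set()
        for accum_id in accums:
            stages |= accum_to_stages.get(accum_id, set())
        links[node_id] = stages
    return links


# Hypothetical data: node 1 stands for a Project with no metrics of its own.
node_accums = {0: {10, 11}, 1: set(), 2: {20}}
stage_accums = {3: {10, 11, 12}, 4: {20}}
links = link_nodes_to_stages(node_accums, stage_accums)
```

In this sketch node 1 ends up with an empty set of stages, which corresponds to the situation the report shows as stageID=-1.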

There is no clear path to work around this. We can try adding heuristics that link an exec to a stage based on the neighboring expressions, but then we need to come up with a well-defined strategy for that. Otherwise, it will become a big mess of heuristics that is hard to understand.

@amahussein
Collaborator

We need to investigate further by checking the SHS (Spark History Server) code that parses the RDD information inside a stage.
There might be further information about the linkage between the execs and their stages.
This concern has been raised before in #794.

@amahussein
Collaborator

amahussein commented Jul 9, 2024

Bug identified in the tools code

  • The heuristic that looks at the stageIDs of neighbouring nodes (the getStageToExec link) does not update the ExecInfo objects. As a result, the final output report does not show the linked stages.
  • The implementation of the heuristic flattens all execs into a sequence without taking into consideration the (sink-source) nature of the edges. As a result, it compares pairs of nodes by index even when no edge connects them.
  • The heuristic introduces extra overhead because it reprocesses everything that was already processed in previous steps.
  • The Profiling tool also has a bug: it assigns stageIDs based only on accumulatorIDs. This affects the logic, but it is invisible to the user because the Profiling tool only generates reports related to metrics.
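
The second point can be made concrete with a small Python sketch. The plan below is hypothetical (two branches feeding a join) and is not taken from the eventlog in this issue; it only shows how pairing nodes by position in a flattened sequence differs from pairing them by actual edges.

```python
# Hypothetical plan: edges are (source, sink) pairs.
edges = [
    ("ScanA", "ProjectA"),
    ("ProjectA", "Join"),
    ("ScanB", "ProjectB"),
    ("ProjectB", "Join"),
]

# One possible flattening of the same plan into a sequence.
flattened = ["ScanA", "ProjectA", "ScanB", "ProjectB", "Join"]

# Pairs obtained by looking at consecutive indices.
index_pairs = set(zip(flattened, flattened[1:]))
edge_pairs = set(edges)

# Pairs treated as neighbours although no edge connects them.
spurious = index_pairs - edge_pairs
# Real edges that the index-based view never examines.
missed = edge_pairs - index_pairs
```

Here `spurious` contains the pair (ProjectA, ScanB), which belongs to two different branches, while the real edge (ProjectA, Join) is in `missed`.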

The suggested plan to fix this bug is:

  • Create a standalone parser for the SparkGraph that builds the map from nodes to stages. The implementation should be a very simple graph traversal.
  • The new code should be usable by both the Qual and Profiling tools so that their logic stays consistent.
  • ETA: 3-5 days, to account for removing dead code across both tools.
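
A rough idea of what such a traversal could look like, as a Python sketch rather than the actual Scala implementation. It assumes some nodes already have stages (for example from accumulator IDs) and lets the remaining nodes inherit stages from an adjacent node. The function name and inputs are hypothetical, and the sketch deliberately ignores stage boundaries such as shuffles, which a real implementation would have to respect.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def assign_stages(
    nodes: List[int],
    edges: Iterable[Tuple[int, int]],
    known: Dict[int, Set[int]],
) -> Dict[int, Set[int]]:
    """Propagate stage IDs from nodes with known stages to adjacent nodes."""
    # Build an undirected adjacency map from the (source, sink) edges.
    neighbors: Dict[int, Set[int]] = {n: set() for n in nodes}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    assigned = {n: set(known.get(n, ())) for n in nodes}
    # Breadth-first walk starting from every node that already has a stage.
    queue = deque(n for n in nodes if assigned[n])
    while queue:
        current = queue.popleft()
        for nb in sorted(neighbors[current]):
            if not assigned[nb]:
                # Unassigned node: inherit the stages of its neighbour.
                assigned[nb] = set(assigned[current])
                queue.append(nb)
    return assigned


# Hypothetical chain 0 -> 1 -> 2 where node 1 has no accumulator IDs,
# plus an isolated node 3 that cannot be linked to anything.
result = assign_stages(
    nodes=[0, 1, 2, 3],
    edges=[(0, 1), (1, 2)],
    known={0: {5}, 2: {5}},
)
```

Node 1 picks up stage 5 from its neighbours, while node 3 stays unassigned because it has neither metrics nor edges.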

CC: @mattahrens

@mattahrens mattahrens assigned tgravescs and unassigned amahussein Oct 9, 2024
@amahussein amahussein assigned amahussein and unassigned tgravescs Oct 22, 2024
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Nov 26, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#1156

This adds logic to walk the SparkGraph in order to assign execs to
stages. For nodes that have no AccumIDs, the clusterization process
relies on adjacent nodes.

3 participants