-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make proxy dump print out more meaningful information. #1412
base: develop
Are you sure you want to change the base?
Conversation
fixed: HPEXA-63
for (int s=0; s<op->nsubs; s++) { | ||
struct ncclProxySubArgs* sub = op->subs+s; | ||
if (op->state == ncclProxyOpProgress) { | ||
char status = ' '; | ||
if (op->pattern == ncclPatternRecv) { | ||
if (! op->send) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's wrong with original "op->pattern == ncclPatternRecv" checking?
if (sub->posted < sub->nsteps && sub->posted < sub->done + NCCL_STEPS) status = 'I'; // Init | ||
else if (sub->received < sub->posted) status = 'R'; // Receiving | ||
else if (sub->received < sub->transmitted) status = 'R'; // Receiving | ||
else if (sub->transmitted < sub->received) status = 'F'; // Flushing | ||
else if (sub->done < sub->transmitted) status = 'G'; // Waiting on GPU | ||
else status = 'D'; // Done | ||
} else if (op->pattern == ncclPatternSend) { | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many "patterns". Why remove the original checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ncclPatternSend/Recv is used in only ncclSend/ncclRecv. In this PR, we want to see all the send/recv inside a collective operation (such as allreduce) and this pattern check prevents it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand the reason. If this PR is about dump proxy information, it should not change proxy code handling itself. Unless original proxy code was wrong here? We can add proxy dump at all if/else if/else conditions without changing logic of proxy code.
Details
Proxy dump is a debugging tool to print out the current state and outstanding operations of the proxy. Proxy dump was not working properly prior to this patch. This PR made some modification to proxy dump code. This does not affect the communication logic in anyway.
Example:
Note that NCCL 2.23.4 made improvements in this area. We should revisit this when we eventually sync to that NCCL version.
This proves to be useful for debugging large scale job for us, and our partner are asking us to put this in as soon as possible.
Work item: : HPEXA-63 / LWPCOMMLIBS-559
What were the changes?
Change how proxy dump prints out information (added more info).
Why were the changes made?
This proxy dump is useful in debugging large scale multi-node hang.
How was the outcome achieved?
printf
Additional Documentation:
Note that NCCL 2.23.4 made improvements in this area. We should revisit this when we eventually sync to that NCCL version.
Proxy dump is not used in normal circumstance. This only affects debug instance when we set NCCL_DUMP_PROXY_SIGNAL.
Approval Checklist
Do not approve until these items are satisfied.