Log all gpu rank stdout/err to MosaicML platform #2839
Platform parts look good if the stdout / stderr change is acceptable
Does this change mean that runs on the Mosaic platform will no longer show the run logs via mcli logs?
@dakinggg We will still be able to see the gpu 0 rank logs (+ failed gpu rank logs when the run fails) using |
Instead of killing the existing flags, can we just capture the stream and duplicate it if platform env exists?
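For illustration, the "capture and duplicate" idea could look roughly like the sketch below. This is not code from the PR: `TeeStream`, `maybe_tee`, and the `EXAMPLE_PLATFORM_ENABLED` variable are hypothetical names, and the sketch only duplicates Python-level writes in the current process, not output produced by child processes.

```python
import io
import os


class TeeStream(io.TextIOBase):
    """Write-only text stream that forwards every write to several targets."""

    def __init__(self, *targets):
        self._targets = targets

    def writable(self):
        return True

    def write(self, text):
        # Duplicate the text to every underlying stream.
        for target in self._targets:
            target.write(text)
        return len(text)

    def flush(self):
        for target in self._targets:
            target.flush()


def maybe_tee(stream, copy, environ=os.environ, env_var="EXAMPLE_PLATFORM_ENABLED"):
    """Return a tee of `stream` and `copy` if the (hypothetical) env var is set,
    otherwise return `stream` unchanged."""
    if env_var in environ:
        return TeeStream(stream, copy)
    return stream
```

A caller could assign the result to `sys.stdout` so that existing flags keep working while a copy of the output goes to a second destination.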
… into jane/log-all-gpu-ranks
LGTM!
Co-authored-by: Mihir Patel <[email protected]>
Log all gpu rank stdout/err to MosaicML platform
We are no longer differentiating between stderr and stdout for gpu rank logs.
If the run is submitted by the Mosaic platform, logs from GPU local ranks 1-7 are redirected to gpu_x.txt in the Mosaic logs directory for each node rank.
On Mosaic platform
When Mosaic env var is false
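A minimal sketch of the redirection described in this PR, under stated assumptions: the environment variable names `EXAMPLE_PLATFORM_ENABLED` and `EXAMPLE_PLATFORM_LOG_DIR` and the helpers `rank_log_path` and `launch_rank` are hypothetical and not taken from the PR, and the `x` in gpu_x.txt is assumed to be the local rank. Local rank 0 keeps writing to the console, and stderr is merged into stdout for the other ranks.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional

# Hypothetical variable names, used only for this illustration.
PLATFORM_ENV_VAR = "EXAMPLE_PLATFORM_ENABLED"
PLATFORM_LOG_DIR_ENV_VAR = "EXAMPLE_PLATFORM_LOG_DIR"


def rank_log_path(local_rank: int, environ=os.environ) -> Optional[Path]:
    """Return the gpu_<rank>.txt path for a non-zero local rank on the platform,
    or None when output should stay on the console."""
    enabled = environ.get(PLATFORM_ENV_VAR, "false").lower() == "true"
    if not enabled or local_rank == 0:
        return None
    log_dir = Path(environ.get(PLATFORM_LOG_DIR_ENV_VAR, "."))
    return log_dir / f"gpu_{local_rank}.txt"


def launch_rank(cmd, local_rank: int, environ=os.environ) -> subprocess.Popen:
    """Start one rank's process, sending its output to a file if applicable."""
    path = rank_log_path(local_rank, environ)
    if path is None:
        # Rank 0, or not on the platform: inherit the parent's stdout/stderr.
        return subprocess.Popen(cmd)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as log_file:
        # stderr is merged into stdout, so the two are not differentiated.
        return subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
```

When the flag is unset or false, every rank falls through to the inherited console streams, which corresponds to the "env var is false" case above.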