Log all gpu rank stdout/err to MosaicML platform #2839

Merged
jjanezhang merged 39 commits into dev from jane/log-all-gpu-ranks
Feb 5, 2024

@jjanezhang jjanezhang commented Jan 11, 2024

Log all gpu rank stdout/err to MosaicML platform

We are no longer differentiating between stderr and stdout for gpu rank logs.

If the run is launched by the Mosaic platform, the logs of GPU local ranks 1-7 are redirected to gpu_x.txt (where x is the local rank) in the Mosaic logs directory for each node rank.
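The redirect described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the environment variable names `EXAMPLE_PLATFORM` and `EXAMPLE_LOG_DIR` and the helpers `rank_log_path` and `run_rank` are hypothetical, and only the `gpu_x.txt` naming comes from the description. Passing `stderr=subprocess.STDOUT` is what sends both streams to the same file.

```python
import subprocess
import sys
from pathlib import Path
from typing import Mapping, Optional, Sequence

# Hypothetical environment variable names, chosen for this sketch only.
PLATFORM_ENV_VAR = "EXAMPLE_PLATFORM"
LOG_DIR_ENV_VAR = "EXAMPLE_LOG_DIR"


def rank_log_path(local_rank: int, env: Mapping[str, str]) -> Optional[Path]:
    """Return the gpu_x.txt path for a non-zero local rank, or None if no redirect applies."""
    on_platform = env.get(PLATFORM_ENV_VAR, "false").lower() == "true"
    log_dir = env.get(LOG_DIR_ENV_VAR)
    if not on_platform or log_dir is None or local_rank == 0:
        # Rank 0 and runs outside the platform keep writing to the console.
        return None
    return Path(log_dir) / f"gpu_{local_rank}.txt"


def run_rank(cmd: Sequence[str], local_rank: int, env: Mapping[str, str]) -> int:
    """Run one rank's command, sending stdout and stderr to a single file when redirected."""
    path = rank_log_path(local_rank, env)
    if path is None:
        return subprocess.run(list(cmd), check=False).returncode
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as log_file:
        # stderr=subprocess.STDOUT merges both streams into the same file.
        result = subprocess.run(
            list(cmd), stdout=log_file, stderr=subprocess.STDOUT, check=False
        )
    return result.returncode
```

With this shape, a rank whose path is `None` behaves exactly as before, so the redirect only takes effect when both hypothetical variables are set.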


On the Mosaic platform: [screenshot]

When the Mosaic env var is false: [screenshot]

@jjanezhang jjanezhang marked this pull request as ready for review January 24, 2024 21:07
@jjanezhang jjanezhang requested review from eracah, dakinggg and a team as code owners January 24, 2024 21:07
@jjanezhang jjanezhang marked this pull request as draft January 24, 2024 22:36

@siriuslee siriuslee left a comment

Platform parts look good if the stdout / stderr change is acceptable

@jjanezhang jjanezhang marked this pull request as ready for review January 26, 2024 21:58

@dakinggg dakinggg left a comment

Does this change mean that runs on the mosaic platform will no longer show the run logs via mcli logs?

jjanezhang commented Jan 31, 2024

> Does this change mean that runs on the mosaic platform will no longer show the run logs via mcli logs?

@dakinggg We will still be able to see the GPU rank 0 logs (plus the logs of failed GPU ranks when the run fails) using mcli logs for non-finetuning runs. The only difference is that when your run fails, you will get the full chronological logs of all failed GPU ranks instead of separate stdout and stderr. Additionally, you will be able to get individual GPU rank logs regardless of whether that rank failed.

@dakinggg dakinggg requested a review from mvpatel2000 January 31, 2024 23:28
composer/cli/launcher.py (review thread, resolved, outdated)

@mvpatel2000 mvpatel2000 left a comment

Instead of killing the existing flags, can we just capture the stream and duplicate it if platform env exists?
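One way to read this suggestion is a tee: capture the child process's output through a pipe and write each line to more than one destination. The sketch below assumes that reading and is not necessarily what the PR ended up doing; `tee_process_output` and its `sinks` parameter are hypothetical names introduced here for illustration.

```python
import subprocess
import sys
from typing import IO, Sequence


def tee_process_output(cmd: Sequence[str], sinks: Sequence[IO[str]]) -> int:
    """Run cmd and copy every line of its combined stdout/stderr to each sink."""
    proc = subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the captured stream
        text=True,
    )
    assert proc.stdout is not None  # guaranteed because stdout=PIPE was requested
    for line in proc.stdout:
        for sink in sinks:
            sink.write(line)
            sink.flush()
    return proc.wait()
```

A caller could pass `[sys.stdout]` outside the platform and `[sys.stdout, log_file]` when the platform environment is detected, which keeps the existing console output while adding the per-rank file.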

composer/cli/launcher.py (7 review threads, resolved)
composer/loggers/mosaicml_logger.py (1 review thread, resolved, outdated)

@mvpatel2000 mvpatel2000 left a comment

LGTM!

composer/cli/launcher.py (review thread, resolved, outdated)
@jjanezhang jjanezhang merged commit 12261d6 into dev Feb 5, 2024
14 checks passed
@jjanezhang jjanezhang deleted the jane/log-all-gpu-ranks branch February 5, 2024 19:10
4 participants