Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OOM observer with memory visualizations #2958

Merged
merged 34 commits into from
Feb 2, 2024
Merged

Conversation

cli99
Copy link
Contributor

@cli99 cli99 commented Feb 1, 2024

This PR add a callback to generate visualizations of the state of allocated memory during an OutOfMemory exception. This callback registers an observer with the allocator that will be called everytime it is about to raise an OutOfMemoryError before any memory has been release while unwinding the exception. OOMObserver is attached to the Trainer at init stage.

To enable this, in the yaml config, do below, also need this change in the foundry mosaicml/llm-foundry#932

callbacks:
    oom_observer:
        {
            "folder": "traces",
            "overwrite": true,
            "filename": "rank{rank}_oom",
            "remote_file_name": "oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        }

The visualizations include

@cli99 cli99 marked this pull request as ready for review February 1, 2024 21:26
@j316chuck
Copy link
Contributor

Thank you! This is much needed!

composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
@cli99 cli99 requested a review from mvpatel2000 February 1, 2024 22:19
@j316chuck
Copy link
Contributor

j316chuck commented Feb 2, 2024

@cli99 mind adding a unit test to make sure we can add both a MemorySnapshot and an OOMSnapshot callback. Just want to test the scenario where someone calls torch.cuda._memory.snapshot twice.

Copy link
Contributor

@j316chuck j316chuck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @cli99 barring a couple of nits and the test case above 🎉 !

Also if you could figure out how to mock/add the unit test Mihir described that would be awesome but imo not blocking

composer/callbacks/oom_observer.py Show resolved Hide resolved
composer/callbacks/oom_observer.py Show resolved Hide resolved
composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some minor concerns :)

composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
@cli99
Copy link
Contributor Author

cli99 commented Feb 2, 2024

@cli99 mind adding a unit test to make sure we can add both a MemorySnapshot and an OOMSnapshot callback. Just want to test the scenario where someone calls torch.cuda._memory.snapshot twice.

added a test to show the two callbacks are compatible. Both torch.cuda._memory.snapshot and torch.cuda.memory._record_memory_history allows multiple calls. @j316chuck

@cli99
Copy link
Contributor Author

cli99 commented Feb 2, 2024

@j316chuck, using warning.warn causes issues in a bunch of cpu callback tests, where they don't catch user warnings, so I switch to log.warning

Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is really nice!

I have some minor clean up comments but PR LGTM

composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
composer/callbacks/oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
tests/callbacks/test_oom_observer.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sorry for being a bit nit heavy on reviews 😂

@cli99 cli99 merged commit 21bc3db into mosaicml:dev Feb 2, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants