-
Notifications
You must be signed in to change notification settings - Fork 429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add OOM observer with memory visualizations #2958
Conversation
Thank you! This is much needed! |
Co-authored-by: Mihir Patel <[email protected]>
@cli99 mind adding a unit test to make sure we can add both a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @cli99 barring a couple of nits and the test case above 🎉 !
Also if you could figure out how to mock/add the unit test Mihir described that would be awesome but imo not blocking
Co-authored-by: Charles Tang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just some minor concerns :)
added a test to show the two callbacks are compatible. Both |
@j316chuck, using warning.warn causes issues in a bunch of cpu callback tests, where they don't catch user warnings, so I switch to log.warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is really nice!
I have some minor clean up comments but PR LGTM
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Sorry for being a bit nit heavy on reviews 😂
This PR add a callback to generate visualizations of the state of allocated memory during an OutOfMemory exception. This callback registers an observer with the allocator that will be called everytime it is about to raise an OutOfMemoryError before any memory has been release while unwinding the exception. OOMObserver is attached to the Trainer at init stage.
To enable this, in the yaml config, do below, also need this change in the foundry mosaicml/llm-foundry#932
The visualizations include
snapshot of the memory state,
filename_snapshot.pickle
, can be used later for analysis with https://github.com/pytorch/pytorch/blob/main/torch/cuda/_memory_viz.pytrace plot,
filename_trace_plot.html
segment plot,
filename_segment_plot.html
segment flamegraph,
filename_segment_flamegraph.svg
memory flamegrap, filename_memory_flamegraph.svg`