Add OOM observer with memory visualizations #2958

cli99 · 2024-02-01T21:24:18Z

This PR adds a callback that generates visualizations of the state of allocated memory during an OutOfMemory exception. The callback registers an observer with the allocator, which is called every time the allocator is about to raise an OutOfMemoryError, before any memory has been released while unwinding the exception. OOMObserver is attached to the Trainer at the init stage.
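A minimal sketch of the observer mechanism described above, not the code from this PR. It assumes PyTorch's private hooks `torch._C._cuda_attach_out_of_memory_observer`, `torch.cuda.memory._record_memory_history`, and `torch.cuda.memory._snapshot`, which are underscore-prefixed and may change between releases. The helper names `make_oom_observer` and `attach_oom_observer` are hypothetical.

```python
import pickle
from pathlib import Path


def make_oom_observer(folder: str, filename: str):
    """Build a callback with the four-argument signature the allocator passes."""

    def oom_observer(device, alloc, device_alloc, device_free):
        # Imported lazily so this sketch can be loaded on a machine without torch.
        import torch

        # Called just before the OutOfMemoryError is raised, so the snapshot
        # still reflects the memory state that caused the failure.
        snapshot = torch.cuda.memory._snapshot()
        out_dir = Path(folder)
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / f"{filename}_snapshot.pickle", "wb") as f:
            pickle.dump(snapshot, f)

    return oom_observer


def attach_oom_observer(folder: str = "traces", filename: str = "rank0_oom") -> bool:
    """Attach the observer if CUDA is usable; return whether it was attached."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Allocation history must be recorded for the snapshot to carry stack traces.
    torch.cuda.memory._record_memory_history(max_entries=100000)
    torch._C._cuda_attach_out_of_memory_observer(make_oom_observer(folder, filename))
    return True
```

On a machine without a GPU, `attach_oom_observer()` simply returns `False`, which roughly mirrors the idea that the callback only makes sense where the CUDA allocator is in use.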

To enable this, add the following to the YAML config. It also requires the corresponding change in foundry, mosaicml/llm-foundry#932:

callbacks:
    oom_observer:
        {
            "folder": "traces",
            "overwrite": true,
            "filename": "rank{rank}_oom",
            "remote_file_name": "oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        }
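The `{rank}` and `{run_name}` placeholders in the config above are filled in per process at runtime. The sketch below only illustrates that substitution with plain `str.format`; `expand_name` is a hypothetical helper, and Composer's own formatting logic may support more fields than the two shown here.

```python
def expand_name(template: str, *, rank: int, run_name: str) -> str:
    """Fill the {rank} and {run_name} placeholders used in the config."""
    return template.format(rank=rank, run_name=run_name)


# Values copied from the config in this PR description.
config = {
    "folder": "traces",
    "overwrite": True,
    "filename": "rank{rank}_oom",
    "remote_file_name": "oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
}

# Hypothetical rank and run name, chosen only for illustration.
local_stem = expand_name(config["filename"], rank=3, run_name="my-run")
remote_stem = expand_name(config["remote_file_name"], rank=3, run_name="my-run")
```

Because each rank gets its own file stem, traces from different processes should not overwrite one another in the shared folder or bucket.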

The visualizations include:

snapshot of the memory state, filename_snapshot.pickle, which can be analyzed later with https://github.com/pytorch/pytorch/blob/main/torch/cuda/_memory_viz.py
trace plot, filename_trace_plot.html
segment plot, filename_segment_plot.html
segment flamegraph, filename_segment_flamegraph.svg
memory flamegraph, filename_memory_flamegraph.svg
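For a quick look at a saved snapshot without rendering the plots, something like the following may help. It is a sketch that assumes the pickle holds a dict with a `segments` list whose entries carry `total_size` and `allocated_size` in bytes, as in the snapshot layout documented for recent PyTorch releases; the layout belongs to a private API and may differ in other versions. `summarize_snapshot` and `summarize_snapshot_file` are hypothetical helpers.

```python
import pickle
from typing import Any, Dict


def summarize_snapshot(snapshot: Dict[str, Any]) -> Dict[str, int]:
    """Total up reserved and allocated bytes across all allocator segments."""
    segments = snapshot.get("segments", [])
    reserved = sum(seg["total_size"] for seg in segments)
    allocated = sum(seg["allocated_size"] for seg in segments)
    return {
        "num_segments": len(segments),
        "reserved_bytes": reserved,
        "allocated_bytes": allocated,
        # Memory held by the allocator but not handed out to tensors.
        "unallocated_in_segments_bytes": reserved - allocated,
    }


def summarize_snapshot_file(path: str) -> Dict[str, int]:
    """Load a *_snapshot.pickle file and summarize it."""
    with open(path, "rb") as f:
        return summarize_snapshot(pickle.load(f))


# Synthetic snapshot for illustration only, not real measurement data.
example = {
    "segments": [
        {"total_size": 2097152, "allocated_size": 1048576},
        {"total_size": 1024, "allocated_size": 512},
    ]
}
summary = summarize_snapshot(example)
```

A large gap between reserved and allocated bytes at OOM time is often a hint of fragmentation, which the segment plot can then show in more detail.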

j316chuck · 2024-02-01T21:29:07Z

Thank you! This is much needed!

composer/callbacks/oom_observer.py

tests/callbacks/test_oom_observer.py

Co-authored-by: Mihir Patel <[email protected]>

composer/callbacks/oom_observer.py

j316chuck · 2024-02-02T00:13:01Z

@cli99 mind adding a unit test to make sure we can add both a MemorySnapshot and an OOMSnapshot callback? Just want to test the scenario where someone calls torch.cuda.memory._snapshot twice.

composer/callbacks/oom_observer.py

j316chuck

Looks great @cli99 barring a couple of nits and the test case above 🎉 !

Also if you could figure out how to mock/add the unit test Mihir described that would be awesome but imo not blocking

composer/callbacks/oom_observer.py

Co-authored-by: Charles Tang <[email protected]>

mvpatel2000

LGTM, just some minor concerns :)

composer/callbacks/oom_observer.py

tests/callbacks/test_oom_observer.py

Co-authored-by: Mihir Patel <[email protected]>

cli99 · 2024-02-02T04:30:06Z

@cli99 mind adding a unit test to make sure we can add both a MemorySnapshot and an OOMSnapshot callback? Just want to test the scenario where someone calls torch.cuda.memory._snapshot twice.

Added a test showing that the two callbacks are compatible. Both torch.cuda.memory._snapshot and torch.cuda.memory._record_memory_history allow multiple calls. @j316chuck

cli99 · 2024-02-02T19:21:12Z

@j316chuck, using warnings.warn causes issues in a bunch of CPU callback tests, which don't catch user warnings, so I switched to log.warning
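The difference between the two approaches can be reproduced in isolation. The sketch below assumes a test run where warnings are promoted to errors, which is one common reason an uncaught `UserWarning` fails a test; the message text and function names are hypothetical.

```python
import logging
import warnings

log = logging.getLogger(__name__)


def notify_with_warning() -> None:
    # Emits a UserWarning, which strict warning filters turn into an exception.
    warnings.warn("hypothetical message: OOM observer is inactive", UserWarning)


def notify_with_log() -> None:
    # Goes through the logging system, so warning filters do not affect it.
    log.warning("hypothetical message: OOM observer is inactive")


def raises_under_strict_filter(fn) -> bool:
    """Return True if fn raises a UserWarning when warnings are errors."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            fn()
        except UserWarning:
            return True
    return False
```

If the warning is kept, an alternative is to relax the filter in the affected tests, for example with pytest's `@pytest.mark.filterwarnings("ignore::UserWarning")` marker, which appears to be the direction the later commits in this PR took.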

mvpatel2000

The test is really nice!

I have some minor clean up comments but PR LGTM

composer/callbacks/oom_observer.py

tests/callbacks/test_oom_observer.py

Co-authored-by: Mihir Patel <[email protected]>

tests/callbacks/test_oom_observer.py

Co-authored-by: Mihir Patel <[email protected]>

mvpatel2000

LGTM!

Sorry for being a bit nit heavy on reviews 😂

cli99 added 3 commits February 1, 2024 21:01

add oomobserver

bb75f65

update docstring

0054b52

Merge branch 'dev' into oom-observer

2e96d9c

cli99 marked this pull request as ready for review February 1, 2024 21:26

cli99 requested review from j316chuck, mvpatel2000 and dakinggg February 1, 2024 21:26

mvpatel2000 reviewed Feb 1, 2024

composer/callbacks/oom_observer.py (outdated, resolved)

tests/callbacks/test_oom_observer.py (outdated, resolved)

tests/callbacks/test_oom_observer.py (outdated, resolved)

tests/callbacks/test_oom_observer.py (outdated, resolved)

Update composer/callbacks/oom_observer.py

92faf4f

Co-authored-by: Mihir Patel <[email protected]>

cli99 mentioned this pull request Feb 1, 2024

add oom observer callback mosaicml/llm-foundry#932

Merged

cli99 added 3 commits February 1, 2024 21:45

use pyskip

a0c6696

call trainer fit

99395a8

fix ci

875546a

cli99 requested a review from mvpatel2000 February 1, 2024 22:19

j316chuck reviewed Feb 2, 2024

composer/callbacks/oom_observer.py (outdated, resolved)

j316chuck reviewed Feb 2, 2024

composer/callbacks/oom_observer.py (resolved)

j316chuck reviewed Feb 2, 2024

composer/callbacks/oom_observer.py (resolved)

composer/callbacks/oom_observer.py (resolved)

composer/callbacks/oom_observer.py (outdated, resolved)

Update composer/callbacks/oom_observer.py

66d093c

Co-authored-by: Charles Tang <[email protected]>

mvpatel2000 reviewed Feb 2, 2024

composer/callbacks/oom_observer.py (outdated, resolved)

composer/callbacks/oom_observer.py (outdated, resolved)

tests/callbacks/test_oom_observer.py (outdated, resolved)

tests/callbacks/test_oom_observer.py (outdated, resolved)

cli99 and others added 4 commits February 2, 2024 03:59

addresss comments

f2f94d3

Merge branch 'dev' into oom-observer

43118ca

Update composer/callbacks/oom_observer.py

5a74d34

Co-authored-by: Mihir Patel <[email protected]>

add test wiht snapshot

ba6c859

cli99 added 4 commits February 2, 2024 05:12

update doc

637208d

fix typo

cc23887

use log info

c314a5c

fix format

1d0553b

cli99 added 5 commits February 2, 2024 05:52

fix format

1f2bf43

fix ci

95104b5

fix cpu test

1faf75b

Merge branch 'dev' into oom-observer

bddca6c

fix ci

9d4e02d

cli99 requested review from mvpatel2000 and j316chuck February 2, 2024 19:19

mvpatel2000 reviewed Feb 2, 2024

cli99 and others added 5 commits February 2, 2024 11:28

Update tests/callbacks/test_oom_observer.py

1e8c98f

Co-authored-by: Mihir Patel <[email protected]>

Update composer/callbacks/oom_observer.py

f5d6db7

Co-authored-by: Mihir Patel <[email protected]>

Update composer/callbacks/oom_observer.py

b48a720

Co-authored-by: Mihir Patel <[email protected]>

Update composer/callbacks/oom_observer.py

b860bc0

Co-authored-by: Mihir Patel <[email protected]>

update test

07a8bec

mvpatel2000 reviewed Feb 2, 2024

tests/callbacks/test_oom_observer.py (outdated, resolved)

tests/callbacks/test_oom_observer.py (outdated, resolved)

tests/callbacks/test_oom_observer.py (outdated, resolved)

cli99 and others added 6 commits February 2, 2024 11:45

Update tests/callbacks/test_oom_observer.py

f91b854

Co-authored-by: Mihir Patel <[email protected]>

Update tests/callbacks/test_oom_observer.py

7b7f30c

Co-authored-by: Mihir Patel <[email protected]>

Update tests/callbacks/test_oom_observer.py

78bce44

Co-authored-by: Mihir Patel <[email protected]>

Update composer/callbacks/oom_observer.py

818f772

Co-authored-by: Mihir Patel <[email protected]>

Update composer/callbacks/oom_observer.py

c0ca7aa

Co-authored-by: Mihir Patel <[email protected]>

use warnings

5a07ae4

mvpatel2000 approved these changes Feb 2, 2024

cli99 added 2 commits February 2, 2024 21:14

add pytest filter user warnings in cpu callback tests

74c66ce

fix typo

fe3dd2c

cli99 merged commit 21bc3db into mosaicml:dev Feb 2, 2024
14 checks passed
