[Llama] First run generating positional rotation matrix caches segfaults and OOMs #9837
When I run with async disabled, I see a variety of errors.
Repro instructions:
gdb --args python -m pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]
Expected output:
On a new T3000 machine, getting the first run to 2816 tokens generated in a single sequence took 6 crashes (soft-resetting between attempts):
crash 1:
crash 2:
crash 3 (same stack trace as above):
crash 4 (same stack trace as above):
crash 5:
crash 6 (hang)
Rerunning after this crash reached 2816 tokens and hit the known issue #9839. This completes the first run, and generation for 2k context is then relatively reliable.
- This handles cases where a device tensor is reassigned to a host tensor
- Exposed during model cache generation, which uses the following pattern: device_tensor = device_tensor.cpu()
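For context, a minimal self-contained illustration of the rebinding pattern the commit message describes; these are plain-Python stand-ins, not the real tt-metal tensor classes:

```python
# Illustrative stand-ins only -- not the actual tt-metal API.
class HostTensor:
    pass

class DeviceTensor:
    def cpu(self):
        # Copies data back to host. Once the caller rebinds its variable
        # to the returned HostTensor, the last Python reference to this
        # DeviceTensor is dropped and its device memory can be freed --
        # potentially while an async runtime still has work queued on it.
        return HostTensor()

device_tensor = DeviceTensor()
device_tensor = device_tensor.cpu()  # the same name now refers to a host tensor
```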
Hey @tstescoTT would you mind running with this commit cherry-picked: 4558673. It resolved the segfault for me locally.
@tstescoTT - can you help repro and confirm?
@mbahnasTT confirms tested - can be closed.
Describe the bug
With a fresh tt-metal weights cache for llama2 and llama3, the rotation matrices (rot mats) are generated and cached for later use on the first run. During this first-run caching, segfaults typically occur as the token position increases.
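A hedged sketch of the per-position caching pattern being described (the file layout and helper name are hypothetical, not the demo's actual code):

```python
import os
import torch

def get_rot_mat_cached(position: int, head_dim: int, cache_dir: str) -> torch.Tensor:
    # Hypothetical helper: on a fresh cache no file exists yet, so the
    # rot mat for each newly reached token position is computed and
    # written out -- the step during which the segfaults are observed.
    path = os.path.join(cache_dir, f"rot_mat_pos_{position}.pt")
    if os.path.exists(path):
        return torch.load(path)
    # Standard RoPE angles for this position: theta_i = base^(-2i / head_dim)
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = position * inv_freq
    rot = torch.stack([torch.cos(angles), torch.sin(angles)])
    torch.save(rot, path)
    return rot
```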
Workaround: a soft reset (tt-smi -r 0,1,2,3) can be used to reset the devices and run again, generating caches up to a higher token position each time until the entire max seq len is reached.
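A sketch of that retry loop, using the commands from this issue (the loop itself is illustrative):

```python
import subprocess

# Rerun the demo, soft-resetting the devices after each crash, until
# cache generation reaches the full max seq len.
cmd = [
    "pytest", "-svv",
    "demo_first_run_4k.py::test_LlamaModel_demo"
    "[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]",
]
while subprocess.run(cmd).returncode != 0:
    subprocess.run(["tt-smi", "-r", "0,1,2,3"])  # soft reset, as above
```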
Without doing first-run generation for the entire max seq len, segfaults or hangs may occur during applications if the current seq len does not have cached rot mats. Ideally this would be part of the initial model setup for applications, to avoid unpredictable caching during application runtime.
To Reproduce
Steps to reproduce the behavior:
Run the demo_first_run_4k.py script (https://gist.github.com/tstescoTT/86e31370590666e0edb920bd6bf615aa#file-demo_first_run_4k-py), forcing 4k token generation:
pytest -svv demo_first_run_4k.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]
Expected behavior
The rot mat cache generation should not cause segfaults or OOMs.
Ideally there should be a way to optionally pre-compute all the rot mats ahead of application runtime, to avoid unexpected caching and the resulting issues (e.g. on read-only file systems).
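A minimal plain-PyTorch sketch of what such an ahead-of-time precompute could look like (standard RoPE math; the function name and signature are assumptions, not the demo's actual helper):

```python
import torch

def precompute_rot_mats(head_dim: int, max_seq_len: int, base: float = 10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)  # [max_seq_len, head_dim // 2]
    # cos/sin tables for every position up to max_seq_len, computed once
    # at model setup so no cache generation happens at application runtime.
    return torch.cos(angles), torch.sin(angles)

cos_cache, sin_cache = precompute_rot_mats(head_dim=128, max_seq_len=4096)
```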
Example traces
Example segfault:
At higher token positions, DRAM OOM occurred:
Please complete the following environment information: