Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Llama] first run with generating positional rotation matrix caches segfaults and OOMs #9837

Closed
tstescoTT opened this issue Jun 28, 2024 · 10 comments
Assignees
Labels
bug Something isn't working llama2-70b P1

Comments

@tstescoTT
Copy link
Contributor

Describe the bug

With a fresh tt-metal weights cache for llama2 and llama3 on first run the rotation matrices (rot mats) are cached for later use. For example:

2024-05-02 18:06:09.559 | DEBUG    | ttnn.operations.core:from_torch_and_dump:676 - Generating cache for /home/tt-admin/.cache/tt-metal-llama2-70b/llama2_cache/tt-metal-weights-cache/rot_mat_decode_69_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1, 128, 128]), dtype BFLOAT16, layout TILE

During this first run caching segfaults typically occur as the token position is increased.

Workaround: a soft reset (tt-smi -r 0,1,2,3) can be used to reset the device and run again to generate caches for a higher token position until the entire max seq len is reached.

Without doing first-run generation for the entire max seq len, the segfaults or hangs may occur during applications if the current seq len does not have cached rot mats. Ideally this would be part of the inital model set up for applications to avoid unpredictable caching during application runtime.

To Reproduce
Steps to reproduce the behavior:

  1. build tt-metal on commit a053bc8
  2. setup llama3 experimental demo (see https://github.com/tenstorrent/tt-metal/tree/main/models/experimental/llama2_70b#how-to-run)
  3. run demo_first_run_4k.py script (https://gist.github.com/tstescoTT/86e31370590666e0edb920bd6bf615aa#file-demo_first_run_4k-py) forcing 4k token generation. pytest -svv demo_first_run_4k.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]

Expected behavior
The rot mat cache generation should not cause segfaults or OOMs.
Ideally there should be a way to optionally pre-compute all the rot mats ahead of application runtime to avoid unexpected caching and resulting issues, e.g. with read-only file systems.

Example traces
Example segfault:

2024-06-27 16:34:04.410 | INFO     | demo:run_decode:199 - Loop 86 user 31: 

Fatal Python error: Segmentation fault

Thread 0x00007f0ac5ffb700 (most recent call first):
  File "/usr/lib/python3.8/threading.py", line 306 in wait
  File "/usr/lib/python3.8/threading.py", line 558 in wait
  File "/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f0e2bd01740 (most recent call first):
  File "/usr/lib/python3.8/pathlib.py", line 704 in _format_parsed_parts
  File "/usr/lib/python3.8/pathlib.py", line 722 in __str__
  File "/tt-metal/ttnn/ttnn/operations/core.py", line 619 in load_tensor
  File "/tt-metal/ttnn/ttnn/decorators.py", line 782 in wrapper
  File "/tt-metal/ttnn/ttnn/operations/core.py", line 765 in as_tensor
  File "/tt-metal/ttnn/ttnn/decorators.py", line 782 in wrapper
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/tt/llama_model_optimized.py", line 247 in prepare_inputs
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/tt/llama_generation.py", line 79 in decode_forward
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/tt/llama_generation.py", line 66 in forward
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/demo/demo.py", line 176 in run_decode
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/demo/demo.py", line 56 in main
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/demo/demo.py", line 450 in test_LlamaModel_demo
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/tt-metal/python_env/bin/pytest", line 8 in <module>
Segmentation fault (core dumped)

At higher token positions DRAM OOM occured:

2024-06-27 19:02:51.022 | INFO     | __main__:run_decode:199 - Loop 1001

2024-06-27 19:02:51.127 | INFO     | __main__:run_decode:199 - Loop 1002

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
info:
Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
backtrace:
 --- tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata)
 --- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool)
 --- tt::tt_metal::Buffer::allocate()
 --- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x648958) [0x7fe17f4ad958]
 --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&)
 --- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
 --- /tt-metal/build/lib/libtt_eager.so(_ZN2tt8tt_metal9operation29generic_create_output_tensorsINS_10operations7primary6MatmulEEENS1_21program_output_helperIT_Xsr18has_create_programIS7_EE5valueEE4typeERKS7_RKNSt3__16vectorINS0_6TensorENSC_9allocatorISE_EEEENSC_8optionalINS0_8DataTypeEEENS0_6LayoutERKNSK_INS0_12MemoryConfigEEE+0x178) [0x7fe17f0b1928]
 --- tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&) const
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ee299) [0x7fe17f053299]
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::detail::run_device_operation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(std::__1::reference_wrapper<tt::tt_metal::CommandQueue>, tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&, unsigned char)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ed88f) [0x7fe17f05288f]
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ecf7d) [0x7fe17f051f7d]
 --- tt::tt_metal::operation::launch_op(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1eb994) [0x7fe17f050994]
 --- /tt-metal/build/lib/libtt_eager.so(+0x24b709) [0x7fe17f0b0709]
 --- /tt-metal/build/lib/libtt_eager.so(+0x516e39) [0x7fe17f37be39]
 --- /tt-metal/build/lib/libtt_eager.so(+0x517b6f) [0x7fe17f37cb6f]
 --- /tt-metal/build/lib/libtt_metal.so(+0x1579eb) [0x7fe17ec9b9eb]
 --- /tt-metal/build/lib/libtt_metal.so(+0x157c5b) [0x7fe17ec9bc5b]
 --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fe1eb72a609]
 --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fe1eb864353]

Aborted (core dumped)

Please complete the following environment information:

  • OS: Ubuntu 20.04
  • tt-metal commit: a053bc8
  • Machine: T3000 with 4x n300 in 2x4 Mesh
  • Firmware bundle: 80.8.12.0
  • tt-kmd: 1.28
@cglagovichTT
Copy link
Contributor

I was able to repro this segfault on sjc-snva-t3002.

Config:

  • Llama3
  • decode-only
  • greedy decode
image
2024-07-02 16:45:32.806 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_25_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1,│······
 32, 32]), dtype BFLOAT16, layout TILE                                                                                                                                                                                                │······
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │······
                                                                                                                                                                                                                                      │······
Thread 223 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │······
[Switching to Thread 0x7ffe82fd7700 (LWP 773289)]                                                                                                                                                                                     │······
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                         │······
(gdb) bt                                                                                                                                                                                                                              │······
#0  0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                     │······
#1  0x00007fff88875f17 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │······
#2  0x00007fff88873b82 in tt::tt_metal::EnqueueDeallocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Allocator&, unsigned int, tt::tt_metal::BufferType, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so       │······
#3  0x00007fff89071d70 in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda(tt::tt_metal::Device*)#1}, std::__1│······
::allocator<{lambda(tt::tt_metal::Device*)#1}>, void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*&&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                         │······
#4  0x00007fff8907200f in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda()#1}, std::__1::allocator<{lambda()│······
#1}>, void ()>::operator()() () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                               │······
#5  0x00007fff8881429b in tt::WorkExecutor::run_worker() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                   │······
#6  0x00007fff8881450b in void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (tt::WorkExecutor::*)(), tt::WorkEx│······
ecutor*> >(void*) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                          │······
#7  0x00007ffff7db5609 in start_thread (arg=<optimized out>) at pthread_create.c:477                                                                                                                                                  │······
#8  0x00007ffff7eef353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95                                                                                                                                                    │······
(gdb) list                                                                                                                                                                                                                            │······
1       <built-in>: No such file or directory.                                                                                                                                                                                        │······
(gdb)

@cglagovichTT
Copy link
Contributor

When I run with async disabled, I see a variety of errors.

Case 1:
image

2024-07-02 17:59:25.628 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_109_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·, 32, 128]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·Thread 218 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe857dc700 (LWP 820677)]                                                                                                                                                                                     │·
0x00007fff88833ff5 in tt::tt_metal::allocator::FreeList::update_left_aligned_allocated_block_connections(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, boost::local_shared_ptr<tt::tt_metal::allocator::FreeList:│·
:Block>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                   │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88833ff5 in tt::tt_metal::allocator::FreeList::update_left_aligned_allocated_block_connections(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, boost::local_shared_ptr<tt::tt_metal::allocator::FreeL│·
ist::Block>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                               │·#1  0x00007fff8883432f in tt::tt_metal::allocator::FreeList::allocate_slice_of_free_block(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, unsigned long, unsigned long) ()                                         │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#2  0x00007fff88834754 in tt::tt_metal::allocator::FreeList::allocate(unsigned long, bool, unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                  │·
#3  0x00007fff88837075 in tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so      │·#4  0x00007fff88838f17 in tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>) ()                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#5  0x00007fff8883905f in tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>) ()                                  │·
   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#6  0x00007fff8887398e in tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                       │·#7  0x00007fff888760c4 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·#8  0x00007fff88873a85 in tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                              │·
#9  0x00007fff88822c63 in tt::tt_metal::Buffer::allocate() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                 │·#10 0x00007fff888215d8 in tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool│·) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                          │·#11 0x00007fff8905c6a8 in std::__1::shared_ptr<tt::tt_metal::Buffer> std::__1::allocate_shared[abi:ue170006]<tt::tt_metal::Buffer, std::__1::allocator<tt::tt_metal::Buffer>, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt│·::tt_metal::BufferType const&, void>(std::__1::allocator<tt::tt_metal::Buffer> const&, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt::tt_metal::BufferType const&) ()                                                      │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#12 0x00007fff88fa6e4a in tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::│·__1::optional<tt::tt_metal::ShardSpecBuffer> const&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                       │·
#13 0x00007fff8906e84a in tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) ()                                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#14 0x00007fff88c37418 in tt::tt_metal::operation::generic_create_output_tensors<tt::operations::primary::Matmul> () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                          │·#15 0x00007fff88c2f1ba in tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&) const ()

@cglagovichTT
Copy link
Contributor

case 2: deallocate
image

2024-07-02 18:42:19.973 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_rot_mat_decode_129_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 32, │·
128, 128]), dtype BFLOAT16, layout TILE                                                                                                                                                                                               │·
2024-07-02 18:42:19.977 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_129_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·
, 32, 160]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·
Thread 219 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe86fdf700 (LWP 838272)]                                                                                                                                                                                     │·
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                         │·
(gdb) list                                                                                                                                                                                                                            │·
1       <built-in>: No such file or directory.                                                                                                                                                                                        │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                     │·
#1  0x00007fff88875f17 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·
#2  0x00007fff88873b82 in tt::tt_metal::EnqueueDeallocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Allocator&, unsigned int, tt::tt_metal::BufferType, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so       │·
#3  0x00007fff89071d70 in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda(tt::tt_metal::Device*)#1}, std::__1│·
::allocator<{lambda(tt::tt_metal::Device*)#1}>, void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*&&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                         │·
#4  0x00007fff8907200f in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda()#1}, std::__1::allocator<{lambda()│·
#1}>, void ()>::operator()() () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                               │·
#5  0x00007fff8881429b in tt::WorkExecutor::run_worker() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                   │·
#6  0x00007fff8881450b in void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (tt::WorkExecutor::*)(), tt::WorkEx│·
ecutor*> >(void*) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                          │·
#7  0x00007ffff7db5609 in start_thread (arg=<optimized out>) at pthread_create.c:477                                                                                                                                                  │·
#8  0x00007ffff7eef353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@cglagovichTT
Copy link
Contributor

Case 3: search_first
image

2024-07-02 18:57:17.853 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_149_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·
, 32, 160]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·
Thread 219 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe86fdf700 (LWP 857393)]                                                                                                                                                                                     │·
\0x00007fff88833d13 in tt::tt_metal::allocator::FreeList::search_first(unsigned long, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88833d13 in tt::tt_metal::allocator::FreeList::search_first(unsigned long, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                             │·
#1  0x00007fff888347a9 in tt::tt_metal::allocator::FreeList::allocate(unsigned long, bool, unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                  │·#2  0x00007fff888378d5 in tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so      │·
#3  0x00007fff88839777 in tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>) ()                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·
#4  0x00007fff888398bf in tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>) ()                                  │·
   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#5  0x00007fff888741ee in tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                       │·
#6  0x00007fff88876924 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·
#7  0x00007fff888742e5 in tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                              │·
#8  0x00007fff88822c73 in tt::tt_metal::Buffer::allocate() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                 │·
#9  0x00007fff888215e8 in tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool│·
) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                          │·
#10 0x00007fff8905c6a8 in std::__1::shared_ptr<tt::tt_metal::Buffer> std::__1::allocate_shared[abi:ue170006]<tt::tt_metal::Buffer, std::__1::allocator<tt::tt_metal::Buffer>, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt│·::tt_metal::BufferType const&, void>(std::__1::allocator<tt::tt_metal::Buffer> const&, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt::tt_metal::BufferType const&) ()                                                      │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#11 0x00007fff88fa6e4a in tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::│·
__1::optional<tt::tt_metal::ShardSpecBuffer> const&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                       │·#12 0x00007fff8906e84a in tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) ()                                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#13 0x00007fff88c37418 in tt::tt_metal::operation::generic_create_output_tensors<tt::operations::primary::Matmul> () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                          │·
#14 0x00007fff88c2f1ba in tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&) const ()

@cglagovichTT
Copy link
Contributor

I was not able to repro this segfault with async queues disabled.

In one of the deallocate segfaults in a worker thread, I see that the main thread is involved in sending a tensor to device.
image

This made me wonder if this code pattern is the culprit:

            rot_mats = ttnn.as_tensor(
                rot_mat,
                dtype=ttnn.bfloat16,
                layout=ttnn.TILE_LAYOUT,
                device=self.device_mesh,
                cache_file_name=cache_name(f"rot_mat_decode_{start_pos}"),
                memory_config=self.model_config["DRAM_MEMCFG"],
                mesh_mapper=ReplicateTensorToMesh(self.device_mesh),
            )
            rot_mats = ttnn.to_device(rot_mats, self.device_mesh)

The to_device should be unnecessary but not incorrect. I ran the test again with this call removed, but the segfaults did not go away.

cglagovichTT added a commit that referenced this issue Jul 3, 2024
@cglagovichTT
Copy link
Contributor

Repro instructions:

  • branch: cglagovich/9837
  • Build in release mode
  • set CPU frequency governor to ondemand (seems to help with repro)
gdb --args python -m pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]

Expected output:

--Type <RET> for more, q to quit, c to continue without paging--

Thread 224 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe827d6700 (LWP 369024)]
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so
(gdb) q

@tstescoTT
Copy link
Contributor Author

On a new T3000 machine to get the first run to 2816 tokens generated in a single sequence I got 6 crashes:

2024-07-03-Kuaishou-ttsmi
  • OS: Ubuntu 20.04
  • tt-kmd: 1.27.1
  • firmward bundle: 80.8.12.0
  • tt-metal commit: a053bc8

I did a soft reset tt-smi -r 0,1,2,3 after each crash and reran the first run script.

crash 1:

(python_env) user@66a27c372dce:~/tt-metal-llama3-70b/src$ python tt_metal_impl/demo/demo_llama3_first_run_4k.py
...
2024-07-03 11:26:09.820 | INFO     | __main__:run_decode:199 - Loop 5

free(): invalid pointer
Aborted (core dumped)

2:

(python_env) user@66a27c372dce:~/tt-metal-llama3-70b/src$ python tt_metal_impl/demo/demo_llama3_first_run_4k.py
...
2024-07-03 11:33:32.714 | INFO     | __main__:run_decode:199 - Loop 88

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
info:
Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
backtrace:
 --- tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata)
 --- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool)
 --- tt::tt_metal::Buffer::allocate()
 --- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x648958) [0x7f063a443958]
 --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::op
tional<tt::tt_metal::ShardSpecBuffer> const&)
 --- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
 --- /tt-metal/build/lib/libtt_eager.so(_ZN2tt8tt_metal9operation29generic_create_output_tensorsINS_10operations7primary6MatmulEEENS1_21program_output_helperIT_Xsr18has_create_programIS7_EE5valueEE4typeERKS7_RKNSt3__
16vectorINS0_6TensorENSC_9allocatorISE_EEEENSC_8optionalINS0_8DataTypeEEENS0_6LayoutERKNSK_INS0_12MemoryConfigEEE+0x178) [0x7f063a047928]
 --- tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&) const
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ee299) [0x7f0639fe9299]
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::detail::run_device_operation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Te
nsor>>>(std::__1::reference_wrapper<tt::tt_metal::CommandQueue>, tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt
::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std
::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::ope
ration::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::v
ector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::_
_1::optional<tt::tt_metal::Tensor>>> const&, unsigned char)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ed88f) [0x7f0639fe888f]
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ecf7d) [0x7f0639fe7f7d]
 --- tt::tt_metal::operation::launch_op(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::T
ensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>,
 std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocato
r<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tens
or>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1eb994) [0x7f0639fe6994]
 --- /tt-metal/build/lib/libtt_eager.so(+0x24b709) [0x7f063a046709]
 --- /tt-metal/build/lib/libtt_eager.so(+0x516e39) [0x7f063a311e39]
 --- /tt-metal/build/lib/libtt_eager.so(+0x517b6f) [0x7f063a312b6f]
 --- /tt-metal/build/lib/libtt_metal.so(+0x1579eb) [0x7f0639c319eb]
 --- /tt-metal/build/lib/libtt_metal.so(+0x157c5b) [0x7f0639c31c5b]
 --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f06a66c0609]
 --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f06a67fa353]

Aborted (core dumped)

crash 3 (same stack trace as above):

2024-07-03 11:40:37.964 | INFO     | __main__:run_decode:199 - Loop 127

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception

crash 4 (same stack trace as above):

2024-07-03 11:49:00.270 | INFO     | __main__:run_decode:199 - Loop 489

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception

crash 5:

2024-07-03 11:57:17.461 | INFO     | __main__:run_decode:199 - Loop 847

2024-07-03 11:57:17.573 | INFO     | __main__:run_decode:199 - Loop 848

Segmentation fault (core dumped)

crash 6 (hang)

2024-07-03 12:13:12.789 | INFO     | __main__:run_decode:199 - Loop 913

2024-07-03 12:13:12.903 | INFO     | __main__:run_decode:199 - Loop 914

^C^C^C^C^CTerminated

Rerunning after this crash got to 2816 tokens and gets to the known issue #9839. This completes the first run and generation for 2k context is relatively reliable.

tt-asaigal added a commit that referenced this issue Jul 3, 2024
  - This handles cases where a device tensor is reassigned to a host tensor
  - Exposed during model cache generation which uses the following pattern:
      device_tensor = device_tensor.cpu()
@tt-asaigal
Copy link
Contributor

Hey @tstescoTT would you mind running with this commit cherry-picked: 4558673. It resolved the segfault for me locally

tt-asaigal added a commit that referenced this issue Jul 4, 2024
  - This handles cases where a device tensor is reassigned to a host tensor
  - Exposed during model cache generation which uses the following pattern:
      device_tensor = device_tensor.cpu()
tt-asaigal added a commit that referenced this issue Jul 4, 2024
  - This handles cases where a device tensor is reassigned to a host tensor
  - Exposed during model cache generation which uses the following pattern:
      device_tensor = device_tensor.cpu()
@smehtaTT
Copy link

@tstescoTT - can you help repro and confirm?

@smehtaTT
Copy link

@mbahnasTT confirms tested - can be closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working llama2-70b P1
Projects
None yet
Development

No branches or pull requests

8 participants