
[Bug Report] Write Barrier in DRAM Sharded MM Reader causing a hang #15279

Closed
tt-asaigal opened this issue Nov 21, 2024 · 8 comments
Labels: bug (Something isn't working), P0

@tt-asaigal (Contributor)

Describe the bug
Inserting a barrier inside matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_dram_sharded.cpp after a noc_async_write_multicast_loopback_src causes DRAM Sharded Matmul tests to hang consistently.

There is a possibility that this is causing LLAMA to hang (tracked here: #15018).
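
For orientation, here is a minimal sketch of the pattern being described: a loopback-source multicast followed by a write barrier. This is not the actual reader kernel; the buffer addresses, sizes, and multicast grid coordinates are placeholders, and the real kernel derives them from compile-time and runtime args.

```cpp
// Sketch only: shows the shape of the code path the hang was reproduced on,
// not the real reader_bmm_tile_layout_in0_sender_dram_sharded.cpp kernel.
void kernel_main() {
    uint32_t src_l1_addr = get_write_ptr(0);  // assumes CB 0 holds the in0 block
    uint64_t mcast_dst_addr = get_noc_multicast_addr(
        1 /*x_start*/, 1 /*y_start*/, 4 /*x_end*/, 1 /*y_end*/, src_l1_addr);

    // Multicast the block to the receiver grid and back to this core itself.
    // num_dests must match the number of cores that will actually acknowledge.
    noc_async_write_multicast_loopback_src(
        src_l1_addr, mcast_dst_addr, 2048 /*bytes*/, 4 /*num_dests*/);

    // The barrier added in the repro branch: NCRISC spins here (watcher
    // waypoint BWW) until every acknowledgement for the writes above arrives.
    noc_async_write_barrier();
}
```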

To Reproduce
Please check out asaigal/dram_sharded_reader_hang and run:
pytest -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py::test_matmul_in1_dram_sharded_with_mm_chain

The test should hang immediately. The watcher should report that at least one of the worker NCRISC cores is stuck at waypoint BWW, which corresponds to the barrier issued in the reader kernel.

@yugaoTT (Contributor) commented Nov 21, 2024

Pushed a change to yugao/hang. With the barrier it can pass now, because the number of destinations set in the mcast matches the actual number of cores it sends to.
The reason is that when receiving responses from the NOC, I had purposely set the number of cores to be less than the actual number of destinations (I was under the impression this could save some cycles).
Fixing this does not solve the hang, though, since it shouldn't affect the following ops (the NOC software counter is reset for each op).
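
For anyone following along, here is a rough mental model of why that destination count matters. This is a conceptual sketch only, not the actual firmware or dataflow_api implementation.

```cpp
// Conceptual model: each non-posted write/multicast bumps a software counter
// by the num_dests it was issued with, and noc_async_write_barrier spins until
// the hardware's acknowledgement count catches up to that counter.
struct NocWriteState {
    uint32_t acks_expected = 0;   // incremented by num_dests per write/mcast
    uint32_t acks_received = 0;   // incremented as destinations acknowledge
};

void on_mcast_issued(NocWriteState& s, uint32_t num_dests) {
    s.acks_expected += num_dests;   // must match the cores that really ack
}

void write_barrier(NocWriteState& s) {
    // Over-counting the destinations leaves this loop spinning forever (the
    // BWW waypoint); under-counting lets it exit early, and the stray ack can
    // then land while a later op (or the dispatcher) is doing its own counting.
    while (s.acks_received < s.acks_expected) { /* wait */ }
}
```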

@jbaumanTT (Contributor)

I'm seeing the same hang even with f3291fb. The difference is that I've got the noc_async_write_barrier at the end of the kernel, not in the middle, so it's possible the error is happening after the other barrier, or on a different code path.

@yugaoTT (Contributor) commented Nov 21, 2024

@jbaumanTT Is the hang in the unit test or LLAMA? If it's the DRAM sharded unit test, could you please share your code changes so I can repro?

@jbaumanTT (Contributor)

It's for LLAMA (though it might affect the unit test as well). My change is 455961e.

@yugaoTT (Contributor) commented Nov 22, 2024

@jbaumanTT I pushed a fix for the barrier error: the number of cores set for one type of mcast sender was wrong (it needed a -1).

However, while the tested kernel now passes, the dispatcher is stuck at noc_async_atomic_barrier and the test hangs at waypoint "NABW".

I confirm that all of the MM kernels now have their barriers at the very end, and they all get past those barriers.

Could you also try on your end and see whether it's a dispatcher issue or an interaction with the MM? Thanks.

I pushed the change to yugao/hang.
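
To make the off-by-one concrete, something along these lines is how I read the "needed a -1" fix. The variable names and grid size are assumptions, and the guess that the extra core is the sender itself is mine, not taken from the actual patch.

```cpp
// Illustrative only. The destination count handed to the multicast (and hence
// to the ack-expectation counter) has to equal the cores that will actually
// acknowledge; for the affected mcast-sender variant one core (assumed here to
// be the sender itself) must not be counted, so the grid size needs a -1.
constexpr uint32_t in0_mcast_grid_size = 8;                 // assumed value
uint32_t in0_mcast_num_dests = in0_mcast_grid_size - 1;     // exclude one core

// Passing in0_mcast_grid_size here instead would over-count by one, and the
// trailing noc_async_write_barrier would wait for an ack that never arrives.
```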

@jbaumanTT (Contributor)

That NABW hang in the dispatcher seems to be what we're seeing in #15018. Given that the barriers at the end of this kernel seem to be running correctly, it looks like your new patch fixes this DRAM sharded barrier issue, and the barrier problem in this kernel was probably unrelated to the overall hang we're seeing.

@yugaoTT (Contributor) commented Nov 26, 2024

@tt-asaigal the fix is in main; could we close this issue if it's resolved?

@tt-asaigal (Contributor, Author)

Thanks @yugaoTT!
