[Bug Report] Write Barrier in DRAM Sharded MM Reader causing a hang #15279
Comments
Pushed a change to yugao/hang. With the barrier it passes now, because the number of destinations set in the mcast matches the actual number of cores sent to.
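Why a destination-count mismatch would show up as a hang at a write barrier can be sketched with a toy model. This is not the tt-metal implementation: ToyNoc and its counters are hypothetical names, and the only assumption is that a barrier waits until the acknowledgements received equal the acknowledgements the sender declared it expects.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a write barrier (hypothetical, NOT the tt-metal implementation).
struct ToyNoc {
    uint32_t acks_expected = 0;  // what the sender declared when issuing multicasts
    uint32_t acks_received = 0;  // one per core that actually received the write

    // Issue a multicast: the sender declares `declared_dests` destinations,
    // but only `actual_dests` cores are really targeted and acknowledge.
    void multicast(uint32_t declared_dests, uint32_t actual_dests) {
        assert(declared_dests > 0);
        acks_expected += declared_dests;
        acks_received += actual_dests;
    }

    // A real barrier would spin until this condition holds; the toy model
    // only reports whether the spin would ever terminate.
    bool barrier_would_complete() const {
        return acks_received == acks_expected;
    }
};
```

In this model, multicast(7, 7) leaves the barrier able to complete, while multicast(8, 7) leaves one acknowledgement outstanding forever, which is the shape of hang described in this issue.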
I'm seeing the same hang even with f3291fb. The difference is that I've got the noc_async_write_barrier at the end of the kernel, not in the middle, so it's possible the error is happening after the other barrier, or on a different codepath.
@jbaumanTT is the hang in the unit test or in llama? If it's the DRAM sharded unit test, could you please share your code changes so I can repro?
It's for llama (though it might affect the unit test as well). My change is 455961e.
@jbaumanTT I pushed a fix for the barrier error to yugao/hang: the number of cores set for one type of mcast sender was wrong (it should be reduced by 1). The tested kernel now passes, but the dispatcher is stuck at noc_async_atomic_barrier and the test hangs at "NABW". I confirmed that all the MM kernels now have barriers at the very end and that they get past them. Could you also try on your end to see whether it's a dispatcher issue or an interaction with MM? Thanks.
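The off-by-one described here can be illustrated with a small helper. This is a sketch under a stated assumption: a loopback multicast counts the sender as one of its destinations, while a plain multicast issued from inside the destination rectangle does not. The function mcast_num_dests and its parameters are hypothetical and are not taken from the kernel source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many destinations a sender should declare for a
// multicast over a rectangle containing `cores_in_rect` cores.
inline uint32_t mcast_num_dests(uint32_t cores_in_rect,
                                bool sender_in_rect,
                                bool loopback_src) {
    assert(cores_in_rect > 0);
    // Assumption: without loopback, a sender inside the rectangle does not
    // write to itself, so it must not be counted as a destination.
    if (sender_in_rect && !loopback_src) {
        return cores_in_rect - 1;
    }
    // With loopback, or when the sender sits outside the rectangle,
    // every core in the rectangle receives the write.
    return cores_in_rect;
}
```

Under that assumption, a sender inside an 8-core rectangle would declare 8 destinations with loopback and 7 without; declaring 8 in the second case is the kind of mismatch that leaves a barrier waiting for an acknowledgement that never arrives.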
That NABW hang in the dispatcher seems to be what we're seeing in #15018. Given that the barriers at the end of this kernel seem to be running correctly it looks like your new patch fixes this DRAM sharded barrier issue, and the barrier problem in this kernel was probably unrelated to the overall hang problem we're seeing.
@tt-asaigal the fix is in main, could we close if this issue is resolved?
Thanks @yugaoTT!
Describe the bug
Inserting a barrier inside matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_dram_sharded.cpp after a noc_async_write_multicast_loopback_src causes DRAM Sharded Matmul tests to hang consistently.
There is a possibility that this is causing LLAMA to hang (tracked here: #15018).
To Reproduce
Please check out asaigal/dram_sharded_reader_hang and run:
pytest -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py::test_matmul_in1_dram_sharded_with_mm_chain
The test should hang immediately. NOC should report that at least one of the worker NCRISC cores is stuck at waypoint BWW, which corresponds to the barrier issued in the reader kernel.