
[Bug Report] Write Barrier in DRAM Sharded MM Reader causing a hang #15279

Closed
tt-asaigal opened this issue Nov 21, 2024 · 8 comments
Labels: bug (Something isn't working), P0

@tt-asaigal (Contributor)

Describe the bug
Inserting a barrier inside matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_dram_sharded.cpp after a noc_async_write_multicast_loopback_src causes DRAM Sharded Matmul tests to hang consistently.

There is a possibility that this is causing LLAMA to hang (tracked here: #15018).
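
For orientation, here is a minimal sketch of the pattern being described: a loopback-source multicast followed by a write barrier. This is not the actual reader kernel; the buffer addresses, sizes, and multicast grid coordinates are placeholders, and the real kernel derives them from compile-time and runtime args.

```cpp
// Sketch only: shows the shape of the code path the hang was reproduced on,
// not the real reader_bmm_tile_layout_in0_sender_dram_sharded.cpp kernel.
void kernel_main() {
    uint32_t src_l1_addr = get_write_ptr(0);  // assumes CB 0 holds the in0 block
    uint64_t mcast_dst_addr = get_noc_multicast_addr(
        1 /*x_start*/, 1 /*y_start*/, 4 /*x_end*/, 1 /*y_end*/, src_l1_addr);

    // Multicast the block to the receiver grid and back to this core itself.
    // num_dests must match the number of cores that will actually acknowledge.
    noc_async_write_multicast_loopback_src(
        src_l1_addr, mcast_dst_addr, 2048 /*bytes*/, 4 /*num_dests*/);

    // The barrier added in the repro branch: NCRISC spins here (watcher
    // waypoint BWW) until every acknowledgement for the writes above arrives.
    noc_async_write_barrier();
}
```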

To Reproduce
Please check out asaigal/dram_sharded_reader_hang and run:
pytest -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py::test_matmul_in1_dram_sharded_with_mm_chain

The test should hang immediately. The watcher should report that at least one of the worker NCRISC cores is stuck at waypoint BWW, which corresponds to the barrier issued in the reader kernel.

@yugaoTT (Contributor) commented Nov 21, 2024

Pushed a change to yugao/hang. With the barrier it can pass now, because the number of destinations set in the mcast matches the actual number of cores it sends to.
The reason is that when receiving responses from the NOC, I had purposely set the number of cores to be less than the actual number of destinations (I was under the impression this could save some cycles).
Fixing this does not solve the hang, though, since it shouldn't affect the following ops (the NOC software counter is reset for each op).
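
For anyone following along, here is a rough mental model of why that destination count matters. This is a conceptual sketch only, not the actual firmware or dataflow_api implementation.

```cpp
// Conceptual model: each non-posted write/multicast bumps a software counter
// by the num_dests it was issued with, and noc_async_write_barrier spins until
// the hardware's acknowledgement count catches up to that counter.
struct NocWriteState {
    uint32_t acks_expected = 0;   // incremented by num_dests per write/mcast
    uint32_t acks_received = 0;   // incremented as destinations acknowledge
};

void on_mcast_issued(NocWriteState& s, uint32_t num_dests) {
    s.acks_expected += num_dests;   // must match the cores that really ack
}

void write_barrier(NocWriteState& s) {
    // Over-counting the destinations leaves this loop spinning forever (the
    // BWW waypoint); under-counting lets it exit early, and the stray ack can
    // then land while a later op (or the dispatcher) is doing its own counting.
    while (s.acks_received < s.acks_expected) { /* wait */ }
}
```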

@jbaumanTT (Contributor)

I'm seeing the same hang even with f3291fb. The difference is that I've got the noc_async_write_barrier at the end of the kernel, not in the middle, so it's possible the error is happening after the other barrier, or on a different code path.

@yugaoTT (Contributor) commented Nov 21, 2024

@jbaumanTT Is the hang in the unit test or LLAMA? If it's the DRAM sharded unit test, could you please share your code changes so I can repro?

@jbaumanTT (Contributor)

It's for LLAMA (though it might affect the unit test as well). My change is 455961e.

@yugaoTT (Contributor) commented Nov 22, 2024

@jbaumanTT I pushed a fix for the barrier error: the number of cores set for one type of mcast sender was wrong (it needed a -1).

However, while the tested kernel now passes, the dispatcher is stuck at noc_async_atomic_barrier and the test hangs at waypoint "NABW".

I confirm that all of the MM kernels now have their barriers at the very end, and they all get past those barriers.

Could you also try on your end and see whether it's a dispatcher issue or an interaction with the MM? Thanks.

I pushed the change to yugao/hang.
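
To make the off-by-one concrete, something along these lines is how I read the "needed a -1" fix. The variable names and grid size are assumptions, and the guess that the extra core is the sender itself is mine, not taken from the actual patch.

```cpp
// Illustrative only. The destination count handed to the multicast (and hence
// to the ack-expectation counter) has to equal the cores that will actually
// acknowledge; for the affected mcast-sender variant one core (assumed here to
// be the sender itself) must not be counted, so the grid size needs a -1.
constexpr uint32_t in0_mcast_grid_size = 8;                 // assumed value
uint32_t in0_mcast_num_dests = in0_mcast_grid_size - 1;     // exclude one core

// Passing in0_mcast_grid_size here instead would over-count by one, and the
// trailing noc_async_write_barrier would wait for an ack that never arrives.
```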

@jbaumanTT (Contributor)

That NABW hang in the dispatcher seems to be what we're seeing in #15018. Given that the barriers at the end of this kernel seem to be running correctly, it looks like your new patch fixes this DRAM sharded barrier issue, and the barrier problem in this kernel was probably unrelated to the overall hang we're seeing.

@yugaoTT (Contributor) commented Nov 26, 2024

@tt-asaigal the fix is in main; could we close this issue if it's resolved?

@tt-asaigal (Contributor, Author)

Thanks @yugaoTT!
