Segfault in Blackhole didt stress workload #12608

Closed
abhullar-tt opened this issue Sep 12, 2024 · 5 comments
@abhullar-tt
Contributor

abhullar-tt commented Sep 12, 2024

On branch abhullar/didt-mm, syseng reports a segfault when the iteration count is bumped to 50k for:
pytest models/experimental/falcon_7b/tests/test_reproduce_hang_matmul.py -k ff1-hang

Ran this on yyzo-bh-05 and yyzo-bh-06 for 100k iterations multiple times without any segfault.

@pavlepopovic
Contributor

While doing the Wormhole didt testing, we also had segfaults, which stopped once we rebased onto a newer version of main. This happened ~1 month ago, and we haven't seen them since.
The branch mentioned above is quite old, so it might be worth rebasing it onto the latest main.

@abhullar-tt
Contributor Author

After rebasing to main (abhullar/didt-mm-rebased), the test hangs after 1500 iterations.

@abhullar-tt
Contributor Author

abhullar-tt commented Sep 12, 2024

After rebasing to main (abhullar/didt-mm-rebased), the test hangs after 1500 iterations, even with subblocks set to 1.

Adding noc_async_writes_flushed removes the 1500-iteration hang. The following comment was in the reader kernels:

// Note: no need for write barrier, since these two multicasts are done on the same noc id, same vc,
// same cmd_buf Also, this only works because we are setting VCs statically (using NOC_CMD_STATIC_VC).
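
To illustrate the ordering change described above, here is a minimal host-side sketch. It assumes the flush was placed between the two multicasts that the kernel comment refers to; the thread does not say exactly where it was added. The `stub_*` functions and `reader_iteration` are hypothetical stand-ins that only record call order. They are not the device API and do not reproduce the hang.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the device-side calls. They only record the
// order in which they were invoked; nothing here touches hardware.
static std::vector<std::string> call_log;

void stub_first_multicast() { call_log.push_back("first_multicast"); }
void stub_writes_flushed() { call_log.push_back("writes_flushed"); }
void stub_second_multicast() { call_log.push_back("second_multicast"); }

// One reader iteration. With flush_between == false, the second multicast
// is issued right after the first (the original assumption in the kernel
// comment). With flush_between == true, a flush is issued in between
// (assumed placement of the added noc_async_writes_flushed call).
std::vector<std::string> reader_iteration(bool flush_between) {
    call_log.clear();
    stub_first_multicast();
    if (flush_between) {
        stub_writes_flushed();
    }
    stub_second_multicast();
    return call_log;
}
```

Calling `reader_iteration(true)` yields the three-step sequence with the flush in the middle, while `reader_iteration(false)` yields only the two multicasts back to back.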

@abhullar-tt
Contributor Author

Pushed an update to abhullar/didt-mm-rebased; the same test now runs successfully for 100k iterations, repeated multiple times.

@abhullar-tt
Contributor Author

Closing this issue because syseng confirmed that the workload is now passing for 100k iterations after some changes to voltage regulator configurations were reverted and then re-applied.


2 participants