I just launched a 4-node experiment on mi1008x (t006-[009-010],t007-[009-010]) and found that it ran significantly slower than before (more than 10 times slower). I then ran the exact same experiment on a different set of 4 nodes (t004-007,t006-007,t008-[007,009]), and the speed was the same as before. I haven't experienced this issue before. I'm wondering if there's something wrong with the nodes in (t006-[009-010],t007-[009-010]). Thanks!
Hello @h4duan. Have you tried to reproduce the issue on the same nodes? I noticed that one of the GPUs (t006-009, ID 0) remained unused during the execution you mention, but I've been able to run successfully on that same GPU.