Rebooting one of the MI1008x nodes? #25

Closed
jaywonchung opened this issue Aug 7, 2024 · 1 comment

Comments

@jaywonchung

Hi,

I know this is a strange request, but we've been hitting a bug that affects only the MI100 GPUs when we try to read their cumulative energy counter with amd-smi. One of the suggestions from the amdsmi team was to reboot the machine: ROCm/amdsmi#38 (comment)

Is there any way we can test and see if the reboot works?

In case it helps, I just verified that the counter problem still exists on one of the MI1008x nodes. The job allocation ID was 61463.

Thank you.

@koomie
Collaborator

koomie commented Aug 14, 2024

t006-009 has been rebooted, so you can test this out. You should be able to request that host specifically by adding the following to your Slurm request:

-w t006-009
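
For reference, a minimal sketch of what a full interactive request pinned to that host might look like. Only `-w t006-009` comes from the comment above. The partition name `mi1008x`, the single-node count, and the 30-minute time limit are assumptions to adjust for your own allocation.

```shell
# Host named in the comment above.
NODE="t006-009"
# Assumption: partition name matching the node type; change if yours differs.
PARTITION="mi1008x"
# Assumption: one node for 30 minutes is enough to re-check the energy counter.
CMD="salloc -N 1 -p ${PARTITION} -w ${NODE} -t 00:30:00"

# Print the command so it can be reviewed before it is submitted.
echo "${CMD}"
```

Once the printed command looks right, run it directly on a login node. The same `-w t006-009` option can also go in a batch script as an `#SBATCH -w t006-009` line.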

@koomie koomie closed this as completed Aug 14, 2024