
resume action does not work due to 'resume' not being a valid state for a node. #40

Closed
NucciTheBoss opened this issue Nov 18, 2024 · 0 comments · Fixed by #34
Labels
bug Something isn't working

Comments

@NucciTheBoss
Member

Bug Description

The resume action on slurmctld/leader fails because "resume" is not a valid node state. Per the slurm.conf man page, the valid states for compute nodes are:

State  State of the node with respect to the initiation of user jobs.   Acceptable  values
              are CLOUD, DOWN, DRAIN, FAIL, FAILING, FUTURE and UNKNOWN.  Node states of BUSY and
              IDLE should not be specified in the node configuration, but set the node  state  to
              UNKNOWN  instead.   Setting the node state to UNKNOWN will result in the node state
              being set to BUSY, IDLE or other appropriate  state  based  upon  recovered  system
              state information.  The default value is UNKNOWN.  Also see the DownNodes parameter
              below.

              CLOUD     Indicates the node exists in  the  cloud.   Its  initial  state  will  be
                        treated  as  powered  down.  The node will be available for use after its
                        state is recovered from Slurm's state save  file  or  the  slurmd  daemon
                        starts on the compute node.

              DOWN      Indicates the node failed and is unavailable to be allocated work.

              DRAIN     Indicates the node is unavailable to be allocated work.

              FAIL      Indicates the node is expected to fail soon, has no jobs allocated to it,
                        and will not be allocated to any new jobs.

              FAILING   Indicates the node is expected  to  fail  soon,  has  one  or  more  jobs
                        allocated to it, but will not be allocated to any new jobs.

              FUTURE    Indicates  the node is defined for future use and need not exist when the
                        Slurm daemons are started. These nodes can  be  made  available  for  use
                        simply  by updating the node state using the scontrol command rather than
                        restarting the slurmctld daemon. After these nodes  are  made  available,
                        change  their  State  in  the slurm.conf file. Until these nodes are made
                        available, they will not be seen using any Slurm commands nor will any
                        attempt be made to contact them.

                        Dynamic Future Nodes
                               A  slurmd  started  with  -F[<feature>]  will be associated with a
                               FUTURE node that matches the same configuration  (sockets,  cores,
                               threads)  as  reported  by  slurmd  -C.  The  node's  NodeAddr and
                               NodeHostname will automatically be retrieved from the  slurmd  and
                               will  be cleared when set back to the FUTURE state. Dynamic FUTURE
                               nodes retain non-FUTURE state on restart. Use scontrol to put node
                               back into FUTURE state.

              UNKNOWN   Indicates  the  node's state is undefined but will be established (set to
                        BUSY or IDLE) when the slurmd daemon on that node registers.  UNKNOWN  is
                        the default state.
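The acceptable slurm.conf states above can be captured as a small validation helper, which makes the bug obvious: "resume" is not in the set. (A hypothetical sketch for illustration; this helper is not part of the operator.)

```python
# States accepted by slurm.conf's State= parameter, per the man-page
# excerpt above (BUSY and IDLE are excluded by the same excerpt).
CONFIGURABLE_STATES = frozenset(
    {"CLOUD", "DOWN", "DRAIN", "FAIL", "FAILING", "FUTURE", "UNKNOWN"}
)


def is_valid_config_state(state: str) -> bool:
    """Check whether a state may appear in a node's slurm.conf definition."""
    return state.upper() in CONFIGURABLE_STATES
```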

Therefore, instead of trying to set the node state to "resume" in the action handler:

cmd = f"scontrol update nodename={nodes} state=resume"

we should set it to "idle":

cmd = f"scontrol update nodename={nodes} state=idle"
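A minimal sketch of the corrected handler logic (the function name `build_resume_command` and the use of `shlex` are assumptions for illustration, not the operator's actual code):

```python
import shlex


def build_resume_command(nodes: str) -> list[str]:
    """Build the scontrol invocation that returns nodes to service.

    Uses state=idle rather than state=resume, since "resume" is not a
    valid node state per the slurm.conf excerpt above.
    """
    return shlex.split(f"scontrol update nodename={nodes} state=idle")
```

The resulting argument list can then be passed to `subprocess.run(..., check=True)` so a non-zero exit from scontrol surfaces as an action failure instead of the empty `b''` error seen in the log below.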

To Reproduce

  1. tox run -e integration -- --keep-models
  2. Wait a few minutes for the cluster to deploy...
  3. juju run slurmctld/leader resume nodename="juju-656d75-[15-16]"
  4. See error...

Environment

Latest version of slurmctld-operator

Relevant log output

Running operation 17 with 1 task
  - task 18 on unit-slurmctld-3

Waiting for task 18...
17:53:19 Resuming juju-656d75-[15-16].

Action id 18 failed: Error resuming juju-656d75-[15-16]: b''

slurm_update error: Invalid node state specified

Additional context

Originally caught, reported, and fixed by @matheushent

@NucciTheBoss NucciTheBoss added the bug Something isn't working label Nov 18, 2024
@NucciTheBoss NucciTheBoss linked a pull request Nov 18, 2024 that will close this issue