Removing new slurmd unit causes hook failed: "slurmd-relation-departed" in slurmctld due to invalid scontrol update #56
Labels
needs triage
Needs further investigation to determine cause and/or work required to implement fix/feature
Bug Description
In a cluster with `slurmctld` and `slurmd` charms deployed and integrated, adding and then removing a `slurmd` unit causes `slurmctld` to error with `hook failed: "slurmd-relation-departed"`.

Logs indicate the failure is due to an attempt to resume the departed node. Specifically, in `_on_write_slurm_conf()`, `self._resume_nodes(transitioning_nodes)` is called, which attempts the command `['scontrol', 'update', 'nodename=juju-8e1c9b-2', 'state=resume']`, where `juju-8e1c9b-2` is the departed node. This results in a `slurm_update error: Invalid node name specified`, causing the hook failure.

The hook failure does not occur if `juju run node-configured` is run on the unit prior to removal. In that case, the unit is successfully removed without issue.

To Reproduce
Environment
The latest edge of both `slurmctld` and `slurmd` charms running on an [email protected] base.

Relevant log output
Additional context
Presumably this is a bug in determining the list of `transitioning_nodes` in `slurmctld`'s `_on_write_slurm_conf()`.
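One possible direction for a fix, sketched under assumptions: drop departed nodes from the list before any `scontrol update ... state=resume` command is built. The helper names `nodes_safe_to_resume` and `resume_command`, the `known_nodes` argument, and the node name `juju-8e1c9b-1` are hypothetical and not taken from the charm code. Only the command shape comes from the log above.

```python
from typing import Iterable, List


def nodes_safe_to_resume(
    transitioning_nodes: Iterable[str], known_nodes: Iterable[str]
) -> List[str]:
    """Return only the transitioning nodes that the controller still knows about.

    Hypothetical helper: `known_nodes` stands in for whatever the charm treats
    as the current set of configured node names after the departure is handled.
    """
    known = set(known_nodes)
    return [node for node in transitioning_nodes if node in known]


def resume_command(node: str) -> List[str]:
    """Build the same argument list that appears in the failing log line."""
    return ["scontrol", "update", f"nodename={node}", "state=resume"]


# juju-8e1c9b-2 has departed, so it is absent from the known nodes.
transitioning = ["juju-8e1c9b-1", "juju-8e1c9b-2"]
known = ["juju-8e1c9b-1"]

for node in nodes_safe_to_resume(transitioning, known):
    # Only juju-8e1c9b-1 gets this far; no command is built for the departed node.
    print(resume_command(node))
```

Whether the filtering belongs where `transitioning_nodes` is computed or inside `_resume_nodes()` is a design choice for the maintainers. The sketch only illustrates that a departed node name should never reach `scontrol`.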