
Removing new slurmd unit causes hook failed: "slurmd-relation-departed" in slurmctld due to invalid scontrol update #56

Open
dsloanm opened this issue Dec 19, 2024 · 1 comment
Labels: needs triage (Needs further investigation to determine cause and/or work required to implement fix/feature)

dsloanm (Contributor) commented Dec 19, 2024

Bug Description

In a cluster with slurmctld and slurmd charms deployed and integrated, adding and removing a slurmd unit causes slurmctld to error with hook failed: "slurmd-relation-departed".

Logs indicate the failure is due to an attempt to resume the departed node. Specifically, in _on_write_slurm_conf(), self._resume_nodes(transitioning_nodes) is called, which attempts the command ['scontrol', 'update', 'nodename=juju-8e1c9b-2', 'state=resume'], where juju-8e1c9b-2 is the departed node. The command fails with slurm_update error: Invalid node name specified, causing the hook failure.

The hook failure does not occur if the node-configured action is run on the unit (via juju run) prior to removal. In that case, the unit is removed without issue.
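For reference, the failing call can be reproduced outside the charm with a short standalone Python sketch (not charm code; it assumes scontrol is on PATH on the slurmctld host and reuses the departed node hostname from the logs below):

# Standalone sketch, not charm code: re-issue the same scontrol command that
# _resume_nodes() runs, targeting a node that has already departed.
import subprocess

result = subprocess.run(
    ["scontrol", "update", "nodename=juju-8e1c9b-2", "state=resume"],
    capture_output=True,
    text=True,
)
# Once the node is no longer in slurm.conf, scontrol exits non-zero and prints:
#   slurm_update error: Invalid node name specified
print(result.returncode)
print(result.stderr)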

To Reproduce

$ juju add-model charmed-hpc
Added 'charmed-hpc' model on localhost/localhost with credential 'localhost' for user 'admin'
$ juju deploy slurmctld --channel latest/edge --base [email protected]
Deployed "slurmctld" from charm-hub charm "slurmctld", revision 83 in channel latest/edge on [email protected]/stable
$ juju deploy slurmd --channel latest/edge --base [email protected]
Deployed "slurmd" from charm-hub charm "slurmd", revision 104 in channel latest/edge on [email protected]/stable
$ juju add-unit slurmd
$ juju integrate slurmctld:slurmd slurmd:slurmctld

# Wait for deployments to finish

$ juju remove-unit slurmd/1 --no-prompt
will remove unit slurmd/1
$ juju status
Model        Controller      Cloud/Region         Version  SLA          Timestamp
charmed-hpc  lxd-controller  localhost/localhost  3.5.4    unsupported  18:14:04Z

App        Version          Status  Scale  Charm      Channel      Rev  Exposed  Message
slurmctld  23.11.4-1.2u...  error       1  slurmctld  latest/edge   83  no       hook failed: "slurmd-relation-departed"
slurmd     23.11.4-1.2u...  active      1  slurmd     latest/edge  104  no       

Unit          Workload  Agent  Machine  Public address  Ports  Message
slurmctld/0*  error     idle   0        10.210.110.59          hook failed: "slurmd-relation-departed"
slurmd/0*     active    idle   1        10.210.110.2           

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.210.110.59  juju-8e1c9b-0  [email protected]      Running
1        started  10.210.110.2   juju-8e1c9b-1  [email protected]      Running

Environment

The latest edge of both slurmctld and slurmd charms running on an [email protected] base.

Relevant log output

unit-slurmd-1: 18:13:49 INFO juju.worker.uniter.operation ran "slurmctld-relation-departed" hook (via hook dispatching script: dispatch)
unit-slurmd-1: 18:13:50 INFO juju.worker.uniter.operation ran "slurmctld-relation-broken" hook (via hook dispatching script: dispatch)
unit-slurmctld-0: 18:13:51 ERROR unit.slurmctld/0.juju-log slurmd:1: command ['scontrol', 'update', 'nodename=juju-8e1c9b-2', 'state=resume'] failed with message slurm_update error: Invalid node name specified

unit-slurmd-1: 18:13:51 INFO unit.slurmd/1.juju-log Running legacy hooks/service-slurmd-stopped.
unit-slurmctld-0: 18:13:51 ERROR unit.slurmctld/0.juju-log slurmd:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 443, in <module>
    main.main(SlurmctldCharm)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/__init__.py", line 348, in main
    return _legacy_main.main(
           ^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/main.py", line 45, in main
    return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 543, in main
    manager.run()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 529, in run
    self._emit()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 518, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name, self._juju_context)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 134, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/src/interface_slurmd.py", line 137, in _on_relation_departed
    self.on.slurmd_departed.emit()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 271, in _on_write_slurm_conf
    self._resume_nodes(transitioning_nodes)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 394, in _resume_nodes
    self._slurmctld.scontrol("update", f"nodename={','.join(nodelist)}", "state=resume")
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 911, in scontrol
    return _call("scontrol", *args).stdout
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 141, in _call
    raise SlurmOpsError(f"command {cmd} failed. stderr:\n{result.stderr}")
charms.hpc_libs.v0.slurm_ops.SlurmOpsError: command ['scontrol', 'update', 'nodename=juju-8e1c9b-2', 'state=resume'] failed. stderr:
slurm_update error: Invalid node name specified

unit-slurmctld-0: 18:13:51 ERROR juju.worker.uniter.operation hook "slurmd-relation-departed" (via hook dispatching script: dispatch) failed: exit status 1
unit-slurmctld-0: 18:13:51 INFO juju.worker.uniter awaiting error resolution for "relation-departed" hook

Additional context

Presumably this is a bug in how the list of transitioning_nodes is determined in slurmctld's _on_write_slurm_conf().
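As a rough illustration of one possible guard (the helper below is hypothetical, not the charm's actual API), the departed node could be filtered out of transitioning_nodes before the resume call is issued:

# Hypothetical sketch only: nodes_safe_to_resume() is not part of the charm.
# The idea is to drop any transitioning node that is no longer present in the
# freshly written slurm.conf, so a departed unit never reaches
# "scontrol update nodename=... state=resume".
def nodes_safe_to_resume(transitioning_nodes, configured_nodes):
    """Return only the transitioning nodes that slurmctld still knows about."""
    known = set(configured_nodes)
    return [node for node in transitioning_nodes if node in known]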

NucciTheBoss added the needs triage label on Dec 20, 2024
NucciTheBoss (Member) commented

Tagging as needs triage since this issue will likely require a deeper look into how we're managing node state transition during the cluster Day 0 to Day N lifecycle.
