Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(all): implement slurm_ops charm library #35

Merged
merged 30 commits into from
Nov 18, 2024

Conversation

NucciTheBoss
Copy link
Member

@NucciTheBoss NucciTheBoss commented Nov 8, 2024

This PR implements the slurm_ops and is_container charm libraries to provide the Slurm operations logic to all the Slurm charms.

Key features are that there's now one standard interface for interacting with the Slurm services, and we have an easy way to drive munge, jwt tokens, prometheus, etc, without requiring duplication of logic between charms. slurm_ops also provides back-end implementations for both the Slurm deb and Slurm snap so that we can easily switch between the chosen format.

Related issues

Misc.

I came across a few issues that were fixed as part of the work to integrate slurm_ops into the charms:

  1. The user-supplied SlurmctldParameters was being overwritten by CHARM_MAINTAINED_SLURM_CONF_PARAMETERS when writing out the slurm.conf file.
  2. The user-supplied NHC configuration was not connected to anything. Even if a custom configuration was supplied by the user, the handler would just copy the template stored on the charm.
  3. Properly adjust ulimit rules for Slurm so that we can have more open files on the system.

…ipulation

These submodules are created underneath `utils` since `slurm_ops` is specific to Slurm
rather than including customizations required specifically by the slurmd operator. The `machine`
module is a wrapper around the `slurmd -C` command which retrieves compute node information required
by slurmctld, the `nhc` module is a wrapper around the NHC installation + configuration logic,
and `service` includes the systemd customizations required to modify ulimits + support
`juju_systemd_notices`.

Signed-off-by: Jason C. Nucciarone <[email protected]>
`slurm_ops` provides the neccessary methods to manage slurmd
without needing to carry a `slurmd_ops` module that copies methods
from several other custom ops managers.

Signed-off-by: Jason C. Nucciarone <[email protected]>
Signed-off-by: Jason C. Nucciarone <[email protected]>
Signed-off-by: Jason C. Nucciarone <[email protected]>
Signed-off-by: Jason C. Nucciarone <[email protected]>
`ConfigData.get(...)` from `ops` returns a looser type than what's allowed
by pyright if we're passing the result off to `nhc.generate_config(...)`, so
we use `typing.cast(str, ...)` to ensure that the type of the output is just str.

Other changes:

- Removed stale TODO comment from development code.

Signed-off-by: Jason C. Nucciarone <[email protected]>
The `rustc` compiler is required by the cryptography Python package
since all charm dependencies are pulled as source packages rather than
as pre-built wheels. The parts declarations will not be required when
`jwtctl` is implemented, but until then, the `rustc` compiler is required
to successfully pack the Slurm charms.

Signed-off-by: Jason C. Nucciarone <[email protected]>
@NucciTheBoss
Copy link
Member Author

Hmm... seems like the CI hit a timeout. Will try running locally to see what the issues are :shipit:

@NucciTheBoss NucciTheBoss added the enhancement New feature or request label Nov 8, 2024
@NucciTheBoss
Copy link
Member Author

Upon further investigation, it seems like the current issue with the CI is the time it takes to build the cryptography package using the rustc compiler. My Framework sounded like a jet-engine when building all four Slurm charms at the same time, so I'm expecting that we might need to swap cryptography back to pycryptodome, or we add an additional build step for the charms.

@jamesbeedy
Copy link
Contributor

This looks awesome! Nice work!

Installation would fail originally because lbnl-* would be the top-level
directory under `/tmp` so `subprocess.check_output(...)` could not find
the `autogen.sh` file we needed to build NHC on the node.

Signed-off-by: Jason C. Nucciarone <[email protected]>
Modifies `interface_slurmd` to not add an empty 'Nodes=' field to the
Partition info if no compute nodes are enlisted yet. `scontrol reconfigure`
will fail as Slurm cannot process an empty 'Nodes=' field.

Signed-off-by: Jason C. Nucciarone <[email protected]>
Ignoring an empty 'Nodes=' field in the Partition configuration by checking
the length of configured `partition_nodes` to be greater than 0, the cyclomatic
complexity of `interface_slurmd.Slurmd.get_new_nodes_and_nodes_and_partitions`
becomes 11 which cause the lint job to fail.

This method should eventually be reduced if possible, but setting the
max cyclomatic complexity to 15 works as an effective stop-gap.

Signed-off-by: Jason C. Nucciarone <[email protected]>
`pyright` doesn't like the dynamic attributes provided by the `slurmutils`
config models in the `slurmdbd` charm, so we'll need to enable type checking
over on that project.

For the `slurmctld` charm, we just needed to correct the hint for
`assemble_slurmctld_parameters` to be `dict[str, Any]` rather than
string which is returned for the slurm.conf editor.

Signed-off-by: Jason C. Nucciarone <[email protected]>
@NucciTheBoss NucciTheBoss marked this pull request as ready for review November 15, 2024 17:12
@NucciTheBoss
Copy link
Member Author

@jedel1043 only took ~45 minutes for the integration tests to run, but this PR is R4R 🥳

Still looking into charmcraftcache for how that will work with our monorepo...

Copy link
Collaborator

@jedel1043 jedel1043 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! I love the line diff; +1766 - 2422 = -656 less lines, which is always great.

charms/slurmdbd/src/constants.py Show resolved Hide resolved
charms/slurmd/src/utils/machine.py Show resolved Hide resolved
charms/slurmctld/src/charm.py Outdated Show resolved Hide resolved
charms/slurmctld/src/charm.py Show resolved Hide resolved
charms/slurmd/src/utils/nhc.py Outdated Show resolved Hide resolved
Prematurely called the debug message that MUNGE was restarted correctly
before actually restarting MUNGE on the machine.

Signed-off-by: Jason C. Nucciarone <[email protected]>
Changes:

* slurmctld: create gh issue for enabling Juju Secrets.
* slurmd: remove stale TODO comment from development work.

Signed-off-by: Jason C. Nucciarone <[email protected]>
Copy link
Contributor

@dsloanm dsloanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's everything addressed - LGTM 👍

@NucciTheBoss NucciTheBoss merged commit e3e0fbd into charmed-hpc:main Nov 18, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants