v1.1.0

@rezib rezib released this 07 May 20:07

Added

  • Integration with RacksDB to extract emulated cluster topology (#1).
  • Support for debian12 (Debian bookworm) in OS images sources YAML file.
  • Introduce fhpc_addresses, fhpc_nodes, fhpc_emulator_mode and fhpc_db extra variables. The first is a hash with containers as keys and the list of IP addresses as values. The second is also a hash with node tags as keys and the list of nodes assigned the tag as values. The third is a boolean set to true when the --slurm-emulator option is set on the firehpc command line. The fourth is the local absolute path to the RacksDB database.
  • Possibility to run commands with the Paramiko SSH library in addition to the ssh binary executable.
  • Add example RacksDB database.
  • Add possibility to deploy a users directory extracted from another existing cluster, so that multiple clusters can share the same user accounts.
  • Generate and manage groups tree internally. Groups definitions are exported to ansible with fhpc_groups extra variable and can be dumped with firehpc status command.
  • Support containers namespace to allow multiple users to start the same virtual clusters on the same host without conflict.
  • cli:
    • Support for tags to filter deployed configuration tasks.
    • Report cluster status in JSON format with --json option.
    • Add --slurm-emulator option to deploy and configure a cluster with emulated Slurm cluster nodes (only one admin node with up to 64k virtual compute nodes).
    • Add --users option on deploy command to extract users directory from another existing cluster.
    • Introduce fhpc-emulate-slurm-usage command to emulate random usage of Slurm cluster.
    • Add start and stop commands to respectively start and stop all containers of an emulated cluster.
  • conf:
    • Optional support of Rackslab development Deb and RPM repositories, disabled by default.
    • Introduce racksdb role to install RacksDB and deploy database content.
    • Introduce slurmweb role to install and setup Slurmweb, optional and disabled by default.
    • Support multiple Slurm accounts definitions with hierarchy and control of users membership.
    • Add tags on all roles.
    • Add variable for slurmrestd socket path in slurm role.
    • Support optional additional slurmdbd parameters.
    • Deploy SSH root private and public keys on admin.
    • Generate /etc/hosts with all cluster IP addresses and hostnames.
    • Add nodeset_fold and nodeset_expand Jinja2 filters.
    • Support Slurm emulation with fully virtual nodes (up to 64k).
    • Support optional secondary groups in LDAP directory.
    • Add possibility to deploy Redis server on admin host.
    • Use fhpc_groups for default slurm_accounts variable value and to define LDAP groups.
    • Use fhpc_db for default racksdb_database variable value and to define RacksDB database content.
    • Install bash-completion by default on all nodes with common role.
    • Install clustershell on all nodes by default with new clustershell role.
    • Introduce nginx role.
  • docs:
    • Mention conf command --db, --schema and --tags options in firehpc(1) manpage.
    • Mention deploy command --db and --schema options in firehpc(1) manpage.
    • Mention status command --json option in firehpc(1) manpage.
    • Mention new start and stop commands in firehpc(1) manpage.
    • Add manpage for fhpc-emulate-slurm-usage.
    • Mention conf and deploy commands --slurm-emulator option in firehpc(1) manpage.
    • Mention deploy command --users option in firehpc(1) manpage.
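
For illustration, the shapes described above for fhpc_addresses and fhpc_nodes could look like the following Python sketch. The container names, IP addresses and tags are hypothetical, as is the addresses_for_tag helper, which is not part of FireHPC; it only shows how the two hashes can be combined.

```python
# Hypothetical values: container names, addresses and tags are made up
# for illustration and do not come from a real FireHPC deployment.

# Hash with containers as keys and the list of IP addresses as values.
fhpc_addresses = {
    "admin": ["10.0.0.1", "fd00::1"],
    "cn1": ["10.0.0.2", "fd00::2"],
    "cn2": ["10.0.0.3", "fd00::3"],
}

# Hash with node tags as keys and the list of nodes assigned the tag as values.
fhpc_nodes = {
    "admin": ["admin"],
    "compute": ["cn1", "cn2"],
}


def addresses_for_tag(tag, nodes=fhpc_nodes, addresses=fhpc_addresses):
    """Return all IP addresses of the nodes assigned the given tag.

    Hypothetical helper, not a FireHPC function.
    """
    return [ip for node in nodes.get(tag, []) for ip in addresses[node]]


print(addresses_for_tag("compute"))
```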

Changed

  • Replaced notion of zone with cluster, both in CLI options and configuration variable names.
  • Removed extra directory from source tree. It used to contain ansible machinectl connection plugin as Git submodule. This dependency is now injected in FireHPC as a package supplementary source in packages built by Fatbuildr.
  • conf:
    • Declare SSH host keys valid for both containers FQDN and short hostname in system known hosts file.
    • Split ssh role in 3 steps: localkeys for local bootstrap, bootstrap to initialize files on containers with machinectl and main for normal operations with SSH (known_hosts, SSH root keys).
    • Replace hardcoded admin hosts by selection of first admin group member for LDAP server hostname and Slurm server.
    • Generate Slurm nodes and partitions based on RacksDB database content.
    • Split playbook by sections with hosts targets to avoid many skipped tasks.
  • docs: Update after zone→cluster rename in CLI options.

Fixed

  • Check OS images argument in CLI against values available in OS images YAML file instead of hard-coded argparse choices.
  • Storage service stop and removal.
  • Start storage service with container when cluster is started.
  • Retry SSH connections up to 3 times in case of failure.
  • Wait some time before starting the second container to finish container private network setup and prevent the following container from erasing everything before completion.
  • Handle RacksDB format and schema errors with correct error message.
  • Wait for both IPv4 and IPv6 addresses when retrieving container addresses, to avoid retrieving only IPv6 before IPv4 address is finally available.
  • Correctly handle and report DNS errors in SSH module.
  • conf:
    • Open slurmd spool directory permissions to all users for running batch jobs scripts.
    • Manage home directories ownership and permissions, in addition to some of their content.
    • Add missing common name in LDAP x509 TLS/SSL certificate.
    • Do not use cgroups with Slurm in emulator mode.
    • Force update of APT repositories metadata.
    • Install en_US.UTF-8 locale on Debian, as is done by default on RHEL.
    • Set systemd-networkd DHCP client identifier to mac on RHEL to avoid getting a different address than those obtained by NetworkManager at boot, which eventually results in IPv4 addresses in /etc/hosts being removed from network interfaces when initial leases reach their timeout.
  • docs: Grammatical error and typos in firehpc(1) manpage
  • lib: limit network devices names to 12 characters to avoid network zone name errors with systemd-nspawn.
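
The retry fix listed above (SSH connections retried up to 3 times) follows a common pattern, sketched below in Python. This is not FireHPC's actual code: run_with_retries, its parameters and the flaky_connect stand-in are hypothetical names used only to illustrate retrying an action a bounded number of times.

```python
import time


def run_with_retries(action, attempts=3, delay=0.0, retry_on=(OSError,)):
    """Call action() up to `attempts` times, re-raising the last error.

    Hypothetical helper: the name and signature are assumptions,
    not part of FireHPC.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as err:
            last_error = err
            # Only wait if another attempt remains.
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Stand-in for an SSH connection that fails twice before succeeding.
calls = []


def flaky_connect():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("connection refused")
    return "connected"


result = run_with_retries(flaky_connect)
print(result, "after", len(calls), "attempts")
```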