Description

This module creates a Slurm controller node using the SchedMD/slurm-gcp slurm_controller_instance and slurm_instance_template modules.

More information about Slurm On GCP can be found at the project's GitHub page and in the Slurm on Google Cloud User Guide.

The user guide provides detailed instructions on customizing and enhancing the Slurm on GCP cluster as well as recommendations on configuring the controller for optimal performance at different scales.

WARNING: The variables enable_reconfigure, enable_cleanup_compute and enable_cleanup_subscriptions, if set to true, require additional dependencies to be installed on the system running terraform apply. Python3 (>=3.6.0, <4.0.0) must be installed along with the pip packages listed in the requirements.txt file of SchedMD/slurm-gcp.

Example

- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - homefs
  - compute_partition
  settings:
    machine_type: c2-standard-8

This creates a controller node with the following attributes:

  • connected to the primary subnetwork of network1
  • the filesystem with the ID homefs (defined elsewhere in the blueprint) mounted
  • one partition with the ID compute_partition (defined elsewhere in the blueprint)
  • machine type upgraded from the default c2-standard-4 to c2-standard-8

For a complete example using this module, see slurm-gcp-v5-cluster.yaml.
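
As a variation on the example above, the sketch below keeps the same use list and adds a few settings taken from the Inputs table of this module: a public IP for the controller and a startup script with a longer timeout. The script body and the timeout value are illustrative assumptions, not recommendations.

```yaml
# Sketch only: values are illustrative, not recommended defaults.
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - homefs
  - compute_partition
  settings:
    machine_type: c2-standard-8
    # false: the controller gets a random public IP
    # (ignored if access_config is set).
    disable_controller_public_ips: false
    # Hypothetical script body; replace with your own setup steps.
    controller_startup_script: |
      #!/bin/bash
      echo "controller setup complete" > /tmp/controller_setup.log
    # Seconds allowed for the script; 0 disables the timeout.
    controller_startup_scripts_timeout: 600
```

As in the original example, network1, homefs and compute_partition are assumed to be defined elsewhere in the blueprint.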

Live Cluster Reconfiguration (enable_reconfigure)

The schedmd-slurm-gcp-v5-controller module supports reconfiguring partitions and the Slurm configuration in a running cluster. This option is activated through the enable_reconfigure setting:

- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  settings:
    enable_reconfigure: true

This option has some additional requirements:

  • The Pub/Sub API must be activated in the target project: gcloud services enable pubsub.googleapis.com --project "<<PROJECT_ID>>"
  • The authenticated user in the local development environment (or where terraform apply is called) must have the Pub/Sub Admin (roles/pubsub.admin) IAM role.
  • Python and some Python packages must be installed with pip in the local development environment that deploys the cluster. For more information, see the warning in the description of this module.
  • The project in your gcloud config must match the project the cluster is being deployed to, due to a known issue with the reconfigure scripts. To set your default config project, run the following command: gcloud config set core/project "<<PROJECT_ID>>"
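
The warning in the description notes that enable_reconfigure, enable_cleanup_compute and enable_cleanup_subscriptions share the same Python dependencies. A minimal sketch that turns on all three at once might look like the following. Enabling them together is an assumption about a typical setup, not something the module requires, and the use list is a placeholder for modules defined elsewhere in the blueprint.

```yaml
# Sketch only: assumes Python and the slurm-gcp pip requirements are
# already installed where `terraform apply` runs.
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - compute_partition
  settings:
    # Reconfigure a live cluster when the Slurm configuration changes.
    enable_reconfigure: true
    # Remove compute nodes and resource policies on destroy.
    enable_cleanup_compute: true
    # Remove Pub/Sub subscriptions on destroy.
    enable_cleanup_subscriptions: true
```

Per the Inputs table, toggling these settings on a running cluster can destroy deployed compute nodes and requeue their jobs, so it is safer to choose them before the first deployment.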

Support

The HPC Toolkit team maintains the wrapper around the slurm-on-gcp terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.

License

Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

Name Version
terraform >= 0.14.0
google >= 3.83

Providers

Name Version
google >= 3.83

Modules

Name Source Version
slurm_controller_instance github.com/SchedMD/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_controller_instance 5.3.0
slurm_controller_template github.com/SchedMD/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template 5.3.0

Resources

Name Type
google_compute_default_service_account.default data source

Inputs

Name Description Type Default Required
access_config Access configurations, i.e. IPs via which the VM instance can be accessed via the Internet.
list(object({
nat_ip = string
network_tier = string
}))
[] no
additional_disks List of maps of disks.
list(object({
disk_name = string
device_name = string
disk_type = string
disk_size_gb = number
disk_labels = map(string)
auto_delete = bool
boot = bool
}))
[] no
can_ip_forward Enable IP forwarding, for NAT instances for example. bool false no
cgroup_conf_tpl Slurm cgroup.conf template file path. string null no
cloud_parameters cloud.conf options.
object({
resume_rate = number
resume_timeout = number
suspend_rate = number
suspend_timeout = number
})
{
"resume_rate": 0,
"resume_timeout": 300,
"suspend_rate": 0,
"suspend_timeout": 300
}
no
cloudsql Use this database instead of the one on the controller.
server_ip : Address of the database server.
user : The user to access the database as.
password : The password, given the user, to access the given database. (sensitive)
db_name : The database to access.
object({
server_ip = string
user = string
password = string # sensitive
db_name = string
})
null no
compute_startup_script Startup script used by the compute VMs. string "" no
compute_startup_scripts_timeout The timeout (seconds) applied to the compute_startup_script. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
controller_startup_script Startup script used by the controller VM. string "" no
controller_startup_scripts_timeout The timeout (seconds) applied to the controller_startup_script. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
deployment_name Name of the deployment. string n/a yes
disable_controller_public_ips If set to false, the controller will have a random public IP assigned to it. Ignored if access_config is set. bool true no
disable_default_mounts Disable default global network storage from the controller
* /usr/local/etc/slurm
* /etc/munge
* /home
* /apps
Warning: If these are disabled, the slurm etc and munge dirs must be added
manually, or some other mechanism must be used to synchronize the slurm conf
files and the munge key across the cluster.
bool false no
disable_smt Disables Simultaneous Multi-Threading (SMT) on instance. bool true no
disk_auto_delete Whether or not the boot disk should be auto-deleted. bool true no
disk_size_gb Boot disk size in GB. number 50 no
disk_type Boot disk type. Must be one of pd-ssd, local-ssd, or pd-standard. string "pd-ssd" no
enable_bigquery_load Enable loading of cluster job usage into BigQuery. bool false no
enable_cleanup_compute Enables automatic cleanup of compute nodes and resource policies (e.g.
placement groups) managed by this module, when cluster is destroyed.

NOTE: Requires Python and pip packages listed at the following link:
https://github.com/SchedMD/slurm-gcp/blob/3979e81fc5e4f021b5533a23baa474490f4f3614/scripts/requirements.txt

WARNING: Toggling this may impact the running workload. Deployed compute nodes
may be destroyed and their jobs will be requeued.
bool false no
enable_cleanup_subscriptions Enables automatic cleanup of pub/sub subscriptions managed by this module, when
cluster is destroyed.

NOTE: Requires Python and pip packages listed at the following link:
https://github.com/SchedMD/slurm-gcp/blob/3979e81fc5e4f021b5533a23baa474490f4f3614/scripts/requirements.txt

WARNING: Toggling this may temporarily impact var.enable_reconfigure behavior.
bool false no
enable_confidential_vm Enable the Confidential VM configuration. Note: the instance image must support this option. bool false no
enable_devel Enables development mode. Not for production use. bool false no
enable_oslogin Enables Google Cloud os-login for user login and authentication for VMs.
See https://cloud.google.com/compute/docs/oslogin
bool true no
enable_reconfigure Enables automatic Slurm reconfiguration when Slurm configuration changes (e.g.
slurm.conf.tpl, partition details). Compute instances and resource policies
(e.g. placement groups) will be destroyed to align with new configuration.
NOTE: Requires Python and Google Pub/Sub API.
WARNING: Toggling this will impact the running workload. Deployed compute nodes
will be destroyed and their jobs will be requeued.
bool false no
enable_shielded_vm Enable the Shielded VM configuration. Note: the instance image must support this option. bool false no
epilog_scripts List of scripts to be used for Epilog. Programs for the slurmd to execute
on every node when a user's job completes.
See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog.
list(object({
filename = string
content = string
}))
[] no
gpu GPU information. Type and count of GPU to attach to the instance template. See
https://cloud.google.com/compute/docs/gpus for more details.
type : the GPU type
count : number of GPUs
object({
type = string
count = number
})
null no
instance_image Defines the image that will be used in the Slurm controller VM instance. This
value is overridden if any of source_image, source_image_family or
source_image_project are set.

Expected Fields:
name: The name of the image. Mutually exclusive with family.
family: The image family to use. Mutually exclusive with name.
project: The project where the image is hosted.

Custom images must comply with Slurm on GCP requirements; it is highly
advised to use the packer templates provided by Slurm on GCP when
constructing custom slurm images.

More information can be found in the slurm-gcp docs:
https://github.com/SchedMD/slurm-gcp/blob/5.3.0/docs/images.md#public-image.
map(string)
{
"family": "schedmd-v5-slurm-22-05-6-hpc-centos-7",
"project": "projects/schedmd-slurm-public/global/images/family"
}
no
labels Labels, provided as a map. map(string) {} no
login_startup_scripts_timeout The timeout (seconds) applied to the login startup script. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
machine_type Machine type to create. string "c2-standard-4" no
metadata Metadata, provided as a map. map(string) {} no
min_cpu_platform Specifies a minimum CPU platform. Applicable values are the friendly names of
CPU platforms, such as Intel Haswell or Intel Skylake. See the complete list:
https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform
string null no
network_ip Private IP address to assign to the instance if desired. string "" no
network_self_link Network to deploy to. Either network_self_link or subnetwork_self_link must be specified. string null no
network_storage Storage to be mounted on all instances.
server_ip : Address of the storage server.
remote_mount : The location in the remote instance filesystem to mount from.
local_mount : The location on the instance filesystem to mount to.
fs_type : Filesystem type (e.g. "nfs").
mount_options : Options to mount with.
list(object({
server_ip = string
remote_mount = string
local_mount = string
fs_type = string
mount_options = string
}))
[] no
on_host_maintenance Instance availability Policy. string "MIGRATE" no
partition Cluster partitions as a list.
list(object({
compute_list = list(string)
partition = object({
enable_job_exclusive = bool
enable_placement_groups = bool
network_storage = list(object({
server_ip = string
remote_mount = string
local_mount = string
fs_type = string
mount_options = string
}))
partition_conf = map(string)
partition_name = string
partition_nodes = map(object({
access_config = list(object({
network_tier = string
}))
bandwidth_tier = string
node_count_dynamic_max = number
node_count_static = number
enable_spot_vm = bool
group_name = string
instance_template = string
node_conf = map(string)
spot_instance_config = object({
termination_action = string
})
}))
partition_startup_scripts_timeout = number
subnetwork = string
zone_policy_allow = list(string)
zone_policy_deny = list(string)
})
}))
[] no
preemptible Allow the instance to be preempted. bool false no
project_id Project ID to create resources in. string n/a yes
prolog_scripts List of scripts to be used for Prolog. Programs for the slurmd to execute
whenever it is asked to run a job step from a new job allocation.
See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog.
list(object({
filename = string
content = string
}))
[] no
region Region where the instances should be created. string null no
service_account Service account to attach to the controller instance. If not set, the
default compute service account for the given project will be used with the
"https://www.googleapis.com/auth/cloud-platform" scope.
object({
email = string
scopes = set(string)
})
null no
shielded_instance_config Shielded VM configuration for the instance. Note: not used unless
enable_shielded_vm is 'true'.
enable_integrity_monitoring : Compare the most recent boot measurements to the
integrity policy baseline and return a pair of pass/fail results depending on
whether they match or not.
enable_secure_boot : Verify the digital signature of all boot components, and
halt the boot process if signature verification fails.
enable_vtpm : Use a virtualized trusted platform module, which is a
specialized computer chip you can use to encrypt objects like keys and
certificates.
object({
enable_integrity_monitoring = bool
enable_secure_boot = bool
enable_vtpm = bool
})
{
"enable_integrity_monitoring": true,
"enable_secure_boot": true,
"enable_vtpm": true
}
no
slurm_cluster_name Cluster name, used for resource naming and slurm accounting. If not provided it will default to the first 8 characters of the deployment name (removing any invalid characters). string null no
slurm_conf_tpl Slurm slurm.conf template file path. string null no
slurmdbd_conf_tpl Slurm slurmdbd.conf template file path. string null no
source_image The custom VM image. It is recommended to use instance_image instead. string "" no
source_image_family The custom VM image family. It is recommended to use instance_image instead. string "" no
source_image_project The project hosting the custom VM image. It is recommended to use instance_image instead. string "" no
static_ips List of static IPs for VM instances. list(string) [] no
subnetwork_project The project that subnetwork belongs to. string null no
subnetwork_self_link Subnet to deploy to. Either network_self_link or subnetwork_self_link must be specified. string null no
tags Network tag list. list(string) [] no
zone Zone where the instances should be created. If not specified, instances will be
spread across available zones in the region.
string null no
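
Object-typed inputs from the table above are written as nested maps in a blueprint. The sketch below shows cloudsql, cloud_parameters and instance_image together. Every address, name and credential in it is a hypothetical placeholder, and the timeout values are illustrative rather than tuned recommendations. Because the table declares cloudsql and cloud_parameters as objects without optional fields, the sketch sets every field of each.

```yaml
# Sketch only: all values below are placeholders.
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - compute_partition
  settings:
    # Use an external accounting database instead of the one on the controller.
    cloudsql:
      server_ip: 10.0.0.5          # hypothetical database address
      user: slurm                  # hypothetical database user
      password: CHANGE_ME          # sensitive; avoid committing real secrets
      db_name: slurm_accounting    # hypothetical database name
    # cloud.conf options; the rates keep their default of 0,
    # the timeouts are raised from the default of 300.
    cloud_parameters:
      resume_rate: 0
      resume_timeout: 600
      suspend_rate: 0
      suspend_timeout: 600
    # Same values as the documented default; swap in a custom image as needed.
    instance_image:
      family: schedmd-v5-slurm-22-05-6-hpc-centos-7
      project: projects/schedmd-slurm-public/global/images/family
```

Note that instance_image is overridden if any of source_image, source_image_family or source_image_project are set, so those three are left unset here.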

Outputs

Name Description
controller_instance_id The server-assigned unique identifier of the controller compute instance.