Description

This module creates a compute partition that can be used as input to the schedmd-slurm-gcp-v5-controller.

The partition module is designed to work alongside the schedmd-slurm-gcp-v5-node-group module. A partition can be made up of one or more node groups, provided either through use (preferred) or defined manually in the node_groups variable.

Warning: updating a partition and running terraform apply will not cause the slurm controller to update its own configurations (slurm.conf) unless enable_reconfigure is set to true in the partition and controller modules.
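As a minimal sketch of the warning above, the fragment below sets enable_reconfigure in both the partition and the controller. The slurm_controller module ID and the controller source path are assumptions made for illustration; network1, node_group_1 and compute_partition are the IDs used in the example in this README.

```yaml
# Hypothetical fragment: enable_reconfigure is set to true in BOTH modules,
# so that terraform apply can propagate partition changes to slurm.conf.
- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - node_group_1
  settings:
    partition_name: compute
    enable_reconfigure: true

# Assumed controller module ID and source path (not taken from this README).
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - compute_partition
  settings:
    enable_reconfigure: true
```

Per the enable_reconfigure input description, a reconfigure destroys deployed compute nodes and requeues their jobs, so it may be worth enabling only when that impact is acceptable.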

Example

The following code snippet creates a partition module with:

  • 2 node groups added via use.
    • The first node group is made up of machines of type c2-standard-30.
    • The second node group is made up of machines of type c2-standard-60.
    • Both node groups have a maximum count of 200 dynamically created nodes.
  • partition name of "compute".
  • connected to the network1 module via use.
  • nodes mounted to homefs via use.
- id: node_group_1
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    name: c30
    node_count_dynamic_max: 200
    machine_type: c2-standard-30

- id: node_group_2
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    name: c60
    node_count_dynamic_max: 200
    machine_type: c2-standard-60

- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - homefs
  - node_group_1
  - node_group_2
  settings:
    partition_name: compute

For a complete example using this module, see slurm-gcp-v5-cluster.yaml.

Compute VM Zone Policies

WARNING: Lenient zone policies can lead to additional egress costs when moving data between Google Cloud resources in different zones in the same region, such as between filestore and other VM instances. For more information on egress fees, see the Network Pricing Google Cloud documentation.

To avoid egress charges, ensure your compute nodes are created in the same zone as the other resources that share data with them by setting zone_policy_deny to all other zones in the region.

The Slurm on GCP partition modules provide the option to set policies regarding which zone the compute VM instances will be created in through the zone_policy_allow and zone_policy_deny variables.

As an example, see the following module:

- id: partition-with-zone-policy
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  settings:
    zone_policy_allow:
    - us-central1-a
    - us-central1-b
    zone_policy_deny: [us-central1-f]

In this module, the following is defined:

  • us-central1-a and us-central1-b zones have been explicitly allowed.
  • us-central1-f has been explicitly denied, therefore no nodes in this partition will be created in that zone.
  • Since us-central1-c was not included in the zone policy, it will default to "Allow", which means the partition has the same likelihood of creating a node in that zone as the zones explicitly listed under zone_policy_allow.

NOTE: zone_policy_allow does not guarantee the use of specified zones because zones are allowed by default. Configure zone_policy_deny to ensure that zones outside the allowed list are not used.
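To illustrate the note above, here is a sketch of a partition restricted to a single zone by explicitly denying every other zone. It assumes the region is us-central1 and that its only zones are a, b, c and f; the module ID is hypothetical, and the deny list would need to match the zones that actually exist in your region.

```yaml
# Hypothetical fragment: pin compute nodes to us-central1-a by denying
# all other zones assumed to exist in the region.
- id: partition-single-zone-by-policy
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  settings:
    zone_policy_allow:
    - us-central1-a
    zone_policy_deny:
    - us-central1-b
    - us-central1-c
    - us-central1-f
```

Listing the denied zones by hand keeps the policy explicit, at the cost of having to update the list if the region gains a zone.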

Setting a Single Zone

The zone variable is another option for setting the zone policy. If zone is set and neither zone_policy_deny nor zone_policy_allow is set, the policy will be configured as follows:

  • All currently active zones in the region at deploy time will be set in the zone_policy_deny list, with the exception of the provided zone.
  • The provided zone will be set as the only value in the zone_policy_allow list.

If zone is set together with zone_policy_allow or zone_policy_deny, the zone policy variables take precedence and zone is ignored.

NOTE: If a new zone is added to the region while the cluster is active, nodes in the partition may be created in that zone as well. In this case, the partition may need to be redeployed (possible via enable_reconfigure if set) to ensure the newly added zone is set to "Deny".
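The single-zone behavior described in this section can be sketched with the fragment below. The module ID and the zone value are illustrative assumptions; neither zone_policy_allow nor zone_policy_deny is set, so the zone variable determines the policy.

```yaml
# Hypothetical fragment: only zone is set, so at deploy time the module
# allows us-central1-a and denies all other currently active zones
# in the region.
- id: partition-single-zone
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  settings:
    partition_name: compute
    zone: us-central1-a
```

In many blueprints zone is defined once as a deployment variable rather than per module; whether that applies here depends on how your blueprint passes variables.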

Support

The HPC Toolkit team maintains the wrapper around the slurm-on-gcp terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.

License

Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

Name       Version
terraform  >= 0.13.0
google     >= 3.83

Providers

Name    Version
google  >= 3.83

Modules

Name             Source                                                                              Version
slurm_partition  github.com/SchedMD/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_partition  5.3.0

Resources

Name                            Type
google_compute_zones.available  data source

Inputs

Name Description Type Default Required
additional_disks Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead.
list(object({
disk_name = string
device_name = string
disk_size_gb = number
disk_type = string
disk_labels = map(string)
auto_delete = bool
boot = bool
}))
null no
bandwidth_tier Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
can_ip_forward Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. bool null no
deployment_name Name of the deployment. string n/a yes
disable_smt Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. bool null no
disk_auto_delete Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. bool null no
disk_size_gb Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. number null no
disk_type Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
enable_confidential_vm Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. bool null no
enable_oslogin Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. bool null no
enable_placement Enable placement groups. bool true no
enable_reconfigure Enables automatic Slurm reconfiguration when the Slurm configuration changes (e.g.
slurm.conf.tpl, partition details). Compute instances and resource policies
(e.g. placement groups) will be destroyed to align with new configuration.

NOTE: Requires Python and Google Pub/Sub API.

WARNING: Toggling this will impact the running workload. Deployed compute nodes
will be destroyed and their jobs will be requeued.
bool false no
enable_shielded_vm Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. bool null no
enable_spot_vm Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. bool null no
exclusive Exclusive job access to nodes. bool true no
gpu Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead.
object({
count = number,
type = string
})
null no
is_default Sets this partition as the default partition by updating the partition_conf.
If "Default" is already set in partition_conf, this variable will have no effect.
bool false no
labels Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. any null no
machine_type Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
metadata Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. map(string) null no
min_cpu_platform Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
network_storage An array of network attached storage mounts to be configured on the partition compute nodes.
list(object({
server_ip = string,
remote_mount = string,
local_mount = string,
fs_type = string,
mount_options = string
}))
[] no
node_conf Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. map(any) null no
node_count_dynamic_max Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. number null no
node_count_static Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. number null no
node_groups A list of node groups associated with this partition. See
schedmd-slurm-gcp-v5-node-group for more information on defining a node
group in a blueprint.
list(object({
access_config = list(object({
network_tier = string
}))
node_count_static = number
node_count_dynamic_max = number
group_name = string
node_conf = map(string)
additional_disks = list(object({
disk_name = string
device_name = string
disk_size_gb = number
disk_type = string
disk_labels = map(string)
auto_delete = bool
boot = bool
}))
bandwidth_tier = string
can_ip_forward = bool
disable_smt = bool
disk_auto_delete = bool
disk_labels = map(string)
disk_size_gb = number
disk_type = string
enable_confidential_vm = bool
enable_oslogin = bool
enable_shielded_vm = bool
enable_spot_vm = bool
gpu = object({
count = number
type = string
})
instance_template = string
labels = map(string)
machine_type = string
metadata = map(string)
min_cpu_platform = string
on_host_maintenance = string
preemptible = bool
service_account = object({
email = string
scopes = list(string)
})
shielded_instance_config = object({
enable_integrity_monitoring = bool
enable_secure_boot = bool
enable_vtpm = bool
})
spot_instance_config = object({
termination_action = string
})
source_image_family = string
source_image_project = string
source_image = string
tags = list(string)
}))
[] no
on_host_maintenance Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
partition_conf Slurm partition configuration as a map.
See https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION
map(string) {} no
partition_name The name of the slurm partition. string n/a yes
preemptible Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
project_id Project in which the HPC deployment will be created. string n/a yes
region The default region for Cloud resources. string n/a yes
service_account Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead.
object({
email = string
scopes = set(string)
})
null no
shielded_instance_config Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead.
object({
enable_integrity_monitoring = bool
enable_secure_boot = bool
enable_vtpm = bool
})
null no
slurm_cluster_name Cluster name, used for resource naming and slurm accounting. If not provided it will default to the first 8 characters of the deployment name (removing any invalid characters). string null no
source_image Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
source_image_family Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
source_image_project Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. string null no
spot_instance_config Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead.
object({
termination_action = string
})
null no
subnetwork_project The project the subnetwork belongs to. string "" no
subnetwork_self_link Subnet to deploy to. string null no
tags Deprecated: Use the schedmd-slurm-gcp-v5-node-group module for defining node groups instead. list(string) null no
zone Zone in which to create all compute VMs. If zone_policy_deny or zone_policy_allow are set, the zone variable will be ignored. string null no
zone_policy_allow Partition nodes will prefer to be created in the listed zones. If a zone appears
in both zone_policy_allow and zone_policy_deny, then zone_policy_deny will take
priority for that zone.
set(string) [] no
zone_policy_deny Partition nodes will not be created in the listed zones. If a zone appears in
both zone_policy_allow and zone_policy_deny, then zone_policy_deny will take
priority for that zone.
set(string) [] no
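As a rough illustration of the non-deprecated inputs above, the fragment below sets a few of them together. The module ID, the values, and the MaxTime key are assumptions for illustration: partition_conf is a map(string) whose keys are passed through as Slurm partition parameters, so any key used there should be checked against the slurm.conf documentation linked in the partition_conf description.

```yaml
# Hypothetical fragment combining several partition-level inputs.
- id: debug_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - node_group_1
  settings:
    partition_name: debug
    is_default: true          # has no effect if partition_conf already sets "Default"
    exclusive: false          # default is true
    enable_placement: false   # default is true
    partition_conf:
      MaxTime: "08:00:00"     # assumed example key; values must be strings
```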

Outputs

Name       Description
partition  Details of a slurm partition