Document CPU/RAM quotas (#2991)
This PR adds documentation on CPU/RAM quotas introduced in
WATonomous/infra-config#2619
ben-z authored Aug 11, 2024
1 parent 6105787 commit d1bcc11
Showing 4 changed files with 45 additions and 16 deletions.
32 changes: 31 additions & 1 deletion components/quota-table.tsx
@@ -97,4 +97,34 @@ export function NodeLocalQuotaTable({
<TableBody>{rows}</TableBody>
</Table>
)
}
}

export function CPURAMQuotaTable({
    className = "",
}: {
    className?: string
}) {
    const rows = []
    for (const machine of machineInfo.machines.dev_vms) {
        rows.push(
            <TableRow>
                <TableCell>{machine.name}</TableCell>
                <TableCell className='text-center'>{machine.cpu_quota}</TableCell>
                <TableCell className='text-center'>{machine.memory_quota}</TableCell>
            </TableRow>
        )
    }

    return (
        <Table className={className}>
            <TableHeader>
                <TableRow>
                    <TableHead>Node</TableHead>
                    <TableHead className='text-center'>CPU Quota (% of 1 core)</TableHead>
                    <TableHead className='text-center'>Memory Quota (bytes)</TableHead>
                </TableRow>
            </TableHeader>
            <TableBody>{rows}</TableBody>
        </Table>
    )
}
13 changes: 0 additions & 13 deletions pages/docs/compute-cluster/machine-usage-guide.mdx
@@ -122,19 +122,6 @@ Examples of software that we do not install:

If there is a piece of software that you think should be installed on the machines, please reach out to a WATcloud team member.

### `OMP_NUM_THREADS`

On general-use machines, we set the `OMP_NUM_THREADS` environment variable to `1` by default. This is to prevent system overload when running
programs that use OpenMP[^openmp]. This default is consistent with the
[default behaviour of PyTorch Distributed](https://github.com/pytorch/pytorch/blob/4b494d075093096d822b9d614e2719a0e821c6af/torch/distributed/run.py#L758-L763).
If you'd like to change this default, simply set the `OMP_NUM_THREADS` environment variable to the desired value, for example:

```bash copy
export OMP_NUM_THREADS=4
```

[^openmp]: [OpenMP](https://www.openmp.org/) is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran.

## Maintenance and Outages

We try to keep the machines in the cluster up and running at all times. However, we do need to perform regular maintenance to keep the machines
13 changes: 11 additions & 2 deletions pages/docs/compute-cluster/quotas.mdx
@@ -1,3 +1,5 @@
import { GlobalQuotaTable, NodeLocalQuotaTable, CPURAMQuotaTable } from '@/components/quota-table'

# Quotas

To ensure that everyone has a fair share of resources, we enforce a set of quotas in the cluster.
@@ -18,8 +20,6 @@ To ensure that everyone has a fair share of resources, we enforce a set of quota

This section lists the default per-user disk quotas. These quotas are subject to change as we learn more about the usage patterns.

import { GlobalQuotaTable, NodeLocalQuotaTable } from '@/components/quota-table'

#### Global disk quotas

Global disk quotas are quotas on filesystems that are shared across all nodes.
@@ -122,6 +122,15 @@ profile using the [Profile Editor](../utilities/profile-editor).

[^watcloud-contact]: Your WATcloud contact is the person as described in the [Getting Access](./getting-access#determine-your-watcloud-contact) section.

## CPU and memory quotas

On general-use machines, per-user CPU and memory quotas are enforced to ensure fair resource sharing.

<CPURAMQuotaTable className="mt-4" />

Unlike disk quotas, we don't allow users to request CPU and memory quota increases. Please use [SLURM](./slurm) to run resource-intensive jobs.


{
// Separate footnotes from the main content
}
3 changes: 3 additions & 0 deletions scripts/generate-machine-info.py
@@ -166,6 +166,7 @@ def generate_fixtures(data_path):
        }

        if "login_nodes" in group_names:
            login_nodes_config = get_group_config(host, "login_nodes")
            properties.update({
                "cpu_info": get_cpu_info(data_path, name),
                "memory_info": get_memory_info(data_path, name),
@@ -174,6 +175,8 @@ def generate_fixtures(data_path):
                "lsb_release_info": get_lsb_release_info(data_path, name),
                "ssh_host_keys": get_file_lines(data_path, name, "ssh-host-keys.log"),
                "mounts_with_quotas": get_mounts_with_quotas(host),
                "cpu_quota": login_nodes_config.get("cpu_quota"),
                "memory_quota": login_nodes_config.get("memory_max"),
            })
            dev_vms.append(properties)
        if "slurmd_nodes" in group_names:
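
Reviewer note: the new table reports CPU quota as a percentage of one core and memory quota in bytes, which can be hard to read at a glance. The sketch below shows one way those raw values might be converted to cores and GiB. It is not part of this commit: `describe_quota` is a hypothetical helper, and it assumes that `cpu_quota` and `memory_quota` arrive as plain numbers, which the diff does not confirm.

```python
def describe_quota(cpu_quota_percent: float, memory_quota_bytes: int) -> str:
    """Render raw quota values in friendlier units.

    Hypothetical helper for illustration only; assumes numeric inputs.
    """
    cores = cpu_quota_percent / 100  # 100% corresponds to one full core
    gib = memory_quota_bytes / 2**30  # bytes -> GiB (binary gigabytes)
    return f"{cores:g} cores, {gib:.1f} GiB"


if __name__ == "__main__":
    # Example values chosen for illustration, not taken from the cluster config.
    print(describe_quota(800, 34359738368))  # 8 cores, 32.0 GiB
```

If the configured values turn out to be strings (for example with a `%` suffix), they would need to be parsed before being passed to a helper like this.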
