From d94547d9a81a173b22df79a6017d21681c364719 Mon Sep 17 00:00:00 2001
From: Andy Turner
Date: Tue, 27 Feb 2024 11:20:48 +0000
Subject: [PATCH] Add Capability Day notes

---
 docs/user-guide/scheduler.md | 63 ++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/docs/user-guide/scheduler.md b/docs/user-guide/scheduler.md
index 9d2e06a07..bdc734459 100644
--- a/docs/user-guide/scheduler.md
+++ b/docs/user-guide/scheduler.md
@@ -2272,6 +2272,69 @@ Your request will be checked by the ARCHER2 User Administration team and, if app
 !!! tip
     You can submit jobs to a reservation as soon as the reservation has been set up; jobs will remain queued until the reservation starts.
 
+## Capability Days
+
+!!! important
+    The next ARCHER2 Capability Day is 0900 14 Mar - 0900 15 Mar 2024.
+
+ARCHER2 Capability Days are a mechanism to allow users to run large-scale (512 nodes or more) tests
+on the system free of charge. The motivations behind Capability Days are:
+
+- Enhancing world-leading science from ARCHER2 by enabling modelling and simulation at scales that are not otherwise possible.
+- Enabling capability use cases that are not possible on other UK HPC services.
+- Providing a facility that can be used to test scaling, helping to prepare software and communities for future exascale resources.
+
+To enable this, a 24-hour period is made available regularly during which users can run jobs free of
+charge, subject to the following limits:
+
+- Minimum job size: 512 nodes
+    - Individual job steps (i.e. `srun` commands) within job scripts should also use a minimum of 512 nodes
+    - Jobs that do not adhere to these limits will be killed
+- Maximum walltime: 3 hours
+- Job numbers: maximum of 8 jobs per user in the QoS
+    - Maximum of 2 jobs running per user
+- Users must have a valid, positive CU budget to be able to run jobs during Capability Days
+
+Users wishing to run jobs during a Capability Day should submit to the `capabilityday` QoS. Jobs can be
+submitted ahead of time and will start when the Capability Day starts (see the submission sketch at the
+end of this section).
+
+### Example Capability Day job submission script
+
+```slurm
+#!/bin/bash
+#SBATCH --job-name=capability_job
+#SBATCH --nodes=1024
+#SBATCH --ntasks-per-node=8
+#SBATCH --cpus-per-task=16
+#SBATCH --time=01:00:00
+#SBATCH --partition=standard
+#SBATCH --qos=capabilityday
+#SBATCH --account=t01
+
+export OMP_NUM_THREADS=16
+export OMP_PLACES=cores
+export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
+
+# Check process/thread placement
+module load xthi
+srun --hint=nomultithread --distribution=block:block xthi > placement-${SLURM_JOBID}.out
+
+srun --hint=nomultithread --distribution=block:block my_app.x
+```
+
+### Capability Day tips
+
+- The OFI communications protocol seems to work more reliably at capability scale than the UCX protocol (see the sketch after this list for switching transports)
+    - UCX often sees memory/timeout errors
+- All-to-all collective patterns do not generally scale well to large MPI process counts, particularly when there are many MPI processes per node
+    - cf. the Frontier exascale system, which typically has a maximum of 8 MPI processes per node (1 per GPU); its 9,408 compute nodes give a maximum of 75,264 MPI processes for a whole-system job
+    - 4096 ARCHER2 compute nodes with 1 MPI process per core gives 524,288 MPI processes!
+- MPI-IO does not generally scale well to high process counts unless the IO pattern is very simple
+    - The same applies to IO libraries built on top of MPI-IO: parallel HDF5, NetCDF
+    - Consider a different parallel IO approach, e.g. ADIOS2
+- Make use of the scratch, solid state file system so that you do not hit unexpected storage quota issues
+- With very high MPI process counts you may see long MPI startup times; allow for this in the walltime you request in your job scripts
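+
+The first tip above concerns the MPI transport layer. OFI is the default on ARCHER2; as a sketch,
+UCX can be selected instead by swapping modules. The module names assume the standard HPE Cray
+programming environment arrangement; check `module avail` on the system for the exact names.
+
+```bash
+# ARCHER2 defaults to the OFI transport; these swaps select UCX instead.
+# Run before compiling your application and include in your job script.
+module swap craype-network-ofi craype-network-ucx
+module swap cray-mpich cray-mpich-ucx
+```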
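+
+As noted earlier in this section, jobs can be submitted to the `capabilityday` QoS before the
+Capability Day starts. A minimal sketch of submitting and checking such a job (the script
+filename here is illustrative):
+
+```bash
+# Submit the job script; it will queue until the Capability Day starts
+sbatch capability_job.slurm
+
+# List your jobs in the capabilityday QoS
+squeue -u $USER --qos=capabilityday
+```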
+
 ## Serial jobs
 
 You can run serial jobs on the shared data analysis nodes. More information