The data science group has a small compute cluster for educational use. We are going to use this for the Speaker Recognition Challenge of the course MLiP 2023.
The cluster consists of two compute nodes, lovingly named `cn47` and `cn48`, and a so-called head node, `cn84`. All these machines live in the domain `science.ru.nl`, so the head node's fully qualified name is `cn84.science.ru.nl`.
Both compute nodes have the following specifications:
- 8 Nvidia RTX 2080 Ti GPUs, with 11 GB memory
- 48 Xeon CPUs
- 128 GB memory, shared between the CPUs
- Linux Ubuntu 20.04 operating system
The head node has the same OS installed as the compute nodes, but does not have GPUs, and is not intended for heavy computation. The general idea is that you use the head node for
- simple editing and file manipulation
- submitting jobs to the compute nodes and controlling these jobs
You need a science account in order to be able to log into the cluster.
These nodes are not directly accessible from the internet. In order to reach these machines you need to either

- use the science.ru VPN
  - you then have direct access to `cn84`, which is somewhat easier for copying through `scp` and `rsync` (see the sketch below), remote editing, etc.

    ```
    local+vpn$ ssh [email protected]
    ```

- login through the machine `lilo.science.ru.nl`
  - The preferred way is to use the `ProxyJump` option of ssh:

    ```
    local$ ssh -J [email protected] [email protected]
    ```

  - Alternatively, you can log in in two steps. In case you have to transport files, please be reminded that only your (small) home filesystem `~` is available on `lilo`.

    ```
    local$ ssh [email protected]
    lilo7$ ssh cn84
    ```
Either way, you will be working through a secure-shell connection, so you must have an `ssh` client on your local laptop/computer.
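For copying data, `scp` and `rsync` work over the same routes. The commands below are only a sketch: `YOURUSER` and the file names are placeholders, to be replaced with your science account name and your own files.

```
## with the science.ru VPN active, copy straight to your home directory on cn84
local+vpn$ scp mydata.tar.gz YOURUSER@cn84.science.ru.nl:~/

## without the VPN, jump through lilo
local$ scp -o ProxyJump=YOURUSER@lilo.science.ru.nl mydata.tar.gz YOURUSER@cn84.science.ru.nl:~/
local$ rsync -av -e "ssh -J YOURUSER@lilo.science.ru.nl" mydata/ YOURUSER@cn84.science.ru.nl:mydata/
```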
There are several places where you can store code and data. They have different characteristics:
| filesystem | size | speed | scope |
|---|---|---|---|
| `~` | 10 GB | fast | shared |
| `/scratch` | a few TB | fastest | local |
| `/ceph/csedu-scratch` | several TB | slow | shared |
The limitations on the home filesystem, `~` (a.k.a. `$HOME`), are pretty tight: just installing pytorch typically consumes a significant portion of your disk quota. We have a "cluster preparation" script that will set up an environment for you that will give you the best experience working on the cluster:

- python packages are installed in a virtual environment
- source data, logs, and models are put on the large shared filesystem `/ceph`
- python libraries are copied to the fast local filesystem `/scratch` on each node
- soft-links to these places are made in the project directory
- the project code is available on the fast shared filesystem `~`
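If you want to see how much of your home quota is actually in use, standard tools give a rough picture (we are not sure which quota command, if any, is configured on the science filesystems, so this is just a generic sketch):

```
du -sh ~/* 2>/dev/null | sort -h   ## per-directory usage in your home directory
df -h ~                            ## overall usage of the home filesystem
```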
Before you can carry out the instructions below properly, you need to fork this repository on Gitlab, check out a clone in your home directory on the cluster, and set up the environment. You can follow the instructions here.
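A rough sketch of those steps on the head node (the clone URL below is purely a placeholder; copy the real one from your fork's Gitlab page):

```
cd ~
git clone https://gitlab.example/YOURUSER/YOURFORK.git   ## placeholder URL, use your fork's clone URL
cd YOURFORK
```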
The cluster is an environment where multiple people share compute resources in a co-operative way. Something needs to manage all these resources, and that process is called a workload manager. At science.ru we use SLURM for this, as do many other compute clusters in the world.
Slurm is a clever piece of software, but in the tradition of hard-core computing environments most of the documentation that is available is in plain text "man pages" and inaccessible mailing lists. View the experience as a time machine, going back to the 1970's...
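Those man pages are at least close at hand: assuming they are installed on the head node, you can read them right there.

```
man sbatch   ## options for batch job submission
man srun     ## options for interactive jobs
man squeue   ## how to read the job queue listing
```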
It is possible to ask for an interactive shell on one of the compute nodes. This will only work smoothly if there is a slot available; if the cluster is "full", jobs will wait until a slot frees up, and this may take a while. An interactive session takes up a slot. In this example we ask for a single GPU. The command `srun` is what makes it all happen; the other commands run inside the session fired up by `srun`:
```
srun --pty --partition csedu --gres gpu:1 /bin/bash
hostname     ## we're on cn47 or cn48
nvidia-smi   ## it appears there is 1 GPU available in this machine
exit         ## make the slot available again, exit to cn84 again
```
In general, we would advise against using the interactive shell option as described here, with a GPU and all, unless you just need to do a quick check in a situation where a GPU is required.
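If you only want to poke around without a GPU, a cheaper interactive session is presumably possible by dropping the `--gres` option and asking for modest resources (a sketch, not tested on this cluster):

```
srun --pty --partition csedu --mem=1G --time=0:15:00 /bin/bash
exit   ## give the slot back when you are done
```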
The normal way of working on the cluster is by submitting a batch job. This consists of several components:
- a script (typically bash) that contains all instructions to run the job
- job control information specifying resources that you need for the job
- information on where to store the output (standard out and error)
A job is submitted using `sbatch`, specifying the script as an argument and the other information as options.
As an example, look at this file, which is a minimalistic script that just gives some information about the environment in which the script runs. You can submit this for running on the cluster using
```
sbatch --partition csedu --gres gpu:1 experiments/slurm-job.sh
squeue
```
The `sbatch` will return immediately (unlike the `srun` earlier), and if you were quick enough with typing the `squeue` you might have seen your job either running or being queued in the job queue.
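The queue lists jobs of all users; to see only your own, you can filter on your user name:

```
squeue -u $USER          ## only your jobs
squeue -u $USER --long   ## the same, with a bit more detail such as the time limit
```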
When the job has started, you will find a file named `slurm-$jobid.out` in the current working directory:
```
ls slurm-*
```
This is where the standard output of the script is collected.
Having the metadata (`--partition`, `--gres`, etc.) on the command line, separate from the script, may not always be handy. Therefore SLURM allows the specification of the job metadata inside the script, using a special `#SBATCH` syntax. For bash (and most other scripting languages) the `#` starts a comment, so it has no meaning to the script itself.
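As a minimal illustration (this is not the actual skeleton script, and the resource values are made up), a batch script with embedded `#SBATCH` options could look like this:

```
#!/usr/bin/env bash
#SBATCH --partition=csedu
#SBATCH --gres=gpu:1
#SBATCH --mem=4G
#SBATCH --time=0:10:00
#SBATCH --output=./logs/slurm/%J.out

## below the #SBATCH header it is an ordinary bash script
hostname
nvidia-smi
```

Such a script can then be submitted with a plain `sbatch experiments/my-script.sh` (a hypothetical filename), without repeating the options on the command line.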
A full example is in the skeleton training script. Inspect the top of this script; it contains tons of instructions for `sbatch`.
This skeleton training script is written in a "relative paths" style, assuming you will submit the job while your current working directory is the root of this repository, i.e., through calling `sbatch experiments/experiment_1_cluster.sh`. E.g., the logfiles are indicated as `./logs/slurm/%J.out`, where `./logs` refers to the link you've made above when setting up the virtual environment. In this way we don't have to put "hard paths" in the script, which would include your user-specific installation directory, and the script will work for every user.
The following `#SBATCH` options are in this example:

- `--partition=csedu`: specifying the subset of all science.ru nodes; we will always be using `csedu`, referring to `cn47` and `cn48`.
- `--gres=gpu:1`: we want one GPU
- `--mem=10G`: we think the job will not use more than 10 GB of CPU memory
- `--cpus-per-task=6`: we want to claim 6 CPUs for this task (mainly for the dataloaders)
- `--time=6:00:00`: we expect the training to be finished well before 6 hours (wall clock) time. SLURM will terminate the job if it takes longer...
- `--output=./logs/slurm/%J.out`: the place where the stdout is collected. `%J` refers to the job ID.
- `--error=./logs/slurm/%J.err`: this is where stderr is collected
- `--mail-type=BEGIN,END,FAIL`: specify that we want a mail message sent to our science account email at the start and finish, and in case of a failed job.
- `--qos=csedu-normal`: this specifies that your job can run for at most 12 hours. If you want to run a job which can run for at most 48 hours, you can use `--qos=csedu-large`, but you will have decreased priority.
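These options control how the job runs; for managing a job after submission, the following commands are useful (the job ID `1234567` is a made-up example):

```
squeue -u $USER              ## list your own running and pending jobs, with their job IDs
scancel 1234567              ## cancel a job, whether it is running or still queued
scontrol show job 1234567    ## detailed info on a job, e.g., why it is still pending
```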
When you are ready for it, you can run your first skeleton speaker recognition training job. The options in the command-line training script are explained here; below we show how to submit the job in SLURM. Beware: completing the training takes several hours, even with this minimalistic neural network.
```
sbatch experiments/experiment_1_cluster.sh
```
You can now inspect the status of your job using `squeue`, and watch the training progressing slowly using `tail -f logs/slurm/$jobid.out`.
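For example, if `squeue` reports your job under the (made-up) ID `1234567`, you could follow both output streams with:

```
tail -f logs/slurm/1234567.out   ## training progress (stdout)
tail -f logs/slurm/1234567.err   ## warnings and errors (stderr)
```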