Commit
Merge branch 'main' of gitlab.science.ru.nl:lnguyen/tiny-voxceleb-skeleton-2023
anilsson committed Mar 8, 2023
2 parents 2ab1cc8 + 20c8791 commit 6e7f22e
Showing 19 changed files with 219 additions and 110 deletions.
5 changes: 2 additions & 3 deletions .env.example
@@ -1,7 +1,6 @@
# In this file, we store environment variables which are used in
# scripts throughout this project

SCIENCE_USERNAME=put_your_science_username_here

-# '/home/$USERNAME/tiny-voxceleb-skeleton' is just a guess, can be some other value
-DATA_FOLDER=/home/$SCIENCE_USERNAME/tiny-voxceleb-skeleton/data
+# this path is just a guess, can be some other value
+DATA_FOLDER=/home/$SCIENCE_USERNAME/mlip/tiny-voxceleb-skeleton-2023/data
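For context, these variables are meant to be picked up by the shell scripts under `scripts/`. A minimal sketch of how a script can load them, assuming a plain `source` of the file (the repository's own scripts may load it differently):

```bash
#!/usr/bin/env bash
# Load the variables defined in .env into this script's environment.
set -a           # export every variable assigned while sourcing
source .env      # assumes .env.example has been copied to .env and filled in
set +a

echo "expecting the audio data under: $DATA_FOLDER"
```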
6 changes: 3 additions & 3 deletions cli_evaluate.py
@@ -25,7 +25,7 @@
from skeleton.models.prototype import PrototypeSpeakerRecognitionModule

from skeleton.evaluation.evaluation import (
-evaluate_speaker_trails,
+evaluate_speaker_trials,
EmbeddingSample,
load_evaluation_pairs,
)
@@ -69,7 +69,7 @@
parser.add_argument(
"--use-gpu",
type=lambda x: x.lower() in ("yes", "true", "t", "1"),
-default=False,
+default=True,
help="whether to evaluate on a GPU device",
)

@@ -161,7 +161,7 @@ def main(

# for each trial, compute scores based on cosine similarity between
# speaker embeddings
-results = evaluate_speaker_trails(pairs, embeddings, skip_eer=True)
+results = evaluate_speaker_trials(pairs, embeddings, skip_eer=True)
scores = results['scores']

# write each trial (with computed score) to a file
10 changes: 5 additions & 5 deletions doc/bootstrap.md
@@ -1,12 +1,12 @@
## Bootstrapping the first challenge in MLIP

- Form a team of three or four students
-- Make a [fork](#forking-the-repository-on-scienceru-gitlab) of the repository on Science Gitlab to one of your team member's science account, and add the other team members
-- Configure the Science [VPN](https://wiki.cncz.science.ru.nl/Vpn)
-- Log in to the [compute clusters](cluster.md) machine `slurm22.science.ru.nl`
-- Set up an [ssh private/public key pair](clone.md#etting-up-an-ssh-key-in-order-to-clone-your-copy-of-the-repo) to access this cloned repository from the science cluster
+- Configure the Science [VPN](https://wiki.cncz.science.ru.nl/Vpn) and connect to it
+- Make a [fork](./clone.md#forking-the-repository-on-scienceru-gitlab) of the repository on Science Gitlab to one of your team member's science account, and add the other team members
+- Log in to the [compute clusters](cluster.md) machine `cn84.science.ru.nl`
+- Set up an [ssh private/public key pair](clone.md#setting-up-an-ssh-key-in-order-to-clone-your-copy-of-the-repo) to access this cloned repository from the science cluster
- [Clone](clone.md#cloning) your private Gitlab repository to the cluster
-- [Set up](clone.md##setting-up-links-and-virtual-environments-in-the-cluster) the environment on the cluster
+- [Set up](clone.md#setting-up-links-and-virtual-environments-in-the-cluster) the environment on the cluster
- Submit your first [SLURM job](cluster.md#queuing-slurm-jobs)
- Study the code in the [skeleton](skeleton.md), perhaps make some trivial changes
- Submit your first speaker recognition [training](skeleton.md#training-the-basic-network) SLURM job with [sbatch](cluster.md#more-advanced-slurm-scripts)
18 changes: 12 additions & 6 deletions doc/clone.md
@@ -1,4 +1,4 @@
-## Forking and cloning this repository
+## Setting up the cluster environment

It is handy to have a central repository for the code that your team is working on. You can easily do this by _forking_ the repo directly from science.ru gitlab.

@@ -20,12 +20,20 @@ Each member, before they can clone this forked repo to their local computer or t
### Setting up an SSH key in order to clone your copy of the repo

First, if you have never done so for the machine you're working on (your local computer or the cluster), generate a public/private key pair:

```
$ ssh-keygen
## and hit <return> a few times
```

Also put the newly generated key in `~/.ssh/authorized_keys`.

```
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```

Then, print your public key in the terminal and copy it to the clipboard:

```
$ cat ~/.ssh/id_rsa.pub
```
@@ -55,7 +63,7 @@ You can repeat this process of adding an ssh-key for each computer from which yo

### Cloning

-Now, if you want to clone this repo to the cluster, log on to the cluster node `slurm22` (through VPN or via `lilo`). If you want to clone to a local computer, open a local shell.
+Now, if you want to clone this repo to the cluster, log on to the cluster node `cn84` (through VPN or via `lilo`). If you want to clone to a local computer, open a local shell.

You can copy the exact URL for cloning by clicking the _Clone_ button on your own repository:

@@ -90,21 +98,19 @@ while the remote `upstream` points to the original, skeleton code you forked. Yo
git remote -v
```


### Setting up links and virtual environments in the cluster

If everything is all right, you have a reasonably clean set of files and directories upon first checkout (we will from now on drop the command prompt `$` in the example code):
```bash
ls
```
Now run the script for setting up the virtual environment and links to various places where data is / will be stored. This script will take several minutes to complete:

```bash
-scripts/prepare_cluster.sh
+./scripts/prepare_cluster.sh
ls -l
```
You will see the soft links made to
- `data`, where the audio data is stored,
- `logs`, where results and log outputs of your scripts are stored
- `venv`, the python virtual environment that has been copied to local discs on cluster nodes `cn47` and `cn48`.


28 changes: 16 additions & 12 deletions doc/cluster.md
@@ -2,7 +2,7 @@

The data science group has a small compute cluster for educational use. We are going to use this for the Speaker Recognition Challenge of the course [MLiP 2023](https://brightspace.ru.nl/d2l/home/333310).

-The cluster consists of two _compute nodes_, lovingly named `cn47` and `cn48`, and a so-called _head node_, `slurm22`. All these machines live in the domain `science.ru.nl`, so the head node's fully qualified name is `slurm22.science.ru.nl`.
+The cluster consists of two _compute nodes_, lovingly named `cn47` and `cn48`, and a so-called _head node_, `cn84`. All these machines live in the domain `science.ru.nl`, so the head node's fully qualified name is `cn84.science.ru.nl`.

Both compute nodes have the following specifications:
- 8 Nvidia RTX 2080 Ti GPUs, with 11 GB memory
@@ -14,23 +14,25 @@ The head node has the same OS installed as the compute nodes, but does not have
- simple editing and file manipulation
- submitting jobs to the compute nodes and controlling these jobs

### accessing the cluster

You need a [science account](https://wiki.cncz.science.ru.nl/Nieuwe_studenten#.5BScience_login_.28vachternaam.29_.5D.5BScience_login_.28isurname.29.5D) in order to be able to log into the cluster.

These nodes are not directly accessible from the internet; in order to reach these machines you need to either
- use the science.ru [VPN](https://wiki.cncz.science.ru.nl/Vpn)
-- you have direct access to `slurm22`, this is somewhat easier with copying through `scp` and `rsync`, remote editing, etc.
+- you have direct access to `cn84`, this is somewhat easier with copying through `scp` and `rsync`, remote editing, etc.
- ```
-local+vpn$ ssh slurm22
+local+vpn$ ssh [email protected]
```
-- login through the machine `lilo.science.ru.nl`.
+- login through the machine `lilo.science.ru.nl`
- The preferred way is to use the `ProxyJump` option of ssh:
```
local$ ssh -J [email protected] [email protected]
```
- Alternatively, you can login in two steps. In case you have to transport files, please be reminded only your (small) home filesystem `~` is available on `lilo`.
```
local$ ssh -J lilo.science.ru.nl cn99
```
- Alternatively, you can login in two steps. In case you have to transport files, please be reminded only your (small) home filesystem `~` is available on `lilo`.
- ```
-local$ ssh lilo.science.ru.nl
-lilo7$ ssh slurm22
+local$ ssh [email protected]
+lilo7$ ssh cn84
```
Either way, you will be working through a secure-shell connection, so you must have a `ssh` client on your local laptop/computer.
@@ -54,7 +56,7 @@ The limitations on the home filesystem, `~` (a.k.a. `$HOME`) are pretty tight---
### Forking and cloning the repository
-Before you can carry out the instructions below properly, you need to fork this repository on Gitlab, and check out a clone on your home directory on the cluster You can follow the [instructions here](./clone.md).
+Before you can carry out the instructions below properly, you need to fork this repository on Gitlab, check out a clone on your home directory on the cluster, and set up the environment. You can follow the [instructions here](./clone.md).
## SLURM
@@ -69,9 +71,10 @@ It is possible to ask for an interactive shell to one of the compute nodes. Thi
srun --pty --partition csedu --gres gpu:1 /bin/bash
hostname ## we're on cn47 or cn48
nvidia-smi ## it appears there is 1 GPU available in this machine
-exit ## make the slot available again, exit to slurm22 again
+exit ## make the slot available again, exit to cn84 again
```
In general, we would advise not to use the interactive shell option, as described here, with a GPU and all, unless you need to just do a quick check in a situation where a GPU is required.
### Queuing slurm jobs
The normal way of working on the cluster is by submitting a batch job. This consists of several components:
@@ -111,6 +114,7 @@ The following `#SBATCH` options are in this example:
- `--output=./logs/slurm/%J.out`: The place where the stdout is collected. `%J` refers to the job ID.
- `--error=./logs/slurm/%J.err`: This is where stderr is collected
- `--mail-type=BEGIN,END,FAIL`: specify that we want a mail message sent to our science account email at the start and finish, and in case of a failed job.
- `--qos=csedu-normal`: This specifies that your job can run for at most 12 hours. If you want to run a job which can run for at most 48 hours, you can use `qos=csedu-large`, but you will have decreased priority.

When you are ready for it, you can run your first [skeleton speaker recognition](./skeleton.md) training job. The options in the command-line training script are explained [here](./skeleton.md), here we will show you how to submit the job in slurm. Beware: completing the training takes several hours, even with this [minimalistic neural network](../skeleton/models/prototype.py#L124-126).

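To tie the `#SBATCH` options above together, here is a minimal job-script sketch; the partition, account and qos values mirror `experiments/experiment_1_cluster.sh` from this commit, and the final command is a hypothetical placeholder for your actual training entry point:

```bash
#!/usr/bin/env bash
#SBATCH --partition=csedu
#SBATCH --account=cseduimc030
#SBATCH --qos=csedu-normal
#SBATCH --gres=gpu:1
#SBATCH --mem=10G
#SBATCH --cpus-per-task=6
#SBATCH --time=12:00:00
#SBATCH --output=./logs/slurm/%J.out
#SBATCH --error=./logs/slurm/%J.err
#SBATCH --mail-type=BEGIN,END,FAIL

# the actual work goes here; replace the placeholder with your real training command
python your_training_script.py
```

Save it as e.g. `experiments/my_job.sh`, submit it with `sbatch experiments/my_job.sh`, and check the state of your queued and running jobs with `squeue`.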
14 changes: 7 additions & 7 deletions doc/honour-code.md
@@ -15,23 +15,23 @@ In order to keep this project fun but also educational, we ask you to respect th
We want every group to be able to use GPU resources provided in the CSEDU compute cluster. Therefore, we ask everyone to honour these rules:

* Do **not** `ssh` into `cn47` and/or `cn48` directly to run any experiments. This leads to other programs crashing due to e.g. out of memory errors, and this way the resources on the cluster cannot be allocated (fairly).
-* Your group should only have one running job at a time.
+* Your group should only have one long-running job at a time.
* If you submit an [array job](https://slurm.schedmd.com/job_array.html), you must use a parallelism of `1`, by using `%1` in e.g. `#SBATCH --array=0-4%1`.
-* Your jobs can use a maximum of 6 CPUs, 16 GB memory, and 1 GPU.
+* Your jobs can use a maximum of 6 CPUs, 15 GB memory, and 1 GPU.
* This can be controlled with the `SBATCH` parameters below
```
#SBATCH --gres=gpu:1 # this value may not exceed 1
-#SBATCH --mem=10G # this value may not exceed 16
+#SBATCH --mem=10G # this value may not exceed 15
#SBATCH --cpus-per-task=6 # this value may not exceed 6
```
-* Your jobs time-out after at most 24 hours. However, we ask everyone to **aim** for a maximum of 12 hours for most jobs.
+* Your jobs time-out after at most 48 hours. However, we ask everyone to **aim** for a maximum of 12 hours for most jobs.
* This can be controlled with `#SBATCH --time=12:00:00`
-* If you have evidence that you need to train for longer than 12 hours, be fair, and restrict your usage afterwards.
+* If you train for longer than 12 hours, make sure that you can argue why this was necessary.
* Use sharded data loading (as implemented in [TinyVoxcelebDataModule](../skeleton/data/tiny_voxceleb.py)), rather than individual file access, wherever you can, to prevent high i/o loads on the network file system.
-* Do not run any long-running foreground tasks on the `slurm22` head node.
-* The `slurm22` node should only be used to schedule SLURM jobs
-* An example of short-running foreground tasks with are OK to run on `slurm22`: manipulation of file-system with `rsync` or `cp`, using `git`, using `srun` or `sbatch`.
+* Do not run any long-running foreground tasks on the `cn84` head node.
+* The `cn84` node should only be used to schedule SLURM jobs
+* An example of short-running foreground tasks that are OK to run on `cn84`: manipulation of file-system with `nano`, `rsync` or `cp`, using `git`, using `tmux`, using `srun` or `sbatch`.
* Examples of tasks that should be submitted as a job: offline data augmentation, compiling a large software project.
* Whenever you're using the cluster, use your judgement to make sure that everyone can have access.
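As a reference for the array-job rule above, a minimal sketch of a job script that keeps the parallelism at 1 (the `python` command and its `--run-index` flag are hypothetical placeholders):

```bash
#!/usr/bin/env bash
#SBATCH --partition=csedu
#SBATCH --account=cseduimc030
#SBATCH --gres=gpu:1
#SBATCH --array=0-4%1   # five tasks in total, at most one running at a time

# SLURM_ARRAY_TASK_ID distinguishes the tasks, e.g. for a small hyper-parameter sweep
python your_training_script.py --run-index "${SLURM_ARRAY_TASK_ID}"
```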
5 changes: 4 additions & 1 deletion doc/skeleton.md
@@ -1,6 +1,7 @@
## The skeleton code for tiny-voxceleb

-We have provided some skeleton code to get you started training and evaluating on the tiny-voxceleb data.
+We have provided some skeleton code to get you started training and evaluating on the tiny-voxceleb data. Note that the instructions below are intended for running the code on
+your own machine. For running the code on the cluster, read [these instructions](cluster.md).

### setting up the environment variables

@@ -217,6 +218,8 @@ You can look at `experiments/experiment_1_local.sh` for an example script which

### Evaluating a network

**PLEASE NOTE** For running on the CSEDU compute cluster, you need to submit the commands below as batch jobs, see [the cluster documentation](./cluster.md) for further details. This section describes the working of the python command line evaluation script.

We have an evaluation script which you can use to compute score lists on the dev and eval test set:

```
1 change: 1 addition & 0 deletions doc/speaker-recognition.md
@@ -36,6 +36,7 @@ Apart from working on basic disciriminability of speakers, a lot of performance
- Neural embeddings from raw waveforms instead of MFCCs: [Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms](./papers/wav2spk.pdf), Weiwei Lin and Man-Wai Mak, Proc. Interspeech 2020, 3211-3215, doi: 10.21437/Interspeech.2020-1287
- Recent state-of-the-art model: [ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification](./papers/ecapa_tdnn.pdf), Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck, Proc. Interspeech 2020, 3830-3834, doi: 10.21437/Interspeech.2020-2650
- Recent comparison of loss functions: [In Defence of Metric Learning for Speaker Recognition](./papers/metric_learning.pdf), Joon Son Chung et al, Proc. Interspeech 2020, 2977-2981, doi: 10.21437/Interspeech.2020-1064
- Speaker recognition with small datasets: [Training speaker recognition systems with limited data](./papers/training_with_limited_data.pdf) Nik Vaessen and David A. van Leeuwen, 2022, Proc. Interspeech 2022, 4760-4764
- I-Vectors: the state of the art for many years, a form of embeddings avant-la-lettre: [Front-End Factor Analysis for Speaker Verification](./papers/najim-ivector-taslp-2009.pdf) Dehak, N. and Kenny, P. J. and Dehak, R. and Dumouchel, P. and Ouellet, P., [IEEE Trans. on Audio, Speech and Language Processing](http://ieeexplore.ieee.org/document/5545402), vol. 19, no. 4, pp. 788-798, May 2011, doi: 10.1109/TASL.2010.2064307.
- An introduction to calibration: [An Introduction to Application-Independent Evaluation of Speaker Recognition Systems](./papers/appindepeval-lnai-2007.pdf) David A. van Leeuwen and Niko Brümmer, 2007, In: Müller C. (eds) [Speaker Classification I.](https://link.springer.com/chapter/10.1007%2F978-3-540-74200-5_19) Lecture Notes in Computer Science, vol 4343. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74200-5_19
- An older overview: [A Tutorial on Text-Independent Speaker Verification](./papers/bimbot-overview.pdf), Bimbot, F., Bonastre, JF., Fredouille, C. et al. [EURASIP J. Adv. Signal Process. 2004](https://asp-eurasipjournals.springeropen.com/articles/10.1155/S1110865704310024), 101962 (2004).
2 changes: 2 additions & 0 deletions experiments/experiment_1_cluster.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash
#SBATCH --partition=csedu
#SBATCH --account=cseduimc030
#SBATCH --qos=csedu-normal
#SBATCH --gres=gpu:1
#SBATCH --mem=10G
#SBATCH --cpus-per-task=6
7 changes: 4 additions & 3 deletions requirements.txt
@@ -1,14 +1,15 @@
# libraries for the heavy-duty neural network stuff
-f https://download.pytorch.org/whl/torch_stable.html
-torch==1.13.1+cu116
-torchaudio==0.13.1+cu116
+torch==1.13.1+cu117
+torchaudio==0.13.1+cu117
torchdata==0.5.1
pytorch-lightning==1.9.0
torchmetrics==0.11.0
tensorboard==2.11.2
torchvision==0.14.1

# always useful in any data science project (but not required)
-numpy
+numpy==1.21.5
scikit-learn
matplotlib
jupyterlab
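For a local setup, these pinned requirements are typically installed inside a virtual environment; a minimal sketch (on the cluster, `scripts/prepare_cluster.sh` takes care of this instead):

```bash
# create and activate a project-local virtual environment, then install the pinned dependencies
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```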
2 changes: 1 addition & 1 deletion scripts/download_data.sh
@@ -16,7 +16,7 @@ mkdir -p "$DATA_FOLDER"

# rsync data from cluster to the local data folder
USERNAME=your_username
rsync -P "$SCIENCE_USERNAME"@slurm22.science.ru.nl:/ceph/csedu-scratch/course/IMC030_MLIP/data/data.zip "$DATA_FOLDER"/data.zip
rsync -P "$SCIENCE_USERNAME"@cn84.science.ru.nl:/ceph/csedu-scratch/course/IMC030_MLIP/data/data.zip "$DATA_FOLDER"/data.zip

# now you can unzip, by doing:
# unzip "$DATA_FOLDER"/data.zip $DATA_FOLDER
2 changes: 1 addition & 1 deletion scripts/download_remote_logs.sh
@@ -15,4 +15,4 @@ fi
mkdir -p "$DATA_FOLDER"

# rsync remote logs
rsync -azP "$SCIENCE_USERNAME"@slurm22.science.ru.nl:/ceph/csedu-scratch/course/IMC030_MLIP/users/"$SCIENCE_USERNAME"/ "$DATA_FOLDER"
rsync -azP "$SCIENCE_USERNAME"@cn84.science.ru.nl:/ceph/csedu-scratch/course/IMC030_MLIP/users/"$SCIENCE_USERNAME"/ "$DATA_FOLDER"