First, you need an account to start using Compute Canada.
To create one, register as follows:
- Agree to the policies and terms of use. (The last one, Consent to Access User Data, is optional; you may decline it.)
- Then select No in the next step:
- Enter your personal information:
- Then, enter your institution information:
Note: As a student, you must enter the CCRI of your sponsor or supervisor. They will be asked to confirm your role. Your sponsor can find their CCRI on their login page at https://ccdb.computecanada.ca/. A CCRI is an identifier of the form abc-123-01.
- Finally, choose a username and password:
Now, you can sign in.
- Sign in with your username and password.
- The Resources and Allocations can be found in My Account > My Resources and Allocations:
If you are not a student: Several computing servers (clusters) with different names will be listed, and their specifications can be found here. Choose one based on its technical specifications.
Next, connect to one of your available servers.
There are two ways to connect to a server:
- Traditionally, use SSH from the Windows command prompt or a Linux terminal. The host address (or remote host) is {name_of_the_server}.computecanada.ca, and the username and password are the same as for your account.
- Alternatively, use tools that make this process easier by providing a GUI and other features. Our choices are MobaXterm as the SSH client and WinSCP for file transfer (FileZilla is also a good choice).
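As a small illustration of the host address pattern above, here is a minimal Python sketch that builds the SSH command for a given cluster. The function name `ssh_command` and the values "cedar" and "smithj" are placeholders chosen for this example, not names taken from your account.

```python
def ssh_command(server, username):
    """Return the argv list for an SSH login to a cluster."""
    # Host pattern from the text: {name_of_the_server}.computecanada.ca
    host = f"{server}.computecanada.ca"
    return ["ssh", f"{username}@{host}"]


if __name__ == "__main__":
    # "cedar" and "smithj" are placeholder values for illustration only.
    print(" ".join(ssh_command("cedar", "smithj")))
```

Running the printed command in a terminal should prompt for your account password.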
Here are simple instructions for connecting to a server with MobaXterm:
- Open a session by clicking the Session button in the top-left corner:
- Click the SSH button.
- Enter your Remote Host and Username:
- Click OK.
- Enter your username and password in the terminal that opens, if requested.
Before deploying the project, you need to upload your project files to the server.
Most of the servers have a projects directory. Using MobaXterm, after connecting to the server, you can open it and upload your files with drag and drop or via the upload buttons.
We recommend using WinSCP, especially if your files are large. It is also helpful when you want to download the outputs.
Here are simple instructions for connecting to a server with WinSCP to transfer your files:
- After opening the app, click the New Session button, or click Session on the navigation bar and then select New Session. (You can also use the shortcut Ctrl+N.)
As you can see, your local directories are shown on the left side.
- Enter your Remote Host, Username, and Password, and then click Login:
- Once connected, the server's files and directories appear on the right side, so you can transfer files between your system and the server in either direction.
Now that you are connected to the server and ready to enter commands, it's time to deploy your project.
To make your project ready to run on the server of your choice after uploading it, you must create a virtual environment:
- First, choose a Python version to load, based on your project and the versions supported by Compute Canada. The list of supported versions is available here. You can also get it with this command:
[name@server ~]$ module avail python
- Load the desired Python version (e.g., 3.10):
[name@server ~]$ module load python/3.10
- Create a virtual environment using the Python module that you loaded (project_name_env is the name of the directory for your new environment):
[name@server ~]$ virtualenv --no-download {project_name_env}
- Activate the virtual environment:
[name@server ~]$ source {project_name_env}/bin/activate
- Upgrade pip in the environment:
[name@server ~]$ pip install --no-index --upgrade pip
- To exit the virtual environment, simply enter the command deactivate:
({project_name_env}) [name@server ~]$ deactivate
- You can now use the same virtual environment over and over again. Each time:
  - Load the same environment modules that you loaded when you created the virtual environment, e.g.:
    module load python scipy-stack
  - Activate the environment:
    source {project_name_env}/bin/activate
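To confirm from inside Python that the environment is really active before running your project, a quick check like the following can help. This is a minimal sketch: the helper name `in_virtualenv` is made up for this example, and it relies on the standard behaviour that `sys.prefix` differs from `sys.base_prefix` inside a virtual environment (older virtualenv releases set `sys.real_prefix` instead).

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases mark the environment with sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Inside an environment, sys.prefix points at the environment directory
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If the first line prints False, the activate step was probably skipped in the current shell.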
Note: You may experience frequent timeouts when uploading files or working with the clusters over the university's network. In my experience, the most reliable setup is to connect through the VPN (securelogin, external).
The steps above are required once to deploy your project; you do not need to repeat them each time you run your code. The following part shows how to run your code on the server as a job.
To submit your project as a job, specify the time limit and other resources you need in a bash script. For example:
#!/bin/bash
#SBATCH --time=20:00:00
#SBATCH --account=def-hfani
#SBATCH --gpus-per-node=2
#SBATCH --mem=64000M
#SBATCH --mail-user={your_email_address}
#SBATCH --mail-type=ALL
# Slurm reads #SBATCH lines only before the first command, with no space after '#'.

## set up the environment
module load python/3.8
module load scipy-stack
source {project_name_env}/bin/activate
pip install --no-index --upgrade pip

python main.py -arg0 'argv0'
Note: If you want to download packages from the internet, remove --no-index when installing with pip. The --no-index option tells pip not to use any package index, so only the packages provided by Compute Canada are used.
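IMPORTANT NOTE: Slurm only reads #SBATCH lines that are written without a space after # and that appear before the first executable command in the bash file; anything else is treated as an ordinary comment and ignored. The same options can also be entered on the sbatch command line, where they take precedence. Read the Submit Job section for more information.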
Refer here for detailed explanations and examples.
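Because Slurm silently ignores malformed directives, it can be useful to check a job script before submitting it. The following is a minimal sketch, not an official tool: the function name `check_job_script` is hypothetical, and it only looks for the two problems that are easiest to miss, namely a space between `#` and `SBATCH`, and `#SBATCH` lines placed after the first command.

```python
import re

# A comment such as "# SBATCH --time=..." is not parsed as a directive.
SPACED_DIRECTIVE = re.compile(r"^#\s+SBATCH\b")


def check_job_script(text):
    """Return warnings about #SBATCH lines that Slurm would ignore."""
    warnings = []
    seen_command = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#SBATCH"):
            # Directives are only read before the first executable command.
            if seen_command:
                warnings.append(f"line {lineno}: directive after first command")
        elif line.startswith("#"):
            if SPACED_DIRECTIVE.match(line):
                warnings.append(f"line {lineno}: space between '#' and 'SBATCH'")
        else:
            seen_command = True
    return warnings


if __name__ == "__main__":
    script = "#!/bin/bash\nmodule load python/3.8\n# SBATCH --time=20:00:00\n"
    for message in check_job_script(script):
        print(message)
```

An empty result does not prove the script is correct; it only means these two specific mistakes were not found.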
To request the resources and schedule the job for execution:
[name@server ~]$ sbatch computecanada.sh
As mentioned, resource options can also be entered on the sbatch command line, where they take precedence over the values in the bash file.
For example, to run a job with 40 CPUs, a 20-hour time limit, and 64 GB of RAM, and to receive email notifications at your address, use this command (assuming your bash file is named computecanada.sh):
[name@server ~]$ sbatch --time=20:00:00 --mem=64000M --mail-user={your_email_address} --mail-type=ALL --cpus-per-task=40 computecanada.sh
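If you submit jobs with varying resources, assembling the command in a small helper avoids typos in the flags. Here is a minimal sketch under that assumption: `build_sbatch_command` is a name invented for this example, and it simply turns keyword arguments into the long sbatch options used in the command above without validating them.

```python
import shlex


def build_sbatch_command(script, **options):
    """Build an sbatch argv list from keyword options.

    Keyword names use underscores and become long flags,
    e.g. cpus_per_task=40 becomes --cpus-per-task=40.
    """
    argv = ["sbatch"]
    for name, value in options.items():
        argv.append(f"--{name.replace('_', '-')}={value}")
    argv.append(script)
    return argv


if __name__ == "__main__":
    cmd = build_sbatch_command(
        "computecanada.sh",
        time="20:00:00",
        mem="64000M",
        mail_type="ALL",
        cpus_per_task=40,
    )
    # Print the command so it can be reviewed before running it on the server.
    print(shlex.join(cmd))
```

On a login node, the resulting list could be passed to `subprocess.run(cmd)` to submit the job.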
If you want to use GPUs instead, add --gpus-per-node=2 (for 2 GPUs).
You should first check whether a GPU is available. For example, if you use torch, you can call torch.cuda.is_available() as a manual check.
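The manual check can be wrapped so that the same script runs on both GPU and CPU nodes. This is a minimal sketch, assuming PyTorch may or may not be installed in your virtual environment; the helper name `pick_device` is made up for this example.

```python
try:
    import torch  # only available if installed in the virtual environment
except ImportError:
    torch = None


def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    device = pick_device()
    print("running on:", device)
    if device == "cuda":
        # Should match the number requested with --gpus-per-node.
        print("visible GPUs:", torch.cuda.device_count())
```

Printing the device at the start of a job makes it easy to spot in the output file when a GPU request did not take effect.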
We will add our experience of running projects on GPUs on Compute Canada here soon.
Your project is now running on the server as a job. Next, you will want to know the status of the job and see its output and results, so you need to know how to monitor progress.
After submitting a job, you can monitor the progress and results with different tools.
The general command for checking the status of Slurm jobs is squeue, but by default it supplies information about all jobs in the system, not just your own. You can use the shorter sq to list only your own jobs:
$ sq
JOBID   USER    ACCOUNT     NAME      ST  TIME_LEFT   NODES  CPUS  GRES    MIN_MEM  NODELIST (REASON)
123456  smithj  def-smithj  simple_j  R   0:03        1      1     (null)  4G       cdr234 (None)
123457  smithj  def-smithj  bigger_j  PD  2-00:00:00  1      16    (null)  16G      (Priority)
The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" and "R" for "running".
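If you want to check job states from a script, the sq output can be parsed by column name. Here is a minimal sketch that works on the sample output above; the function name `job_states` is hypothetical, and it assumes whitespace-separated columns with no spaces inside the job name.

```python
SAMPLE = """\
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS GRES MIN_MEM NODELIST (REASON)
123456 smithj def-smithj simple_j R 0:03 1 1 (null) 4G cdr234 (None)
123457 smithj def-smithj bigger_j PD 2-00:00:00 1 16 (null) 16G (Priority)
"""


def job_states(sq_output):
    """Map each JOBID to its ST value from sq-style output."""
    lines = [ln for ln in sq_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    # Locate the columns by header name rather than by fixed position.
    header = lines[0].split()
    jobid_col = header.index("JOBID")
    state_col = header.index("ST")
    states = {}
    for row in lines[1:]:
        fields = row.split()
        states[fields[jobid_col]] = fields[state_col]
    return states


if __name__ == "__main__":
    for job_id, state in job_states(SAMPLE).items():
        print(job_id, "running" if state == "R" else state)
```

On the server, the text could come from `subprocess.run(["sq"], capture_output=True, text=True).stdout` instead of the sample string.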
Compute Canada servers run Linux, where Python's multiprocessing uses the fork start method by default. This is not a problem if you are running independent functions. But when passing a partial for computation, you need to enable the spawn method. It's best practice to read the number of CPUs from the environment variables set on the compute nodes rather than specifying it manually.
Example Python code that calls a search method for retrieval computation:
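import multiprocessing as mp
import os
from functools import partial
from multiprocessing import freeze_support, get_context

# Number of CPUs Slurm allocated to the task; fall back to 1 outside a job.
ncpus = int(os.environ.get('SLURM_CPUS_PER_TASK', 1))
...
if __name__ == '__main__':
    freeze_support()
    # Use 'spawn' instead of the Linux default 'fork'.
    mp.set_start_method('spawn')
    with get_context('spawn').Pool(ncpus) as p:
        p.starmap(partial(search, ranker=ranker, topk=topk, batch=settings['batch']), file_changes)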
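Reading the CPU count deserves a little care, because the variable may be missing when the script runs outside a Slurm job or without --cpus-per-task. Here is a minimal sketch of a more defensive lookup; the helper name `allocated_cpus` and its fallback behaviour are assumptions for this example.

```python
import os


def allocated_cpus(environ=None, default=1):
    """Return the CPU count Slurm granted to this task, or `default`."""
    env = os.environ if environ is None else environ
    value = env.get("SLURM_CPUS_PER_TASK")
    try:
        # Never return less than one worker.
        return max(1, int(value))
    except (TypeError, ValueError):
        # Variable missing (e.g. on a login node) or not a number.
        return default


if __name__ == "__main__":
    print("worker processes:", allocated_cpus())
```

The result can be passed directly as the pool size, so the job uses exactly the CPUs it requested.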
By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out", e.g. slurm-123456.out, in the directory from which the job was submitted. Having the job ID as part of the file name is convenient for troubleshooting. A different name or location can be specified if your workflow requires it by using the --output directive.
Slurm output files are similar to log files; you can examine them to find issues or track the output of your program.
Helpful command: You can use the tail command to read the last part of a log file. For example, to read the last 50 lines of slurm-123.out and keep following new output (-f), use the following command:
[name@server ~]$ tail -f -n 50 slurm-123.out
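The same check can be done from Python, for example inside a script that summarizes several finished jobs. This is a minimal sketch: the helper names `slurm_output_path` and `last_lines` are made up for this example, and the file name follows the default slurm-<jobid>.out pattern described above.

```python
from collections import deque
from pathlib import Path


def slurm_output_path(job_id, directory="."):
    """Default Slurm output file: slurm-<jobid>.out in the submit directory."""
    return Path(directory) / f"slurm-{job_id}.out"


def last_lines(path, count=50):
    """Return the last `count` lines of a text file as a list of strings."""
    with open(path, "r", errors="replace") as handle:
        # deque with maxlen keeps only the most recent lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    log = slurm_output_path(123)  # job ID 123 matches the tail example
    if log.exists():
        print("\n".join(last_lines(log, 50)))
    else:
        print("no output file yet:", log)
```

Unlike tail -f, this reads the file once and does not follow new output.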