diff --git a/doc/scheduler-directives.tex b/doc/scheduler-directives.tex new file mode 100644 index 0000000..4665a38 --- /dev/null +++ b/doc/scheduler-directives.tex @@ -0,0 +1,51 @@ +% ------------------------------------------------------------------------------ +\subsubsection{Directives} + +Directives are comments included at the beginning of a job script that set the shell +and the options for the job scheduler. + +The shebang directive is always the first line of a script. In your job script, +this directive sets which shell your script's commands will run in. On ``Speed'', +we recommend that your script use a shell from the \texttt{/encs/bin} directory. + +To use the \texttt{tcsh} shell, start your script with: \verb|#!/encs/bin/tcsh| + +For \texttt{bash}, start with: \verb|#!/encs/bin/bash| + +Directives that start with \verb|"#$"|, set the options for the cluster's +``Altair Grid Engine (AGE)'' scheduler. The script template, \file{template.sh}, +provides the essentials: + +\begin{verbatim} +#$ -N +#$ -cwd +#$ -m bea +#$ -pe smp +#$ -l h_vmem=G +\end{verbatim} + +Replace, \verb++, with the name that you want your cluster job to have; +\option{-cwd}, makes the current working directory the ``job working directory'', +and your standard output file will appear here; \option{-m bea}, provides e-mail +notifications (begin/end/abort); replace, \verb++, with the degree of +(multithreaded) parallelism (i.e., cores) you attach to your job (up to 32), +be sure to delete or comment out the \verb| #$ -pe smp | parameter if it +is not relevant; replace, \verb++, with the value (in GB), that you want +your job's memory space to be (up to 500), and all jobs MUST have a memory-space +assignment. + +If you are unsure about memory footprints, err on assigning a generous +memory space to your job so that it does not get prematurely terminated +(the value given to \api{h\_vmem} is a hard memory ceiling). You can refine +\api{h\_vmem} values for future jobs by monitoring the size of a job's active +memory space on \texttt{speed-submit} with: + +\begin{verbatim} +qstat -j | grep maxvmem +\end{verbatim} + +Memory-footprint values are also provided for completed jobs in the final +e-mail notification (as, ``Max vmem''). + +\emph{Jobs that request a low-memory footprint are more likely to load on a busy +cluster.} diff --git a/doc/scheduler-env.tex b/doc/scheduler-env.tex new file mode 100644 index 0000000..2707e55 --- /dev/null +++ b/doc/scheduler-env.tex @@ -0,0 +1,88 @@ +% ------------------------------------------------------------------------------ +\subsubsection{Environment Set Up} +\label{sect:envsetup} + +After creating an SSH connection to ``Speed'', you will need to source +the ``Altair Grid Engine (AGE)'' scheduler's settings file. +Sourcing the settings file will set the environment variables required to +execute scheduler commands. + +Based on the UNIX shell type, choose one of the following commands to source +the settings file. + +csh/\tool{tcsh}: +\begin{verbatim} +source /local/pkg/uge-8.6.3/root/default/common/settings.csh +\end{verbatim} + +Bourne shell/\tool{bash}: +\begin{verbatim} +. /local/pkg/uge-8.6.3/root/default/common/settings.sh +\end{verbatim} + +In order to set up the default ENCS bash shell, executing the following command +is also required: +\begin{verbatim} +printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile +\end{verbatim} + +To verify that you have access to the scheduler commands execute +\texttt{qstat -f -u "*"}. 
If an error is returned, attempt sourcing
+the settings file again.
+
+The next step is to copy a job template to your home directory and to set up your
+cluster-specific storage. Execute the following command from within your
+home directory. (To move to your home directory, type \texttt{cd} at the Linux
+prompt and press \texttt{Enter}.)
+
+\begin{verbatim}
+cp /home/n/nul-uge/template.sh . && mkdir /speed-scratch/$USER
+\end{verbatim}
+
+\textbf{Tip:} Add the source command to your shell-startup script.
+
+\textbf{Tip:} The default shell for GCS ENCS users is \tool{tcsh}.
+If you would like to use \tool{bash}, please contact
+\texttt{rt-ex-hpc AT encs.concordia.ca}.
+
+For \textbf{new ENCS users}, and/or those who do not have a shell-startup script,
+use one of the following commands, based on your shell type, to copy a startup script
+from \texttt{nul-uge}'s home directory to your home directory. (To move to your home
+directory, type \tool{cd} at the Linux prompt and press \texttt{Enter}.)
+
+csh/\tool{tcsh}:
+\begin{verbatim}
+cp /home/n/nul-uge/.tcshrc .
+\end{verbatim}
+
+Bourne shell/\tool{bash}:
+\begin{verbatim}
+cp /home/n/nul-uge/.bashrc .
+\end{verbatim}
+
+If you already have a shell-startup script, use a text editor, such as
+\tool{vim} or \tool{emacs}, to add the source request to your existing
+shell-startup file (i.e., to the \file{.tcshrc} file in your home directory).
+
+csh/\tool{tcsh}:
+Sample \file{.tcshrc} file:
+\begin{verbatim}
+# Speed environment set up
+if ($HOSTNAME == speed-submit.encs.concordia.ca) then
+   source /local/pkg/uge-8.6.3/root/default/common/settings.csh
+endif
+\end{verbatim}
+
+Bourne shell/\tool{bash}:
+Sample \file{.bashrc} file:
+\begin{verbatim}
+# Speed environment set up
+if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
+    . /local/pkg/uge-8.6.3/root/default/common/settings.sh
+    printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
+fi
+\end{verbatim}
+
+Note that you will need to either log out and back in, or execute a new shell,
+for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to be applied
+(\textbf{important}).
diff --git a/doc/scheduler-faq.tex b/doc/scheduler-faq.tex
new file mode 100644
index 0000000..3d0eed5
--- /dev/null
+++ b/doc/scheduler-faq.tex
@@ -0,0 +1,203 @@
+% ------------------------------------------------------------------------------
+\section{Frequently Asked Questions}
+\label{sect:faqs}
+
+% ------------------------------------------------------------------------------
+\subsection{Where do I learn about Linux?}
+
+All Speed users are expected to have a basic understanding of Linux and its commonly used commands.
+
+% ------------------------------------------------------------------------------
+\subsubsection*{Software Carpentry}
+
+Software Carpentry provides free resources to learn software, including a workshop on the Unix shell.
+\url{https://software-carpentry.org/lessons/}
+
+% ------------------------------------------------------------------------------
+\subsubsection*{Udemy}
+
+There are a number of Udemy courses, including free ones, that will assist
+you in learning Linux. Active Concordia faculty, staff and students have
+access to Udemy; the course \textbf{Linux Mastery: Master the Linux
+Command Line in 11.5 Hours} is a good starting point for beginners. Visit
+\url{https://www.concordia.ca/it/services/udemy.html} to learn how Concordians
+may access Udemy.
+
+% ------------------------------------------------------------------------------
+\subsection{How to use the ``bash shell'' on Speed?}
+
+This section describes how to use the ``bash shell'' on Speed. Review
+\xs{sect:envsetup} to ensure that your bash environment is set up.
+
+% ------------------------------------------------------------------------------
+\subsubsection{How do I set bash as my login shell?}
+
+To set bash as your login shell on Speed, your login shell on all GCS servers
+must be changed to bash. To make this change, create a ticket with the Service
+Desk (or email help at concordia.ca) to request that bash become the default
+login shell for your ENCS user account on all GCS servers.
+
+% ------------------------------------------------------------------------------
+\subsubsection{How do I move into a bash shell on Speed?}
+
+To move to the bash shell, type \textbf{bash} at the command prompt.
+For example:
+\begin{verbatim}
+ [speed-submit] [/home/a/a_user] > bash
+ bash-4.4$ echo $0
+ bash
+\end{verbatim}
+
+Note how the command prompt changed from \verb![speed-submit] [/home/a/a_user] >!
+to \verb!bash-4.4$! after entering the bash shell.
+
+% ------------------------------------------------------------------------------
+\subsubsection{How do I run scripts written in bash on Speed?}
+
+To execute bash scripts on Speed:
+\begin{enumerate}
+    \item
+Ensure that the shebang of your bash job script is \verb+#!/encs/bin/bash+.
+    \item
+Use the \tool{qsub} command to submit your job script to the scheduler.
+\end{enumerate}
+
+The Speed GitHub contains a sample \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{bash job script}.
+
+% ------------------------------------------------------------------------------
+\subsection{How to resolve ``Disk quota exceeded'' errors?}
+
+% ------------------------------------------------------------------------------
+\subsubsection{Probable Cause}
+
+The \texttt{``Disk quota exceeded''} error occurs when your application has run
+out of disk space to write to. On Speed, this error can be returned when:
+\begin{enumerate}
+    \item
+The \texttt{/tmp} directory on the speed node your application is running on is full and cannot be written to.
+    \item
+Your NFS-provided home is full and cannot be written to.
+\end{enumerate}
+
+% ------------------------------------------------------------------------------
+\subsubsection{Possible Solutions}
+
+\begin{enumerate}
+    \item
+Use the \textbf{-cwd} job script option to make the directory from which the job
+script was submitted the \texttt{job working directory}. The
+\texttt{job working directory} is the directory in which the job writes its output files.
+    \item
+The use of local disk space is generally recommended for I/O-intensive operations.
+However, as the size of \texttt{/tmp} on speed nodes is \texttt{1GB}, it can be
+necessary for scripts to store temporary data elsewhere.
+Review the documentation for each module called within your script to
+determine how to set working directories for that application.
+The basic steps for this solution are:
+\begin{itemize}
+    \item
+    Review the documentation on how to set working directories for
+    each module called by the job script.
+    \item
+    Create a working directory in speed-scratch for output files.
+    For example, this command will create a subdirectory called \textbf{output}
+    in your \verb!speed-scratch! directory:
+    \begin{verbatim}
+        mkdir -m 750 /speed-scratch/$USER/output
+    \end{verbatim}
+    \item
+    To create a subdirectory for recovery files:
+    \begin{verbatim}
+        mkdir -m 750 /speed-scratch/$USER/recovery
+    \end{verbatim}
+    \item
+    Update the job script to write output to the subdirectories you created in
+    your \verb!speed-scratch! directory, e.g., \verb!/speed-scratch/$USER/output!.
+    \end{itemize}
+\end{enumerate}
+In the above example, \verb!$USER! is an environment variable containing your ENCS username.
+
+% ------------------------------------------------------------------------------
+\subsubsection{Example of setting working directories for \tool{COMSOL}}
+
+\begin{itemize}
+    \item
+    Create directories for recovery, temporary, and configuration files.
+    For example, to create these directories for your GCS ENCS user account:
+    \begin{verbatim}
+        mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
+    \end{verbatim}
+    \item
+    Add the following command switches to the COMSOL command to use the
+    directories created above:
+    \begin{verbatim}
+        -recoverydir /speed-scratch/$USER/comsol/recovery
+        -tmpdir /speed-scratch/$USER/comsol/tmp
+        -configuration /speed-scratch/$USER/comsol/config
+    \end{verbatim}
+\end{itemize}
+In the above example, \verb!$USER! is an environment variable containing your ENCS username.
+
+% ------------------------------------------------------------------------------
+\subsubsection{Example of setting working directories for \tool{Python Modules}}
+
+By default, when adding a Python module, the \texttt{/tmp} directory is set as
+the temporary repository for file downloads.
+The size of the \texttt{/tmp} directory on \verb!speed-submit! is too small for PyTorch.
+To add a Python module:
+\begin{itemize}
+    \item
+    Create your own tmp directory in your \verb!speed-scratch! directory:
+    \begin{verbatim}
+       mkdir /speed-scratch/$USER/tmp
+    \end{verbatim}
+    \item
+    Use the tmp directory you created:
+    \begin{verbatim}
+        setenv TMPDIR /speed-scratch/$USER/tmp
+    \end{verbatim}
+    \item
+    Attempt the installation of PyTorch.
+\end{itemize}
+
+In the above example, \verb!$USER! is an environment variable containing your ENCS username.
+
+% ------------------------------------------------------------------------------
+\subsection{How do I check my job's status?}
+
+When a job with a job ID of 1234 is running, the status of that job can be tracked using \verb!`qstat -j 1234`!.
+Likewise, if the job is pending, the \verb!`qstat -j 1234`! command will report why the job is not scheduled or running.
+Once the job has finished, or has been killed, the \textbf{qacct} command must be used to query the job's status, e.g., \verb!`qacct -j [jobid]`!.
+
+% ------------------------------------------------------------------------------
+\subsection{Why is my job pending when nodes are empty?}
+
+% ------------------------------------------------------------------------------
+\subsubsection{Disabled nodes}
+
+It is possible that one or more of the Speed nodes are disabled. Nodes are disabled if they require maintenance.
+To verify whether Speed nodes are disabled, request the current list of disabled nodes from \tool{qstat}:
+
+\begin{verbatim}
+qstat -f -qs d
+queuename                      qtype resv/used/tot.
load_avg arch states +--------------------------------------------------------------------------------- +g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.27 lx-amd64 d +--------------------------------------------------------------------------------- +s.q@speed-07.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d +--------------------------------------------------------------------------------- +s.q@speed-10.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d +--------------------------------------------------------------------------------- +s.q@speed-16.encs.concordia.ca BIP 0/0/32 0.02 lx-amd64 d +--------------------------------------------------------------------------------- +s.q@speed-19.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 d +--------------------------------------------------------------------------------- +s.q@speed-24.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d +--------------------------------------------------------------------------------- +s.q@speed-36.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 d +\end{verbatim} + +Note how the all of the Speed nodes in the above list have a state of \textbf{d}, or disabled. + +Your job will run once the maintenance has been completed and the disabled nodes have been enabled. + +% ------------------------------------------------------------------------------ +\subsubsection{Error in job submit request.} + +It is possible that your job is pending, because the job requested resources that are not available within Speed. +To verify why pending job with job id 1234 is not running, execute \verb!`qstat -j 1234`! +and review the messages in the \textbf{scheduling info:} section. diff --git a/doc/scheduler-job-examples.tex b/doc/scheduler-job-examples.tex new file mode 100644 index 0000000..3c27922 --- /dev/null +++ b/doc/scheduler-job-examples.tex @@ -0,0 +1,218 @@ +% ------------------------------------------------------------------------------ +\subsection{Example Job Script: Fluent} + +\begin{figure}[htpb] + \lstinputlisting[language=csh,frame=single,basicstyle=\footnotesize\ttfamily]{fluent.sh} + \caption{Source code for \file{fluent.sh}} + \label{fig:fluent.sh} +\end{figure} + +The job script in \xf{fig:fluent.sh} runs Fluent in parallel over 32 cores. +Of note, we have requested e-mail notifications (\texttt{-m}), are defining the +parallel environment for, \tool{fluent}, with, \texttt{-sgepe smp} (\textbf{very +important}), and are setting \api{\$TMPDIR} as the in-job location for the +``moment'' \file{rfile.out} file (in-job, because the last line of the script +copies everything from \api{\$TMPDIR} to a directory in the user's NFS-mounted home). +Job progress can be monitored by examining the standard-out file (e.g., +\file{flu10000.o249}), and/or by examining the ``moment'' file in +\texttt{/disk/nobackup/} (hint: it starts with your job-ID) on the node running +the job. \textbf{Caveat:} take care with journal-file file paths. + +% ------------------------------------------------------------------------------ +\subsection{Example Job: efficientdet} + +The following steps describing how to create an efficientdet environment on +\emph{Speed}, were submitted by a member of Dr. Amer's research group. + +\begin{itemize} + \item + Enter your ENCS user account's speed-scratch directory + \verb!cd /speed-scratch/! + \item + load python \verb!module load python/3.8.3! + create virtual environment \verb!python3 -m venv ! + activate virtual environment \verb!source /bin/activate.csh! 
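+  Put together, and assuming a hypothetical environment name of
+  \texttt{efficientdet-env} under your speed-scratch space, the steps above
+  would look roughly as follows (a sketch only; adjust paths and names to your own setup):
+\begin{verbatim}
+cd /speed-scratch/$USER
+module load python/3.8.3
+python3 -m venv efficientdet-env
+source efficientdet-env/bin/activate.csh
+\end{verbatim}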
+ install DL packages for Efficientdet +\end{itemize} +\begin{verbatim} +pip install tensorflow==2.7.0 +pip install lxml>=4.6.1 +pip install absl-py>=0.10.0 +pip install matplotlib>=3.0.3 +pip install numpy>=1.19.4 +pip install Pillow>=6.0.0 +pip install PyYAML>=5.1 +pip install six>=1.15.0 +pip install tensorflow-addons>=0.12 +pip install tensorflow-hub>=0.11 +pip install neural-structured-learning>=1.3.1 +pip install tensorflow-model-optimization>=0.5 +pip install Cython>=0.29.13 +pip install git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI +\end{verbatim} + +% ------------------------------------------------------------------------------ +\subsection{Java Jobs} + +Jobs that call \tool{java} have a memory overhead, which needs to be taken +into account when assigning a value to \api{h\_vmem}. Even the most basic +\tool{java} call, \texttt{java -Xmx1G -version}, will need to have, +\texttt{-l h\_vmem=5G}, with the 4-GB difference representing the memory overhead. +Note that this memory overhead grows proportionally with the value of +\texttt{-Xmx}. To give you an idea, when \texttt{-Xmx} has a value of 100G, +\api{h\_vmem} has to be at least 106G; for 200G, at least 211G; for 300G, at least 314G. + +% TODO: add a MARF Java job + +% ------------------------------------------------------------------------------ +\subsection{Scheduling On The GPU Nodes} + +The primary cluster has two GPU nodes, each with six Tesla (CUDA-compatible) P6 +cards: each card has 2048 cores and 16GB of RAM. Though note that the P6 +is mainly a single-precision card, so unless you need the GPU double +precision, double-precision calculations will be faster on a CPU node. + +Job scripts for the GPU queue differ in that they do not need these +statements: + +\begin{verbatim} +#$ -pe smp +#$ -l h_vmem=G +\end{verbatim} + +But do need this statement, which attaches either a single GPU, or, two +GPUs, to the job: + +\begin{verbatim} +#$ -l gpu=[1|2] +\end{verbatim} + +Single-GPU jobs are granted 5~CPU cores and 80GB of system memory, and +dual-GPU jobs are granted 10~CPU cores and 160GB of system memory. A +total of \emph{four} GPUs can be actively attached to any one user at any given +time. + +Once that your job script is ready, you can submit it to the GPU queue +with: + +\begin{verbatim} +qsub -q g.q ./.sh +\end{verbatim} + +And you can query \tool{nvidia-smi} on the node that is running your job with: + +\begin{verbatim} +ssh @speed[-05|-17] nvidia-smi +\end{verbatim} + +Status of the GPU queue can be queried with: + +\begin{verbatim} +qstat -f -u "*" -q g.q +\end{verbatim} + +\textbf{Very important note} regarding TensorFlow and PyTorch: +if you are planning to run TensorFlow and/or PyTorch multi-GPU jobs, +do not use the \api{tf.distribute} and/or\\ +\api{torch.nn.DataParallel} +functions, as they will crash the compute node (100\% certainty). +This appears to be the current hardware's architecture's defect. +% +The workaround is to either +% TODO: Need to link to that example +manually effect GPU parallelisation (TensorFlow has an example on how to +do this), or to run on a single GPU. + +\vspace{10pt} +\noindent +\textbf{Important} +\vspace{10pt} + +Users without permission to use the GPU nodes can submit jobs to the \texttt{g.q} +queue but those jobs will hang and never run. + +There are two GPUs in both \texttt{speed-05} and \texttt{speed-17}, and one +in \texttt{speed-19}. 
Their availability is seen with, \texttt{qstat -F g} +(note the capital): + +\small +\begin{verbatim} +queuename qtype resv/used/tot. load_avg arch states +--------------------------------------------------------------------------------- +... +--------------------------------------------------------------------------------- +g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.04 lx-amd64 + hc:gpu=6 +--------------------------------------------------------------------------------- +g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 + hc:gpu=6 +--------------------------------------------------------------------------------- +... +--------------------------------------------------------------------------------- +s.q@speed-19.encs.concordia.ca BIP 0/32/32 32.37 lx-amd64 + hc:gpu=1 +--------------------------------------------------------------------------------- +etc. +\end{verbatim} +\normalsize + +This status demonstrates that all five are available (i.e., have not been +requested as resources). To specifically request a GPU node, add, +\texttt{-l g=[\#GPUs]}, to your \tool{qsub} (statement/script) or +\tool{qlogin} (statement) request. For example, +\texttt{qsub -l h\_vmem=1G -l g=1 ./count.sh}. You +will see that this job has been assigned to one of the GPU nodes: + +\small +\begin{verbatim} +queuename qtype resv/used/tot. load_avg arch states +--------------------------------------------------------------------------------- +g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 +--------------------------------------------------------------------------------- +g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 +--------------------------------------------------------------------------------- +s.q@speed-19.encs.concordia.ca BIP 0/1/32 0.04 lx-amd64 hc:gpu=0 (haff=1.000000) + 538 100.00000 count.sh sbunnell r 03/07/2019 02:39:39 1 +--------------------------------------------------------------------------------- +etc. +\end{verbatim} +\normalsize + +And that there are no more GPUs available on that node (\texttt{hc:gpu=0}). Note +that no more than two GPUs can be requested for any one job. + +% ------------------------------------------------------------------------------ +\subsubsection{CUDA} + +When calling \tool{CUDA} within job scripts, it is important to create a link to +the desired \tool{CUDA} libraries and set the runtime link path to the same libraries. +For example, to use the \texttt{cuda-11.5} libraries, specify the following in +your Makefile. + +\begin{verbatim} +-L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64 +\end{verbatim} + +In your job script, specify the version of \texttt{gcc} to use prior to calling +cuda. For example: + \texttt{module load gcc/8.4} +or + \texttt{module load gcc/9.3} + +% ------------------------------------------------------------------------------ +\subsubsection{Special Notes for sending CUDA jobs to the GPU Queue} + +It is not possible to create a \texttt{qlogin} session on to a node in the +\textbf{GPU Queue} (\texttt{g.q}). As direct logins to these nodes is not +available, jobs must be submitted to the \textbf{GPU Queue} in order to compile +and link. + +We have several versions of CUDA installed in: +\begin{verbatim} +/encs/pkg/cuda-11.5/root/ +/encs/pkg/cuda-10.2/root/ +/encs/pkg/cuda-9.2/root +\end{verbatim} + +For CUDA to compile properly for the GPU queue, edit your Makefile +replacing \option{\/usr\/local\/cuda} with one of the above. 
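+
+For illustration, below is a minimal sketch of a compile-and-link job for the
+GPU queue; the job name and the \texttt{make} invocation are placeholders for
+your own project's build step:
+
+\begin{verbatim}
+#!/encs/bin/tcsh
+
+#$ -N cuda-build
+#$ -cwd
+#$ -l gpu=1
+
+module load gcc/8.4
+make
+\end{verbatim}
+
+Such a script would be submitted to the GPU queue with
+\texttt{qsub -q g.q ./cuda-build.sh}, as described above.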
diff --git a/doc/scheduler-scripting.tex b/doc/scheduler-scripting.tex
new file mode 100644
index 0000000..df086d5
--- /dev/null
+++ b/doc/scheduler-scripting.tex
@@ -0,0 +1,318 @@
+% ------------------------------------------------------------------------------
+\subsubsection{User Scripting}
+
+The last part of the job script is the scripting that will be executed by the job.
+This part of the job script includes all commands required to set up and
+execute the task your script has been written to do. Any Linux command can be used
+at this step. This section can be a simple call to an executable or a complex
+loop which iterates through a series of commands.
+
+Every software program has a unique execution framework. It is the responsibility
+of the script's author (e.g., you) to know what is required for the software used
+in your script by reviewing the software's documentation. Regardless of which software
+your script calls, your script should be written so that the software knows the
+location of the input and output files as well as the degree of parallelism.
+Note that the cluster-specific environment variable, \api{NSLOTS}, resolves
+to the value provided to the scheduler in the \option{-pe smp} option.
+
+Jobs which touch data-input and data-output files more than once should make use
+of \api{TMPDIR}, a scheduler-provided working space almost 1~TB in size.
+\api{TMPDIR} is created when a job starts, and exists on the local disk of the
+compute node executing your job. Using \api{TMPDIR} results in faster I/O operations
+than those to and from shared storage (which is provided over NFS).
+
+A sample job script using \api{TMPDIR} is available at \texttt{/home/n/nul-uge/templateTMPDIR.sh}:
+the job is instructed to change to \api{\$TMPDIR}, to make the new directory \texttt{input}, to copy data from
+\texttt{\$SGE\_O\_WORKDIR/references/} to \texttt{input/} (\texttt{\$SGE\_O\_WORKDIR} represents the
+current working directory), to make the new directory \texttt{results}, to
+execute the program (which takes input from \texttt{\$TMPDIR/input/} and writes
+output to \texttt{\$TMPDIR/results/}), and finally to copy the total end results
+to an existing directory, \texttt{processed}, that is located in the current
+working directory. \api{TMPDIR} only exists for the duration of the job, though,
+so it is very important to copy relevant results from it at job's end.
+
+% ------------------------------------------------------------------------------
+\subsection{Sample Job Script}
+
+Now, let's look at a basic job script, \file{tcsh.sh} in \xf{fig:tcsh.sh}
+(you can copy it from our GitHub page or from \texttt{/home/n/nul-uge}).
+
+\begin{figure}[htpb]
+    \lstinputlisting[language=csh,frame=single,basicstyle=\ttfamily]{tcsh.sh}
+    \caption{Source code for \file{tcsh.sh}}
+    \label{fig:tcsh.sh}
+\end{figure}
+
+The first line is the shell declaration (also known as a shebang) and sets the shell to \emph{tcsh}.
+The lines that begin with \texttt{\#\$} are directives for the scheduler.
+
+\begin{itemize}
+    \item \texttt{-N} sets \emph{qsub-test} as the job name
+    \item \texttt{-cwd} tells the scheduler to execute the job from the current working directory
+    \item \texttt{-l h\_vmem=1GB} requests and assigns 1GB of memory to the job. CPU jobs \emph{require} the \texttt{-l h\_vmem} option to be set.
+\end{itemize} + +The script then: + +\begin{itemize} + \item Sleeps on a node for 30 seconds + \item Uses the \tool{module} command to load the \texttt{gurobi/8.1.0} environment + \item Prints the list of loaded modules into a file +\end{itemize} + +The scheduler command, \tool{qsub}, is used to submit (non-interactive) jobs. +From an ssh session on speed-submit, submit this job with \texttt{qsub ./tcsh.sh}. You will see, +\texttt{"Your job X ("qsub-test") has been submitted"}. The command, \tool{qstat}, can be used +to look at the status of the cluster: \texttt{qstat -f -u "*"}. You will see +something like this: + +\small +\begin{verbatim} +queuename qtype resv/used/tot. load_avg arch states +--------------------------------------------------------------------------------- +a.q@speed-01.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +--------------------------------------------------------------------------------- +a.q@speed-03.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +--------------------------------------------------------------------------------- +a.q@speed-25.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +--------------------------------------------------------------------------------- +a.q@speed-27.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +--------------------------------------------------------------------------------- +g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.02 lx-amd64 + 144 100.00000 qsub-test nul-uge r 12/03/2018 16:39:30 1 + 62624 0.09843 case_talle x_yzabc r 11/09/2021 16:50:09 32 +--------------------------------------------------------------------------------- +g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +--------------------------------------------------------------------------------- +s.q@speed-07.encs.concordia.ca BIP 0/0/32 0.04 lx-amd64 +--------------------------------------------------------------------------------- +s.q@speed-08.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +--------------------------------------------------------------------------------- +s.q@speed-09.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +--------------------------------------------------------------------------------- +s.q@speed-10.encs.concordia.ca BIP 0/32/32 32.72 lx-amd64 + 62624 0.09843 case_talle x_yzabc r 11/09/2021 16:50:09 32 +--------------------------------------------------------------------------------- +s.q@speed-11.encs.concordia.ca BIP 0/32/32 32.08 lx-amd64 + 62679 0.14212 CWLR_DF a_bcdef r 11/10/2021 17:25:19 32 +--------------------------------------------------------------------------------- +s.q@speed-12.encs.concordia.ca BIP 0/32/32 32.10 lx-amd64 + 62749 0.09000 CLOUDY z_abc r 11/11/2021 21:58:12 32 +--------------------------------------------------------------------------------- +s.q@speed-15.encs.concordia.ca BIP 0/4/32 0.03 lx-amd64 + 62753 82.47478 matlabLDPa b_bpxez r 11/12/2021 08:49:52 4 +--------------------------------------------------------------------------------- +s.q@speed-16.encs.concordia.ca BIP 0/32/32 32.31 lx-amd64 + 62751 0.09000 CLOUDY z_abc r 11/12/2021 06:03:54 32 +--------------------------------------------------------------------------------- +s.q@speed-19.encs.concordia.ca BIP 0/32/32 32.22 lx-amd64 +--------------------------------------------------------------------------------- +... 
+--------------------------------------------------------------------------------- +s.q@speed-35.encs.concordia.ca BIP 0/32/32 2.78 lx-amd64 + 62754 7.22952 qlogin-tes a_tiyuu r 11/12/2021 10:31:06 32 +--------------------------------------------------------------------------------- +s.q@speed-36.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 +etc. +\end{verbatim} +\normalsize + +Remember that you only have 30 seconds before the job is essentially over, so +if you do not see a similar output, either adjust the sleep time in the +script, or execute the \tool{qstat} statement more quickly. The \tool{qstat} +output listed above shows you that your job is +running on node \texttt{speed-05}, that it has a job number of 144, that it +was started at 16:39:30 on 12/03/2018, and that it is a single-core job (the +default). + +Once the job finishes, there will be a new file in the directory that the job +was started from, with the syntax of, \texttt{"job name".o"job number"}, so +in this example the file is, qsub \file{test.o144}. This file represents the +standard output (and error, if there is any) of the job in question. If you +look at the contents of your newly created file, you will see that it +contains the output of the, \texttt{module list} command. +Important information is often written to this file. + +Congratulations on your first job! + +% ------------------------------------------------------------------------------ +\subsection{Common Job Management Commands Summary} +\label{sect:job-management-commands} + +Here are useful job-management commands: + +\begin{itemize} +\item +\texttt{qsub ./.sh}: once that your job script is ready, +on \texttt{speed-submit} you can submit it using this + +\item +\texttt{qstat -f -u }: you can check the status of your job(s) + +\item +\texttt{qstat -f -u "*"}: display cluster status for all users. + +\item +\texttt{qstat -j [job-ID]}: display job information for [job-ID] (said job may be actually running, or waiting in the queue). + +\item +\texttt{qdel [job-ID]}: delete job [job-ID]. + +\item +\texttt{qhold [job-ID]}: hold queued job, [job-ID], from running. + +\item +\texttt{qrls [job-ID]}: release held job [job-ID]. + +\item +\texttt{qacct -j [job-ID]}: get job stats. for completed job [job-ID]. \api{maxvmem} is one of the more useful stats. +\end{itemize} + + +% ------------------------------------------------------------------------------ +\subsection{Advanced \tool{qsub} Options} +\label{sect:qsub-options} + +In addition to the basic \tool{qsub} options presented earlier, there are a +few additional options that are generally useful: + +\begin{itemize} +\item +\texttt{-m bea}: requests that the scheduler e-mail you when a job (b)egins; +(e)nds; (a)borts. Mail is sent to the default address of, +\texttt{"username@encs.concordia.ca"}, unless a different address is supplied (see, +\texttt{-M}). The report sent when a job ends includes job +runtime, as well as the maximum memory value hit (\api{maxvmem}). + +\item +\texttt{-M email@domain.com}: requests that the scheduler use this e-mail +notification address, rather than the default (see, \texttt{-m}). + +\item +\texttt{-v variable[=value]}: exports an environment variable that can be used by the script. + +\item +\texttt{-l h\_rt=[hour]:[min]:[sec]}: sets a job runtime of HH:MM:SS. Note +that if you give a single number, that represents \emph{seconds}, not hours. + +\item +\texttt{-hold\_jid [job-ID]}: run this job only when job [job-ID] finishes. Held jobs appear in the queue. 
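+For example, assuming job 1234 is already queued and \texttt{./post-process.sh}
+is a hypothetical follow-up script, the following submission stays on hold until
+job 1234 completes:
+\begin{verbatim}
+qsub -hold_jid 1234 ./post-process.sh
+\end{verbatim}
+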
+The many \tool{qsub} options available are read with, \texttt{man qsub}. Also +note that \tool{qsub} options can be specified during the job-submission +command, and these \emph{override} existing script options (if present). The +syntax is, \texttt{qsub [options] PATHTOSCRIPT}, but unlike in the script, +the options are specified without the leading \verb+#$+ +(e.g., \texttt{qsub -N qsub-test -cwd -l h\_vmem=1G ./tcsh.sh}). + +\end{itemize} + +% ------------------------------------------------------------------------------ +\subsection{Array Jobs} + +Array jobs are those that start a batch job or a parallel job multiple times. +Each iteration of the job array is called a task and receives a unique job ID. + +To submit an array job, use the \texttt{\-t} option of the \texttt{qsub} +command as follows: + +\begin{verbatim} +qsub -t n[-m[:s]] +\end{verbatim} + +\textbf{-t Option Syntax:} +\begin{itemize} +\item +\texttt{n}: indicates the start-id. +\item +\texttt{m}: indicates the max-id. +\item +\texttt{s}: indicates the step size. +\end{itemize} + +\textbf{Examples:} +\begin{itemize} +\item +\texttt{qsub -t 10 array.sh}: submits a job with 1 task where the task-id is 10. +\item +\texttt{qsub -t 1-10 array.sh}: submits a job with 10 tasks numbered consecutively from 1 to 10. +\item +\texttt{qsub -t 3-15:3 array.sh}: submits a jobs with 5 tasks numbered consecutively with step size 3 +(task-ids 3,6,9,12,15). +\end{itemize} + +\textbf{Output files for Array Jobs:} + +The default and output and error-files are \option{job\_name.[o|e]job\_id} and\\ +\option{job\_name.[o|e]job\_id.task\_id}. +% +This means that Speed creates an output and an error-file for each task +generated by the array-job as well as one for the super-ordinate array-job. +To alter this behavior use the \option{-o} and \option{-e} option of +\tool{qsub}. + +For more details about Array Job options, please review the manual pages for +\option{qsub} by executing the following at the command line on speed-submit +\tool{man qsub}. + +% ------------------------------------------------------------------------------ +\subsection{Requesting Multiple Cores (i.e., Multithreading Jobs)} + +For jobs that can take advantage of multiple machine cores, up to 32 cores +(per job) can be requested in your script with: + +\begin{verbatim} +#$ -pe smp [#cores] +\end{verbatim} + +\textbf{Do not request more cores than you think will be useful}, as larger-core +jobs are more difficult to schedule. On the flip side, though, if you +are going to be running a program that scales out to the maximum single-machine +core count available, please (please) request 32 cores, to avoid node +oversubscription (i.e., to avoid overloading the CPUs). + +Core count associated with a job appears under, ``states'', in the, +\texttt{qstat -f -u "*"}, output. + +% ------------------------------------------------------------------------------ +\subsection{Interactive Jobs} + +Job sessions can be interactive, instead of batch (script) based. Such +sessions can be useful for testing and optimising code and resource +requirements prior to batch submission. To request an interactive job +session, use, \texttt{qlogin [options]}, similarly to a +\tool{qsub} command-line job (e.g., \texttt{qlogin -N qlogin-test -l h\_vmem=1G}). +Note that the options that are available for \tool{qsub} are not necessarily +available for \tool{qlogin}, notably, \texttt{-cwd}, and, \texttt{-v}. 
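+
+For illustration, a minimal sketch of an interactive session follows; the module
+loaded here (\texttt{gurobi/8.1.0}) is only an example, so load whatever your own
+work requires:
+
+\begin{verbatim}
+qlogin -N qlogin-test -l h_vmem=1G
+module load gurobi/8.1.0
+module list
+exit
+\end{verbatim}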
+ +% ------------------------------------------------------------------------------ +\subsection{Scheduler Environment Variables} + +The scheduler presents a number of environment variables that can be used in +your jobs. Three of the more useful are \api{TMPDIR}, \api{SGE\_O\_WORKDIR}, +and \api{NSLOTS}: + +\begin{itemize} +\item +\api{\$TMPDIR}=the path to the job's temporary space on the node. It +\emph{only} exists for the duration of the job, so if data in the temporary space +are important, they absolutely need to be accessed before the job terminates. + +\item +\api{\$SGE\_O\_WORKDIR}=the path to the job's working directory (likely an +NFS-mounted path). If, \texttt{-cwd}, was stipulated, that path is taken; if not, +the path defaults to your home directory. + +\item +\api{\$NSLOTS}=the number of cores requested for the job. This variable can +be used in place of hardcoded thread-request declarations. + +\end{itemize} + +\noindent +In \xf{fig:tmpdir.sh} is a sample script, using all three. + +\begin{figure}[htpb] + \lstinputlisting[language=csh,frame=single,basicstyle=\footnotesize\ttfamily]{tmpdir.sh} + \caption{Source code for \file{tmpdir.sh}} + \label{fig:tmpdir.sh} +\end{figure} diff --git a/doc/scheduler-tips.tex b/doc/scheduler-tips.tex new file mode 100644 index 0000000..2d49108 --- /dev/null +++ b/doc/scheduler-tips.tex @@ -0,0 +1,32 @@ +% ------------------------------------------------------------------------------ +\subsection{Tips/Tricks} +\label{sect:tips} + +\begin{itemize} +\item +Files/scripts must have Linux line breaks in them (not Windows ones). +\item +Use \tool{rsync}, not \tool{scp}, when moving data around. +\item +If you are going to move many many files between NFS-mounted storage and the +cluster, \tool{tar} everything up first. +\item +If you intend to use a different shell (e.g., \tool{bash}~\cite{aosa-book-vol1-bash}), +you will need to source a different scheduler file, and will need to +change the shell declaration in your script(s). +\item +The load displayed in \tool{qstat} by default is \api{np\_load}, which is +load/\#cores. That means that a load of, ``1'', which represents a fully active +core, is displayed as $0.03$ on the node in question, as there are 32 cores +on a node. To display load ``as is'' (such that a node with a fully active +core displays a load of approximately $1.00$), add the following to your +\file{.tcshrc} file: \texttt{setenv SGE\_LOAD\_AVG load\_avg} + +\item +Try to request resources that closely match what your job will use: +requesting many more cores or much more memory than will be needed makes a +job more difficult to schedule when resources are scarce. + +\item +E-mail, \texttt{rt-ex-hpc AT encs.concordia.ca}, with any concerns/questions. +\end{itemize} diff --git a/doc/speed-manual.tex b/doc/speed-manual.tex index 9d8a890..431ac16 100644 --- a/doc/speed-manual.tex +++ b/doc/speed-manual.tex @@ -300,93 +300,8 @@ \subsubsection{SSH Connections} commonly used commands. % ------------------------------------------------------------------------------ -\subsubsection{Environment Set Up} -\label{sect:envsetup} - -After creating an SSH connection to ``Speed'', you will need to source -the ``Altair Grid Engine (AGE)'' scheduler's settings file. -Sourcing the settings file will set the environment variables required to -execute scheduler commands. - -Based on the UNIX shell type, choose one of the following commands to source -the settings file. 
- -csh/\tool{tcsh}: -\begin{verbatim} -source /local/pkg/uge-8.6.3/root/default/common/settings.csh -\end{verbatim} - -Bourne shell/\tool{bash}: -\begin{verbatim} -. /local/pkg/uge-8.6.3/root/default/common/settings.sh -\end{verbatim} - -In order to set up the default ENCS bash shell, executing the following command -is also required: -\begin{verbatim} -printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile -\end{verbatim} - -To verify that you have access to the scheduler commands execute -\texttt{qstat -f -u "*"}. If an error is returned, attempt sourcing -the settings file again. - -The next step is to copy a job template to your home directory and to set up your -cluster-specific storage. Execute the following command from within your -home directory. (To move to your home directory, type \texttt{cd} at the Linux -prompt and press \texttt{Enter}.) - -\begin{verbatim} -cp /home/n/nul-uge/template.sh . && mkdir /speed-scratch/$USER -\end{verbatim} - -\textbf{Tip:} Add the source command to your shell-startup script. - -\textbf{Tip:} the default shell for GCS ENCS users is \tool{tcsh}. -If you would like to use \tool{bash}, please contact -\texttt{rt-ex-hpc AT encs.concordia.ca}. - -For \textbf{new ENCS Users}, and/or those who don't have a shell-startup script, -based on your shell type use one of the following commands to copy a start up script -from \texttt{nul-uge}'s. home directory to your home directory. (To move to your home -directory, type \tool{cd} at the Linux prompt and press \texttt{Enter}.) - -csh/\tool{tcsh}: -\begin{verbatim} -cp /home/n/nul-uge/.tcshrc . -\end{verbatim} - -Bourne shell/\tool{bash}: -\begin{verbatim} -cp /home/n/nul-uge/.bashrc . -\end{verbatim} - -Users who already have a shell-startup script, use a text editor, such as -\tool{vim} or \tool{emacs}, to add the source request to your existing -shell-startup environment (i.e., to the \file{.tcshrc} file in your home directory). - -csh/\tool{tcsh}: -Sample \file{.tcshrc} file: -\begin{verbatim} -# Speed environment set up -if ($HOSTNAME == speed-submit.encs.concordia.ca) then - source /local/pkg/uge-8.6.3/root/default/common/settings.csh -endif -\end{verbatim} - -Bourne shell/\tool{bash}: -Sample \file{.bashrc} file: -\begin{verbatim} -# Speed environment set up -if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then - . /local/pkg/uge-8.6.3/root/default/common/settings.sh - printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile -fi -\end{verbatim} - -Note that you will need to either log out and back in, or execute a new shell, -for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to be applied -(\textbf{important}). +% TMP scheduler-specific section +\input{scheduler-env} % ------------------------------------------------------------------------------ \subsection{Job Submission Basics} @@ -405,56 +320,8 @@ \subsection{Job Submission Basics} \end{itemize} % ------------------------------------------------------------------------------ -\subsubsection{Directives} - -Directives are comments included at the beginning of a job script that set the shell -and the options for the job scheduler. - -The shebang directive is always the first line of a script. In your job script, -this directive sets which shell your script's commands will run in. On ``Speed'', -we recommend that your script use a shell from the \texttt{/encs/bin} directory. 
- -To use the \texttt{tcsh} shell, start your script with: \verb|#!/encs/bin/tcsh| - -For \texttt{bash}, start with: \verb|#!/encs/bin/bash| - -Directives that start with \verb|"#$"|, set the options for the cluster's -``Altair Grid Engine (AGE)'' scheduler. The script template, \file{template.sh}, -provides the essentials: - -\begin{verbatim} -#$ -N -#$ -cwd -#$ -m bea -#$ -pe smp -#$ -l h_vmem=G -\end{verbatim} - -Replace, \verb++, with the name that you want your cluster job to have; -\option{-cwd}, makes the current working directory the ``job working directory'', -and your standard output file will appear here; \option{-m bea}, provides e-mail -notifications (begin/end/abort); replace, \verb++, with the degree of -(multithreaded) parallelism (i.e., cores) you attach to your job (up to 32), -be sure to delete or comment out the \verb| #$ -pe smp | parameter if it -is not relevant; replace, \verb++, with the value (in GB), that you want -your job's memory space to be (up to 500), and all jobs MUST have a memory-space -assignment. - -If you are unsure about memory footprints, err on assigning a generous -memory space to your job so that it does not get prematurely terminated -(the value given to \api{h\_vmem} is a hard memory ceiling). You can refine -\api{h\_vmem} values for future jobs by monitoring the size of a job's active -memory space on \texttt{speed-submit} with: - -\begin{verbatim} -qstat -j | grep maxvmem -\end{verbatim} - -Memory-footprint values are also provided for completed jobs in the final -e-mail notification (as, ``Max vmem''). - -\emph{Jobs that request a low-memory footprint are more likely to load on a busy -cluster.} +% TMP scheduler-specific section +\input{scheduler-directives} % ------------------------------------------------------------------------------ \subsubsection{Module Loads} @@ -506,323 +373,8 @@ \subsubsection{Module Loads} Typically, only the \texttt{module load} command will be used in your script. % ------------------------------------------------------------------------------ -\subsubsection{User Scripting} - -The last part the job script is the scripting that will be executed by the job. -This part of the job script includes all commands required to set up and -execute the task your script has been written to do. Any Linux command can be used -at this step. This section can be a simple call to an executable or a complex -loop which iterates through a series of commands. - -Every software program has a unique execution framework. It is the responsibility -of the script's author (e.g., you) to know what is required for the software used -in your script by reviewing the software's documentation. Regardless of which software -your script calls, your script should be written so that the software knows the -location of the input and output files as well as the degree of parallelism. -Note that the cluster-specific environment variable, \api{NSLOTS}, resolves -to the value provided to the scheduler in the \option{-pe smp} option. - -Jobs which touch data-input and data-output files more than once, should make use -of \api{TMPDIR}, a scheduler-provided working space almost 1~TB in size. -\api{TMPDIR} is created when a job starts, and exists on the local disk of the -compute node executing your job. Using \api{TMPDIR} results in faster I/O operations -than those to and from shared storage (which is provided over NFS). 
- -An sample job script using \api{TMPDIR} is available at \texttt{/home/n/nul-uge/templateTMPDIR.sh}: -the job is instructed to change to \api{\$TMPDIR}, to make the new directory \texttt{input}, to copy data from -\texttt{\$SGE\_O\_WORKDIR/references/} to \texttt{input/} (\texttt{\$SGE\_O\_WORKDIR} represents the -current working directory), to make the new directory \texttt{results}, to -execute the program (which takes input from \texttt{\$TMPDIR/input/} and writes -output to \texttt{\$TMPDIR/results/}), and finally to copy the total end results -to an existing directory, \texttt{processed}, that is located in the current -working directory. TMPDIR only exists for the duration of the job, though, -so it is very important to copy relevant results from it at job's end. - -% ------------------------------------------------------------------------------ -\subsection{Sample Job Script} - -Now, let's look at a basic job script, \file{tcsh.sh} in \xf{fig:tcsh.sh} -(you can copy it from our GitHub page or from \texttt{/home/n/nul-uge}). - -\begin{figure}[htpb] - \lstinputlisting[language=csh,frame=single,basicstyle=\ttfamily]{tcsh.sh} - \caption{Source code for \file{tcsh.sh}} - \label{fig:tcsh.sh} -\end{figure} - -The first line is the shell declaration (also know as a shebang) and sets the shell to \emph{tcsh}. -The lines that begin with \texttt{\#\$} are directives for the scheduler. - -\begin{itemize} - \item \texttt{-N} sets \emph{qsub-test} as the jobname - \item \texttt{-cwd} tells the scheduler to execute the job from the current working directory - \item \texttt{-l h\_vmem=1GB} requests and assigns 1GB of memory to the job. CPU jobs \emph{require} the \texttt{-l h\_vmem} option to be set. -\end{itemize} - -The script then: - -\begin{itemize} - \item Sleeps on a node for 30 seconds - \item Uses the \tool{module} command to load the \texttt{gurobi/8.1.0} environment - \item Prints the list of loaded modules into a file -\end{itemize} - -The scheduler command, \tool{qsub}, is used to submit (non-interactive) jobs. -From an ssh session on speed-submit, submit this job with \texttt{qsub ./tcsh.sh}. You will see, -\texttt{"Your job X ("qsub-test") has been submitted"}. The command, \tool{qstat}, can be used -to look at the status of the cluster: \texttt{qstat -f -u "*"}. You will see -something like this: - -\small -\begin{verbatim} -queuename qtype resv/used/tot. 
load_avg arch states ---------------------------------------------------------------------------------- -a.q@speed-01.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -a.q@speed-03.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -a.q@speed-25.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -a.q@speed-27.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.02 lx-amd64 - 144 100.00000 qsub-test nul-uge r 12/03/2018 16:39:30 1 - 62624 0.09843 case_talle x_yzabc r 11/09/2021 16:50:09 32 ---------------------------------------------------------------------------------- -g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -s.q@speed-07.encs.concordia.ca BIP 0/0/32 0.04 lx-amd64 ---------------------------------------------------------------------------------- -s.q@speed-08.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -s.q@speed-09.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -s.q@speed-10.encs.concordia.ca BIP 0/32/32 32.72 lx-amd64 - 62624 0.09843 case_talle x_yzabc r 11/09/2021 16:50:09 32 ---------------------------------------------------------------------------------- -s.q@speed-11.encs.concordia.ca BIP 0/32/32 32.08 lx-amd64 - 62679 0.14212 CWLR_DF a_bcdef r 11/10/2021 17:25:19 32 ---------------------------------------------------------------------------------- -s.q@speed-12.encs.concordia.ca BIP 0/32/32 32.10 lx-amd64 - 62749 0.09000 CLOUDY z_abc r 11/11/2021 21:58:12 32 ---------------------------------------------------------------------------------- -s.q@speed-15.encs.concordia.ca BIP 0/4/32 0.03 lx-amd64 - 62753 82.47478 matlabLDPa b_bpxez r 11/12/2021 08:49:52 4 ---------------------------------------------------------------------------------- -s.q@speed-16.encs.concordia.ca BIP 0/32/32 32.31 lx-amd64 - 62751 0.09000 CLOUDY z_abc r 11/12/2021 06:03:54 32 ---------------------------------------------------------------------------------- -s.q@speed-19.encs.concordia.ca BIP 0/32/32 32.22 lx-amd64 ---------------------------------------------------------------------------------- -... ---------------------------------------------------------------------------------- -s.q@speed-35.encs.concordia.ca BIP 0/32/32 2.78 lx-amd64 - 62754 7.22952 qlogin-tes a_tiyuu r 11/12/2021 10:31:06 32 ---------------------------------------------------------------------------------- -s.q@speed-36.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 -etc. -\end{verbatim} -\normalsize - -Remember that you only have 30 seconds before the job is essentially over, so -if you do not see a similar output, either adjust the sleep time in the -script, or execute the \tool{qstat} statement more quickly. The \tool{qstat} -output listed above shows you that your job is -running on node \texttt{speed-05}, that it has a job number of 144, that it -was started at 16:39:30 on 12/03/2018, and that it is a single-core job (the -default). 
- -Once the job finishes, there will be a new file in the directory that the job -was started from, with the syntax of, \texttt{"job name".o"job number"}, so -in this example the file is, qsub \file{test.o144}. This file represents the -standard output (and error, if there is any) of the job in question. If you -look at the contents of your newly created file, you will see that it -contains the output of the, \texttt{module list} command. -Important information is often written to this file. - -Congratulations on your first job! - -% ------------------------------------------------------------------------------ -\subsection{Common Job Management Commands Summary} -\label{sect:job-management-commands} - -Here are useful job-management commands: - -\begin{itemize} -\item -\texttt{qsub ./.sh}: once that your job script is ready, -on \texttt{speed-submit} you can submit it using this - -\item -\texttt{qstat -f -u }: you can check the status of your job(s) - -\item -\texttt{qstat -f -u "*"}: display cluster status for all users. - -\item -\texttt{qstat -j [job-ID]}: display job information for [job-ID] (said job may be actually running, or waiting in the queue). - -\item -\texttt{qdel [job-ID]}: delete job [job-ID]. - -\item -\texttt{qhold [job-ID]}: hold queued job, [job-ID], from running. - -\item -\texttt{qrls [job-ID]}: release held job [job-ID]. - -\item -\texttt{qacct -j [job-ID]}: get job stats. for completed job [job-ID]. \api{maxvmem} is one of the more useful stats. -\end{itemize} - - -% ------------------------------------------------------------------------------ -\subsection{Advanced \tool{qsub} Options} -\label{sect:qsub-options} - -In addition to the basic \tool{qsub} options presented earlier, there are a -few additional options that are generally useful: - -\begin{itemize} -\item -\texttt{-m bea}: requests that the scheduler e-mail you when a job (b)egins; -(e)nds; (a)borts. Mail is sent to the default address of, -\texttt{"username@encs.concordia.ca"}, unless a different address is supplied (see, -\texttt{-M}). The report sent when a job ends includes job -runtime, as well as the maximum memory value hit (\api{maxvmem}). - -\item -\texttt{-M email@domain.com}: requests that the scheduler use this e-mail -notification address, rather than the default (see, \texttt{-m}). - -\item -\texttt{-v variable[=value]}: exports an environment variable that can be used by the script. - -\item -\texttt{-l h\_rt=[hour]:[min]:[sec]}: sets a job runtime of HH:MM:SS. Note -that if you give a single number, that represents \emph{seconds}, not hours. - -\item -\texttt{-hold\_jid [job-ID]}: run this job only when job [job-ID] finishes. Held jobs appear in the queue. -The many \tool{qsub} options available are read with, \texttt{man qsub}. Also -note that \tool{qsub} options can be specified during the job-submission -command, and these \emph{override} existing script options (if present). The -syntax is, \texttt{qsub [options] PATHTOSCRIPT}, but unlike in the script, -the options are specified without the leading \verb+#$+ -(e.g., \texttt{qsub -N qsub-test -cwd -l h\_vmem=1G ./tcsh.sh}). - -\end{itemize} - -% ------------------------------------------------------------------------------ -\subsection{Array Jobs} - -Array jobs are those that start a batch job or a parallel job multiple times. -Each iteration of the job array is called a task and receives a unique job ID. 
-
-To submit an array job, use the \texttt{-t} option of the \texttt{qsub}
-command as follows:
-
-\begin{verbatim}
-qsub -t n[-m[:s]]
-\end{verbatim}
-
-\textbf{-t Option Syntax:}
-\begin{itemize}
-\item
-\texttt{n}: indicates the start-id.
-\item
-\texttt{m}: indicates the max-id.
-\item
-\texttt{s}: indicates the step size.
-\end{itemize}
-
-\textbf{Examples:}
-\begin{itemize}
-\item
-\texttt{qsub -t 10 array.sh}: submits a job with 1 task where the task-id is 10.
-\item
-\texttt{qsub -t 1-10 array.sh}: submits a job with 10 tasks numbered consecutively from 1 to 10.
-\item
-\texttt{qsub -t 3-15:3 array.sh}: submits a job with 5 tasks numbered consecutively with step size 3
-(task-ids 3,6,9,12,15).
-\end{itemize}
-
-\textbf{Output files for Array Jobs:}
-
-The default output and error-files are \option{job\_name.[o|e]job\_id} and\\
-\option{job\_name.[o|e]job\_id.task\_id}.
-%
-This means that Speed creates an output and an error-file for each task
-generated by the array-job, as well as one for the super-ordinate array-job.
-To alter this behavior, use the \option{-o} and \option{-e} options of
-\tool{qsub}.
-
-For more details about array job options, please review the manual pages for
-\tool{qsub} by executing \texttt{man qsub} at the command line on
-\texttt{speed-submit}.
-
-% ------------------------------------------------------------------------------
-\subsection{Requesting Multiple Cores (i.e., Multithreading Jobs)}
-
-For jobs that can take advantage of multiple machine cores, up to 32 cores
-(per job) can be requested in your script with:
-
-\begin{verbatim}
-#$ -pe smp [#cores]
-\end{verbatim}
-
-\textbf{Do not request more cores than you think will be useful}, as larger-core
-jobs are more difficult to schedule. On the flip side, though, if you
-are going to be running a program that scales out to the maximum single-machine
-core count available, please (please) request 32 cores, to avoid node
-oversubscription (i.e., to avoid overloading the CPUs).
-
-The core count associated with a job appears under ``states'' in the
-\texttt{qstat -f -u "*"} output.
-
-% ------------------------------------------------------------------------------
-\subsection{Interactive Jobs}
-
-Job sessions can be interactive, instead of batch (script) based. Such
-sessions can be useful for testing and optimising code and resource
-requirements prior to batch submission. To request an interactive job
-session, use \texttt{qlogin [options]}, similarly to a
-\tool{qsub} command-line job (e.g., \texttt{qlogin -N qlogin-test -l h\_vmem=1G}).
-Note that the options that are available for \tool{qsub} are not necessarily
-available for \tool{qlogin}, notably \texttt{-cwd} and \texttt{-v}.
-
-% ------------------------------------------------------------------------------
-\subsection{Scheduler Environment Variables}
-
-The scheduler presents a number of environment variables that can be used in
-your jobs. Three of the more useful are \api{TMPDIR}, \api{SGE\_O\_WORKDIR},
-and \api{NSLOTS}:
-
-\begin{itemize}
-\item
-\api{\$TMPDIR}=the path to the job's temporary space on the node. It
-\emph{only} exists for the duration of the job, so if data in the temporary space
-are important, they absolutely need to be accessed before the job terminates.
-
-\item
-\api{\$SGE\_O\_WORKDIR}=the path to the job's working directory (likely an
-NFS-mounted path). If \texttt{-cwd} was stipulated, that path is taken; if not,
-the path defaults to your home directory.
-
-\item
-\api{\$NSLOTS}=the number of cores requested for the job. This variable can
-be used in place of hardcoded thread-request declarations.
-
-\end{itemize}
-
-\noindent
-A sample script using all three variables is shown in \xf{fig:tmpdir.sh}.
-
-\begin{figure}[htpb]
-    \lstinputlisting[language=csh,frame=single,basicstyle=\footnotesize\ttfamily]{tmpdir.sh}
-    \caption{Source code for \file{tmpdir.sh}}
-    \label{fig:tmpdir.sh}
-\end{figure}
+% TMP scheduler-specific section
+\input{scheduler-scripting}
 
 % ------------------------------------------------------------------------------
 \subsection{SSH Keys For MPI}
@@ -903,225 +455,9 @@ \subsubsection{Anaconda}
 anaconda's repository.
 
 \vspace{10pt}
-
-% ------------------------------------------------------------------------------
-\subsection{Example Job Script: Fluent}
-
-\begin{figure}[htpb]
-    \lstinputlisting[language=csh,frame=single,basicstyle=\footnotesize\ttfamily]{fluent.sh}
-    \caption{Source code for \file{fluent.sh}}
-    \label{fig:fluent.sh}
-\end{figure}
-
-The job script in \xf{fig:fluent.sh} runs Fluent in parallel over 32 cores.
-Of note, we have requested e-mail notifications (\texttt{-m}), are defining the
-parallel environment for \tool{fluent} with \texttt{-sgepe smp} (\textbf{very
-important}), and are setting \api{\$TMPDIR} as the in-job location for the
-``moment'' \file{rfile.out} file (in-job, because the last line of the script
-copies everything from \api{\$TMPDIR} to a directory in the user's NFS-mounted home).
-Job progress can be monitored by examining the standard-out file (e.g.,
-\file{flu10000.o249}), and/or by examining the ``moment'' file in
-\texttt{/disk/nobackup/} (hint: it starts with your job-ID) on the node running
-the job. \textbf{Caveat:} take care with journal-file paths.
-
-% ------------------------------------------------------------------------------
-\subsection{Example Job: efficientdet}
-
-The following steps, which describe how to create an efficientdet environment on
-\emph{Speed}, were submitted by a member of Dr. Amer's research group.
-
-\begin{itemize}
-  \item
-  Enter your ENCS user account's speed-scratch directory:
-  \verb!cd /speed-scratch/!
-  \item
-  Load python: \verb!module load python/3.8.3!
-  \item
-  Create a virtual environment: \verb!python3 -m venv !
-  \item
-  Activate the virtual environment: \verb!source /bin/activate.csh!
-  \item
-  Install the DL packages for Efficientdet:
-\end{itemize}
-\begin{verbatim}
-pip install tensorflow==2.7.0
-pip install lxml>=4.6.1
-pip install absl-py>=0.10.0
-pip install matplotlib>=3.0.3
-pip install numpy>=1.19.4
-pip install Pillow>=6.0.0
-pip install PyYAML>=5.1
-pip install six>=1.15.0
-pip install tensorflow-addons>=0.12
-pip install tensorflow-hub>=0.11
-pip install neural-structured-learning>=1.3.1
-pip install tensorflow-model-optimization>=0.5
-pip install Cython>=0.29.13
-pip install git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI
-\end{verbatim}
-
-% ------------------------------------------------------------------------------
-\subsection{Java Jobs}
-
-Jobs that call \tool{java} have a memory overhead, which needs to be taken
-into account when assigning a value to \api{h\_vmem}. Even the most basic
-\tool{java} call, \texttt{java -Xmx1G -version}, will need to have
-\texttt{-l h\_vmem=5G}, with the 4-GB difference representing the memory overhead.
-Note that this memory overhead grows proportionally with the value of \texttt{-Xmx}.
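-
-For instance, a job script for a \tool{java} run with a 6-GB heap might pair
-the two values roughly as follows (the 10-GB \api{h\_vmem} figure is only an
-illustrative margin, and \file{myprogram.jar} is a placeholder):
-
-\begin{verbatim}
-#$ -N java-job
-#$ -cwd
-#$ -l h_vmem=10G
-# keep -Xmx well below h_vmem to leave room for the JVM overhead
-java -Xmx6G -jar myprogram.jar
-\end{verbatim}
-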
To give you an idea, when \texttt{-Xmx} has a value of 100G, -\api{h\_vmem} has to be at least 106G; for 200G, at least 211G; for 300G, at least 314G. - -% TODO: add a MARF Java job - -% ------------------------------------------------------------------------------ -\subsection{Scheduling On The GPU Nodes} - -The primary cluster has two GPU nodes, each with six Tesla (CUDA-compatible) P6 -cards: each card has 2048 cores and 16GB of RAM. Though note that the P6 -is mainly a single-precision card, so unless you need the GPU double -precision, double-precision calculations will be faster on a CPU node. - -Job scripts for the GPU queue differ in that they do not need these -statements: - -\begin{verbatim} -#$ -pe smp -#$ -l h_vmem=G -\end{verbatim} - -But do need this statement, which attaches either a single GPU, or, two -GPUs, to the job: - -\begin{verbatim} -#$ -l gpu=[1|2] -\end{verbatim} - -Single-GPU jobs are granted 5~CPU cores and 80GB of system memory, and -dual-GPU jobs are granted 10~CPU cores and 160GB of system memory. A -total of \emph{four} GPUs can be actively attached to any one user at any given -time. - -Once that your job script is ready, you can submit it to the GPU queue -with: - -\begin{verbatim} -qsub -q g.q ./.sh -\end{verbatim} - -And you can query \tool{nvidia-smi} on the node that is running your job with: - -\begin{verbatim} -ssh @speed[-05|-17] nvidia-smi -\end{verbatim} - -Status of the GPU queue can be queried with: - -\begin{verbatim} -qstat -f -u "*" -q g.q -\end{verbatim} - -\textbf{Very important note} regarding TensorFlow and PyTorch: -if you are planning to run TensorFlow and/or PyTorch multi-GPU jobs, -do not use the \api{tf.distribute} and/or\\ -\api{torch.nn.DataParallel} -functions, as they will crash the compute node (100\% certainty). -This appears to be the current hardware's architecture's defect. -% -The workaround is to either -% TODO: Need to link to that example -manually effect GPU parallelisation (TensorFlow has an example on how to -do this), or to run on a single GPU. - -\vspace{10pt} -\noindent -\textbf{Important} -\vspace{10pt} - -Users without permission to use the GPU nodes can submit jobs to the \texttt{g.q} -queue but those jobs will hang and never run. - -There are two GPUs in both \texttt{speed-05} and \texttt{speed-17}, and one -in \texttt{speed-19}. Their availability is seen with, \texttt{qstat -F g} -(note the capital): - -\small -\begin{verbatim} -queuename qtype resv/used/tot. load_avg arch states ---------------------------------------------------------------------------------- -... ---------------------------------------------------------------------------------- -g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.04 lx-amd64 - hc:gpu=6 ---------------------------------------------------------------------------------- -g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 - hc:gpu=6 ---------------------------------------------------------------------------------- -... ---------------------------------------------------------------------------------- -s.q@speed-19.encs.concordia.ca BIP 0/32/32 32.37 lx-amd64 - hc:gpu=1 ---------------------------------------------------------------------------------- -etc. -\end{verbatim} -\normalsize - -This status demonstrates that all five are available (i.e., have not been -requested as resources). To specifically request a GPU node, add, -\texttt{-l g=[\#GPUs]}, to your \tool{qsub} (statement/script) or -\tool{qlogin} (statement) request. 
For example, -\texttt{qsub -l h\_vmem=1G -l g=1 ./count.sh}. You -will see that this job has been assigned to one of the GPU nodes: - -\small -\begin{verbatim} -queuename qtype resv/used/tot. load_avg arch states ---------------------------------------------------------------------------------- -g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 ---------------------------------------------------------------------------------- -g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 ---------------------------------------------------------------------------------- -s.q@speed-19.encs.concordia.ca BIP 0/1/32 0.04 lx-amd64 hc:gpu=0 (haff=1.000000) - 538 100.00000 count.sh sbunnell r 03/07/2019 02:39:39 1 ---------------------------------------------------------------------------------- -etc. -\end{verbatim} -\normalsize - -And that there are no more GPUs available on that node (\texttt{hc:gpu=0}). Note -that no more than two GPUs can be requested for any one job. - -% ------------------------------------------------------------------------------ -\subsubsection{CUDA} - -When calling \tool{CUDA} within job scripts, it is important to create a link to -the desired \tool{CUDA} libraries and set the runtime link path to the same libraries. -For example, to use the \texttt{cuda-11.5} libraries, specify the following in -your Makefile. - -\begin{verbatim} --L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64 -\end{verbatim} - -In your job script, specify the version of \texttt{gcc} to use prior to calling -cuda. For example: - \texttt{module load gcc/8.4} -or - \texttt{module load gcc/9.3} - -% ------------------------------------------------------------------------------ -\subsubsection{Special Notes for sending CUDA jobs to the GPU Queue} - -It is not possible to create a \texttt{qlogin} session on to a node in the -\textbf{GPU Queue} (\texttt{g.q}). As direct logins to these nodes is not -available, jobs must be submitted to the \textbf{GPU Queue} in order to compile -and link. - -We have several versions of CUDA installed in: -\begin{verbatim} -/encs/pkg/cuda-11.5/root/ -/encs/pkg/cuda-10.2/root/ -/encs/pkg/cuda-9.2/root -\end{verbatim} - -For CUDA to compile properly for the GPU queue, edit your Makefile -replacing \option{\/usr\/local\/cuda} with one of the above. +% TMP scheduler-specific section +\input{scheduler-job-examples} % ------------------------------------------------------------------------------ \section{Conclusion} @@ -1167,43 +503,14 @@ \subsection{Important Limitations} spent on the node(s)) and CPU activity (on the node(s)). \item -Jobs should NEVER be run outside of the province of the scheduler. Repeat -offenders risk loss of cluster access. +Jobs should NEVER be run outside of the province of the scheduler. +Repeat offenders risk loss of cluster access. \end{itemize} % ------------------------------------------------------------------------------ -\subsection{Tips/Tricks} -\label{sect:tips} - -\begin{itemize} -\item -Files/scripts must have Linux line breaks in them (not Windows ones). -\item -Use \tool{rsync}, not \tool{scp}, when moving data around. -\item -If you are going to move many many files between NFS-mounted storage and the -cluster, \tool{tar} everything up first. -\item -If you intend to use a different shell (e.g., \tool{bash}~\cite{aosa-book-vol1-bash}), -you will need to source a different scheduler file, and will need to -change the shell declaration in your script(s). 
-\item -The load displayed in \tool{qstat} by default is \api{np\_load}, which is -load/\#cores. That means that a load of, ``1'', which represents a fully active -core, is displayed as $0.03$ on the node in question, as there are 32 cores -on a node. To display load ``as is'' (such that a node with a fully active -core displays a load of approximately $1.00$), add the following to your -\file{.tcshrc} file: \texttt{setenv SGE\_LOAD\_AVG load\_avg} - -\item -Try to request resources that closely match what your job will use: -requesting many more cores or much more memory than will be needed makes a -job more difficult to schedule when resources are scarce. - -\item -E-mail, \texttt{rt-ex-hpc AT encs.concordia.ca}, with any concerns/questions. -\end{itemize} +% TMP scheduler-specific section +\input{scheduler-tips} % ------------------------------------------------------------------------------ \subsection{Use Cases} @@ -1259,7 +566,7 @@ \subsection{Acknowledgments} \begin{itemize} \item -The first 6 versions of this manual and early job script samples, +The first 6 (to 6.5) versions of this manual and early UGE job script samples, Singularity testing and user support were produced/done by Dr.~Scott Bunnell during his time at Concordia as a part of the NAG/HPC group. We thank him for his contributions. @@ -1292,200 +599,8 @@ \subsection{Phase 1} \end{itemize} % ------------------------------------------------------------------------------ -\section{Frequently Asked Questions} -\label{sect:faqs} - -% ------------------------------------------------------------------------------ -\subsection{Where do I learn about Linux?} - -All Speed users are expected to have a basic understanding of Linux and its commonly used commands. - -% ------------------------------------------------------------------------------ -\subsubsection*{Software Carpentry} - -Software Carpentry provides free resources to learn software, including a workshop on the Unix shell. -\url{https://software-carpentry.org/lessons/} - -% ------------------------------------------------------------------------------ -\subsubsection*{Udemy} - -There are a number of Udemy courses, including free ones, that will assist -you in learning Linux. Active Concordia faculty, staff and students have -access to Udemy courses such as \textbf{Linux Mastery: Master the Linux -Command Line in 11.5 Hours} is a good starting point for beginners. Visit -\url{https://www.concordia.ca/it/services/udemy.html} to learn how Concordians -may access Udemy. - -% ------------------------------------------------------------------------------ -\subsection{How to use the ``bash shell'' on Speed?} - -This section describes how to use the ``bash shell'' on Speed. Review -\xs{sect:envsetup} to ensure that your bash environment is set up. - -% ------------------------------------------------------------------------------ -\subsubsection{How do I set bash as my login shell?} - -In order to set your login shell to bash on Speed, your login shell on all GCS servers must be changed to bash. -To make this change, create a ticket with the Service Desk (or email help at concordia.ca) to request that bash become your default login shell for your ENCS user account on all GCS servers. - -% ------------------------------------------------------------------------------ -\subsubsection{How do I move into a bash shell on Speed?} - -To move to the bash shell, type \textbf{bash} at the command prompt. 
-
-For example:
-\begin{verbatim}
-    [speed-27] [/home/a/a_user] > bash
-    bash-4.4$ echo $0
-    bash
-\end{verbatim}
-
-Note how the command prompt changed from \verb![speed-27] [/home/a/a_user] >! to \verb!bash-4.4$! after entering the bash shell.
-
-% ------------------------------------------------------------------------------
-\subsubsection{How do I run scripts written in bash on Speed?}
-
-To execute bash scripts on Speed:
-\begin{enumerate}
-  \item
-Ensure that the shebang of your bash job script is \verb+#!/encs/bin/bash+
-  \item
-Use the \tool{qsub} command to submit your job script to the scheduler.
-\end{enumerate}
-
-The Speed GitHub contains a sample \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{bash job script}.
-
-% ------------------------------------------------------------------------------
-\subsection{How to resolve ``Disk quota exceeded'' errors?}
-
-% ------------------------------------------------------------------------------
-\subsubsection{Probable Cause}
-
-The \texttt{``Disk quota exceeded''} error occurs when your application has run out of disk space to write to. On Speed, this error can be returned when:
-\begin{enumerate}
-  \item
-The \texttt{/tmp} directory on the Speed node that your application is running on is full and cannot be written to.
-  \item
-Your NFS-provided home directory is full and cannot be written to.
-\end{enumerate}
-
-% ------------------------------------------------------------------------------
-\subsubsection{Possible Solutions}
-
-\begin{enumerate}
-  \item
-Use the \textbf{-cwd} job script option to set the directory from which the job script
- was submitted as the \texttt{job working directory}. The \texttt{job working directory} is the directory in which the job will write its output files.
-  \item
-The use of local disk space is generally recommended for I/O-intensive operations. However, as the size of \texttt{/tmp} on Speed nodes
-is \texttt{1GB}, it can be necessary for scripts to store temporary data
-elsewhere.
-Review the documentation for each module called within your script to
-determine how to set working directories for that application.
-The basic steps for this solution are:
-\begin{itemize}
-  \item
-  Review the documentation on how to set working directories for
-  each module called by the job script.
-  \item
-  Create a working directory in speed-scratch for output files.
-  For example, this command will create a subdirectory called \textbf{output}
-  in your \verb!speed-scratch! directory:
-  \begin{verbatim}
-  mkdir -m 750 /speed-scratch/$USER/output
-  \end{verbatim}
-  \item
-  To create a subdirectory for recovery files:
-  \begin{verbatim}
-  mkdir -m 750 /speed-scratch/$USER/recovery
-  \end{verbatim}
-  \item
-  Update the job script to write output to the subdirectories you created in your \verb!speed-scratch! directory, e.g., \verb!/speed-scratch/$USER/output!.
-  \end{itemize}
-\end{enumerate}
-In the above example, \verb!$USER! is an environment variable containing your ENCS username.
-
-% ------------------------------------------------------------------------------
-\subsubsection{Example of setting working directories for \tool{COMSOL}}
-
-\begin{itemize}
-  \item
-  Create directories for recovery, temporary, and configuration files.
-  For example, to create these directories for your ENCS user account:
-  \begin{verbatim}
-  mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
-  \end{verbatim}
-  \item
-  Add the following command switches to the COMSOL command to use the
-  directories created above:
-  \begin{verbatim}
-  -recoverydir /speed-scratch/$USER/comsol/recovery
-  -tmpdir /speed-scratch/$USER/comsol/tmp
-  -configuration /speed-scratch/$USER/comsol/config
-  \end{verbatim}
-\end{itemize}
-In the above example, \verb!$USER! is an environment variable containing your ENCS username.
-
-% ------------------------------------------------------------------------------
-\subsubsection{Example of setting working directories for \tool{Python Modules}}
-
-By default, when installing a Python module, the \texttt{/tmp} directory is set as the temporary repository for file downloads.
-The size of the \texttt{/tmp} directory on \verb!speed-submit! is too small for PyTorch.
-To install a Python module:
-\begin{itemize}
-  \item
-  Create your own tmp directory in your \verb!speed-scratch! directory:
-  \begin{verbatim}
-  mkdir /speed-scratch/$USER/tmp
-  \end{verbatim}
-  \item
-  Use the tmp directory you created:
-  \begin{verbatim}
-  setenv TMPDIR /speed-scratch/$USER/tmp
-  \end{verbatim}
-  \item
-  Attempt the installation of PyTorch.
-\end{itemize}
-
-In the above example, \verb!$USER! is an environment variable containing your ENCS username.
-
-% ------------------------------------------------------------------------------
-\subsection{How do I check my job's status?}
-
-When a job with a job ID of 1234 is running, the status of that job can be tracked using \verb!`qstat -j 1234`!.
-Likewise, if the job is pending, the \verb!`qstat -j 1234`! command will report why the job is not scheduled or running.
-Once the job has finished, or has been killed, the \textbf{qacct} command must be used to query the job's status, e.g., \verb!`qacct -j [jobid]`!.
-
-% ------------------------------------------------------------------------------
-\subsection{Why is my job pending when nodes are empty?}
-
-\subsubsection{Disabled nodes}
-It is possible that one (or a number) of the Speed nodes are disabled. Nodes are disabled if they require maintenance.
-To verify whether Speed nodes are disabled, request the current list of disabled nodes from \tool{qstat}:
-
-\begin{verbatim}
-qstat -f -qs d
-queuename                      qtype resv/used/tot. load_avg arch     states
----------------------------------------------------------------------------------
-g.q@speed-05.encs.concordia.ca BIP   0/0/32         0.27     lx-amd64 d
----------------------------------------------------------------------------------
-s.q@speed-07.encs.concordia.ca BIP   0/0/32         0.01     lx-amd64 d
----------------------------------------------------------------------------------
-s.q@speed-10.encs.concordia.ca BIP   0/0/32         0.01     lx-amd64 d
----------------------------------------------------------------------------------
-s.q@speed-16.encs.concordia.ca BIP   0/0/32         0.02     lx-amd64 d
----------------------------------------------------------------------------------
-s.q@speed-19.encs.concordia.ca BIP   0/0/32         0.03     lx-amd64 d
----------------------------------------------------------------------------------
-s.q@speed-24.encs.concordia.ca BIP   0/0/32         0.01     lx-amd64 d
----------------------------------------------------------------------------------
-s.q@speed-36.encs.concordia.ca BIP   0/0/32         0.03     lx-amd64 d
-\end{verbatim}
-
-Note how all of the Speed nodes in the above list have a state of \textbf{d}, or disabled.
-
-Your job will run once the maintenance has been completed and the disabled nodes have been enabled.
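-
-If you only need a quick count of how many queue instances are currently
-disabled, you can, for example, pipe the same query through \tool{grep}:
-
-\begin{verbatim}
-qstat -f -qs d | grep -c encs.concordia.ca
-\end{verbatim}
-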
-% ------------------------------------------------------------------------------
-\subsubsection{Error in job submit request}
-It is possible that your job is pending because it requested resources that are not available within Speed.
-To verify why a pending job with job ID 1234 is not running, execute \verb!`qstat -j 1234`!
-and review the messages in the \textbf{scheduling info:} section.
+% TMP scheduler-specific section
+\input{scheduler-faq}
 
 % ------------------------------------------------------------------------------
 \section{Sister Facilities}