[manual] modularize into small scheduler-specific files.
Will help with merge and migration of generic and common parts with the scheduler-specific parts for now.
Showing 7 changed files with 925 additions and 900 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
% ------------------------------------------------------------------------------ | ||
\subsubsection{Directives} | ||
|
||
Directives are comments included at the beginning of a job script that set the shell | ||
and the options for the job scheduler. | ||
|
||
The shebang directive is always the first line of a script. In your job script, | ||
this directive sets which shell your script's commands will run in. On ``Speed'', | ||
we recommend that your script use a shell from the \texttt{/encs/bin} directory. | ||
|
||
To use the \texttt{tcsh} shell, start your script with: \verb|#!/encs/bin/tcsh| | ||
|
||
For \texttt{bash}, start with: \verb|#!/encs/bin/bash| | ||
|
||
Directives that start with \verb|#$| set the options for the cluster's | ||
``Altair Grid Engine (AGE)'' scheduler. The script template, \file{template.sh}, | ||
provides the essentials: | ||
|
||
\begin{verbatim} | ||
#$ -N <jobname> | ||
#$ -cwd | ||
#$ -m bea | ||
#$ -pe smp <corecount> | ||
#$ -l h_vmem=<memory>G | ||
\end{verbatim} | ||
|
||
Replace \verb+<jobname>+ with the name that you want your cluster job to have. | ||
The \option{-cwd} option makes the current working directory the ``job working | ||
directory'', and your standard output file will appear there. The \option{-m bea} | ||
option provides e-mail notifications (begin/end/abort). Replace \verb+<corecount>+ | ||
with the degree of (multithreaded) parallelism (i.e., cores) you attach to your | ||
job (up to 32); delete or comment out the \verb|#$ -pe smp| parameter if it | ||
is not relevant. Replace \verb+<memory>+ with the amount of memory (in GB) that | ||
you want your job to have (up to 500). All jobs MUST have a memory-space | ||
assignment. | ||
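As an illustration of the directives above, a filled-in job script might look like the following minimal sketch. The job name \verb+example_job+, the core count of 4, and the 8 GB memory request are placeholder values chosen for this example, not recommendations. The sketch also assumes that the scheduler sets the \verb+NSLOTS+ variable for parallel-environment jobs, as Grid Engine schedulers normally do.

```shell
#!/encs/bin/bash
#$ -N example_job
#$ -cwd
#$ -m bea
#$ -pe smp 4
#$ -l h_vmem=8G

# NSLOTS is normally set by the scheduler to the number of granted slots;
# fall back to 1 when the script is run outside the scheduler.
cores="${NSLOTS:-1}"

# Placeholder workload: report where the job runs and how many cores it has.
echo "Job running in $(pwd) with ${cores} core(s)"
```

The lines that start with \verb|#$| are comments to the shell, so the same file can also be run directly for a quick syntax check before it is submitted.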
|
||
If you are unsure about memory footprints, err on the side of assigning a generous | ||
memory space to your job so that it does not get prematurely terminated | ||
(the value given to \api{h\_vmem} is a hard memory ceiling). You can refine | ||
\api{h\_vmem} values for future jobs by monitoring the size of a job's active | ||
memory space on \texttt{speed-submit} with: | ||
|
||
\begin{verbatim} | ||
qstat -j <jobID> | grep maxvmem | ||
\end{verbatim} | ||
|
||
Memory-footprint values are also provided for completed jobs in the final | ||
e-mail notification (as, ``Max vmem''). | ||
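To reuse the reported peak value when choosing the next \verb+h_vmem+ request, the \verb+maxvmem+ field can be pulled out of the usage line. The sketch below assumes the usage line contains a \verb+maxvmem=<value>+ field; the sample line is illustrative only and is not output captured from Speed, and the helper name \verb+extract_maxvmem+ is hypothetical.

```shell
# Print the value of the maxvmem field from lines read on standard input.
extract_maxvmem() {
  sed -n 's/.*maxvmem=\([^,[:space:]]*\).*/\1/p'
}

# Illustrative usage line (assumed format, not real cluster output).
sample='usage         1:  cpu=00:01:12, mem=3.2 GBs, io=0.5, vmem=1.1G, maxvmem=1.4G'

peak=$(printf '%s\n' "$sample" | extract_maxvmem)
echo "Peak memory: $peak"
```

On the cluster, the same helper could be fed from \verb+qstat -j <jobID>+ in place of the sample line.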
|
||
\emph{Jobs that request a low-memory footprint are more likely to load on a busy | ||
cluster.} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
% ------------------------------------------------------------------------------ | ||
\subsubsection{Environment Set Up} | ||
\label{sect:envsetup} | ||
|
||
After creating an SSH connection to ``Speed'', you will need to source | ||
the ``Altair Grid Engine (AGE)'' scheduler's settings file. | ||
Sourcing the settings file will set the environment variables required to | ||
execute scheduler commands. | ||
|
||
Based on the UNIX shell type, choose one of the following commands to source | ||
the settings file. | ||
|
||
csh/\tool{tcsh}: | ||
\begin{verbatim} | ||
source /local/pkg/uge-8.6.3/root/default/common/settings.csh | ||
\end{verbatim} | ||
|
||
Bourne shell/\tool{bash}: | ||
\begin{verbatim} | ||
. /local/pkg/uge-8.6.3/root/default/common/settings.sh | ||
\end{verbatim} | ||
|
||
In order to set up the default ENCS bash shell, executing the following command | ||
is also required: | ||
\begin{verbatim} | ||
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile | ||
\end{verbatim} | ||
|
||
To verify that you have access to the scheduler commands, execute | ||
\texttt{qstat -f -u "*"}. If an error is returned, attempt sourcing | ||
the settings file again. | ||
|
||
The next step is to copy a job template to your home directory and to set up your | ||
cluster-specific storage. Execute the following command from within your | ||
home directory. (To move to your home directory, type \texttt{cd} at the Linux | ||
prompt and press \texttt{Enter}.) | ||
|
||
\begin{verbatim} | ||
cp /home/n/nul-uge/template.sh . && mkdir /speed-scratch/$USER | ||
\end{verbatim} | ||
|
||
\textbf{Tip:} Add the source command to your shell-startup script. | ||
|
||
\textbf{Tip:} The default shell for GCS ENCS users is \tool{tcsh}. | ||
If you would like to use \tool{bash}, please contact | ||
\texttt{rt-ex-hpc AT encs.concordia.ca}. | ||
|
||
For \textbf{new ENCS users}, and/or those who do not have a shell-startup script: | ||
based on your shell type, use one of the following commands to copy a start-up script | ||
from \texttt{nul-uge}'s home directory to your home directory. (To move to your home | ||
directory, type \tool{cd} at the Linux prompt and press \texttt{Enter}.) | ||
|
||
csh/\tool{tcsh}: | ||
\begin{verbatim} | ||
cp /home/n/nul-uge/.tcshrc . | ||
\end{verbatim} | ||
|
||
Bourne shell/\tool{bash}: | ||
\begin{verbatim} | ||
cp /home/n/nul-uge/.bashrc . | ||
\end{verbatim} | ||
|
||
Users who already have a shell-startup script should use a text editor, such as | ||
\tool{vim} or \tool{emacs}, to add the source request to their existing | ||
shell-startup environment (e.g., to the \file{.tcshrc} or \file{.bashrc} file in their home directory). | ||
|
||
csh/\tool{tcsh}: | ||
Sample \file{.tcshrc} file: | ||
\begin{verbatim} | ||
# Speed environment set up | ||
if ($HOSTNAME == speed-submit.encs.concordia.ca) then | ||
source /local/pkg/uge-8.6.3/root/default/common/settings.csh | ||
endif | ||
\end{verbatim} | ||
|
||
Bourne shell/\tool{bash}: | ||
Sample \file{.bashrc} file: | ||
\begin{verbatim} | ||
# Speed environment set up | ||
if [ "$HOSTNAME" = "speed-submit.encs.concordia.ca" ]; then | ||
. /local/pkg/uge-8.6.3/root/default/common/settings.sh | ||
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile | ||
fi | ||
\end{verbatim} | ||
|
||
Note that you will need to either log out and back in, or execute a new shell, | ||
for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to be applied | ||
(\textbf{important}). |
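After starting a new shell, a quick check such as the following sketch can confirm whether the scheduler commands are on your \verb+PATH+. It is written in Bourne-shell syntax and only tests for the presence of \verb+qstat+; the variable name \verb+scheduler_status+ is introduced here for illustration.

```shell
# Report whether the scheduler commands are reachable in the current shell.
if command -v qstat >/dev/null 2>&1; then
  scheduler_status="ok"
  echo "qstat found at: $(command -v qstat)"
else
  scheduler_status="missing"
  echo "qstat not found; source the settings file and try again" >&2
fi
```

If the check reports that \verb+qstat+ is missing, source the settings file again as described at the start of this section.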
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,203 @@ | ||
% ------------------------------------------------------------------------------ | ||
\section{Frequently Asked Questions} | ||
\label{sect:faqs} | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsection{Where do I learn about Linux?} | ||
|
||
All Speed users are expected to have a basic understanding of Linux and its commonly used commands. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection*{Software Carpentry} | ||
|
||
Software Carpentry provides free resources to learn software, including a workshop on the Unix shell. | ||
\url{https://software-carpentry.org/lessons/} | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection*{Udemy} | ||
|
||
There are a number of Udemy courses, including free ones, that will assist | ||
you in learning Linux. Active Concordia faculty, staff and students have | ||
access to Udemy courses. The course \textbf{Linux Mastery: Master the Linux | ||
Command Line in 11.5 Hours} is a good starting point for beginners. Visit | ||
\url{https://www.concordia.ca/it/services/udemy.html} to learn how Concordians | ||
may access Udemy. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsection{How to use the ``bash shell'' on Speed?} | ||
|
||
This section describes how to use the ``bash shell'' on Speed. Review | ||
\xs{sect:envsetup} to ensure that your bash environment is set up. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{How do I set bash as my login shell?} | ||
|
||
In order to set your login shell to bash on Speed, your login shell on all GCS servers must be changed to bash. | ||
To make this change, create a ticket with the Service Desk (or email help at concordia.ca) to request that bash become your default login shell for your ENCS user account on all GCS servers. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{How do I move into a bash shell on Speed?} | ||
|
||
To move to the bash shell, type \textbf{bash} at the command prompt. | ||
For example: | ||
\begin{verbatim} | ||
[speed-submit] [/home/a/a_user] > bash | ||
bash-4.4$ echo $0 | ||
bash | ||
\end{verbatim} | ||
|
||
Note how the command prompt changed from \verb![speed-submit] [/home/a/a_user] >! to \verb!bash-4.4$! after entering the bash shell. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{How do I run scripts written in bash on Speed?} | ||
|
||
To execute bash scripts on Speed: | ||
\begin{enumerate} | ||
\item | ||
Ensure that the shebang of your bash job script is \verb|#!/encs/bin/bash| | ||
\item | ||
Use the qsub command to submit your job script to the scheduler. | ||
\end{enumerate} | ||
|
||
The Speed GitHub contains a sample \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{bash job script}. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsection{How to resolve ``Disk quota exceeded'' errors?} | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{Probable Cause} | ||
|
||
The \texttt{``Disk quota exceeded''} error occurs when your application has run out of disk space to write to. On Speed, this error can be returned when: | ||
\begin{enumerate} | ||
\item | ||
The \texttt{/tmp} directory on the speed node your application is running on is full and cannot be written to. | ||
\item | ||
Your NFS-provided home is full and cannot be written to. | ||
\end{enumerate} | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{Possible Solutions} | ||
|
||
\begin{enumerate} | ||
\item | ||
Use the \textbf{-cwd} job script option to set the directory that the job | ||
script is submitted from as the \texttt{job working directory}. The | ||
\texttt{job working directory} is the directory that the job will write output files in. | ||
\item | ||
The use of local disk space is generally recommended for IO-intensive operations. However, as the size of \texttt{/tmp} on Speed nodes | ||
is \texttt{1GB}, it can be necessary for scripts to store temporary data | ||
elsewhere. | ||
Review the documentation for each module called within your script to | ||
determine how to set working directories for that application. | ||
The basic steps for this solution are: | ||
\begin{itemize} | ||
\item | ||
Review the documentation on how to set working directories for | ||
each module called by the job script. | ||
\item | ||
Create a working directory in speed-scratch for output files. | ||
For example, this command will create a subdirectory called \textbf{output} | ||
in your \verb!speed-scratch! directory: | ||
\begin{verbatim} | ||
mkdir -m 750 /speed-scratch/$USER/output | ||
\end{verbatim} | ||
\item | ||
To create a subdirectory for recovery files: | ||
\begin{verbatim} | ||
mkdir -m 750 /speed-scratch/$USER/recovery | ||
\end{verbatim} | ||
\item | ||
Update the job script to write output to the subdirectories you created in your \verb!speed-scratch! directory, e.g., \verb!/speed-scratch/$USER/output!. | ||
\end{itemize} | ||
\end{enumerate} | ||
In the above example, \verb!$USER! is an environment variable containing your ENCS username. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{Example of setting working directories for \tool{COMSOL}} | ||
|
||
\begin{itemize} | ||
\item | ||
Create directories for recovery, temporary, and configuration files. | ||
For example, to create these directories for your GCS ENCS user account: | ||
\begin{verbatim} | ||
mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config} | ||
\end{verbatim} | ||
\item | ||
Add the following command switches to the COMSOL command to use the | ||
directories created above: | ||
\begin{verbatim} | ||
-recoverydir /speed-scratch/$USER/comsol/recovery | ||
-tmpdir /speed-scratch/$USER/comsol/tmp | ||
-configuration /speed-scratch/$USER/comsol/config | ||
\end{verbatim} | ||
\end{itemize} | ||
In the above example, \verb!$USER! is an environment variable containing your ENCS username. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{Example of setting working directories for \tool{Python Modules}} | ||
|
||
By default, when a Python module is added, the \texttt{/tmp} directory is used as the temporary repository for file downloads. | ||
The size of the \texttt{/tmp} directory on \verb!speed-submit! is too small for PyTorch. | ||
To add a Python module: | ||
\begin{itemize} | ||
\item | ||
Create your own tmp directory in your \verb!speed-scratch! directory | ||
\begin{verbatim} | ||
mkdir /speed-scratch/$USER/tmp | ||
\end{verbatim} | ||
\item | ||
Use the tmp directory you created | ||
\begin{verbatim} | ||
setenv TMPDIR /speed-scratch/$USER/tmp | ||
\end{verbatim} | ||
\item | ||
Attempt the installation of pytorch | ||
\end{itemize} | ||
|
||
In the above example, \verb!$USER! is an environment variable containing your ENCS username. | ||
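The \verb+setenv+ command above is \tool{tcsh} syntax. For \tool{bash} users, an equivalent can be sketched as below; the helper name \verb+use_scratch_tmp+ is hypothetical and is only a convenience wrapper around \verb+mkdir+ and \verb+export+.

```shell
# Create <base>/tmp if needed and point TMPDIR at it for the current shell.
use_scratch_tmp() {
  base_dir="$1"
  mkdir -p "$base_dir/tmp" && export TMPDIR="$base_dir/tmp"
}

# Example call on Speed (commented out so the sketch has no side effects):
# use_scratch_tmp "/speed-scratch/$USER"
```

Because the function exports \verb+TMPDIR+, it must be called in the same shell session in which the Python module is then installed.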
|
||
% ------------------------------------------------------------------------------ | ||
\subsection{How do I check my job's status?} | ||
|
||
When a job with a job id of 1234 is running, the status of that job can be tracked using \verb!`qstat -j 1234`!. | ||
Likewise, if the job is pending, the \verb!`qstat -j 1234`! command will report as to why the job is not scheduled or running. | ||
Once the job has finished, or has been killed, the \textbf{qacct} command must be used to query the job's status, e.g., \verb!`qacct -j [jobid]`!. | ||
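The two commands can be combined into a single helper, sketched below in Bourne-shell syntax. The function name \verb+job_status+ is hypothetical, and the sketch assumes that \verb+qstat -j+ exits with a non-zero status once the scheduler no longer knows the job; if that assumption does not hold on your system, run the two commands separately.

```shell
# Query a running or pending job first; fall back to accounting data
# when qstat no longer reports the job (assumed to exit non-zero).
job_status() {
  job_id="$1"
  qstat -j "$job_id" 2>/dev/null || qacct -j "$job_id"
}

# Example call (commented out; requires the scheduler commands):
# job_status 1234
```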
|
||
% ------------------------------------------------------------------------------ | ||
\subsection{Why is my job pending when nodes are empty?} | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{Disabled nodes} | ||
|
||
It is possible that one or more of the Speed nodes are disabled. Nodes are disabled if they require maintenance. | ||
To verify whether Speed nodes are disabled, request the current list of disabled nodes from qstat. | ||
|
||
\begin{verbatim} | ||
qstat -f -qs d | ||
queuename qtype resv/used/tot. load_avg arch states | ||
--------------------------------------------------------------------------------- | ||
[email protected] BIP 0/0/32 0.27 lx-amd64 d | ||
--------------------------------------------------------------------------------- | ||
[email protected] BIP 0/0/32 0.01 lx-amd64 d | ||
--------------------------------------------------------------------------------- | ||
[email protected] BIP 0/0/32 0.01 lx-amd64 d | ||
--------------------------------------------------------------------------------- | ||
[email protected] BIP 0/0/32 0.02 lx-amd64 d | ||
--------------------------------------------------------------------------------- | ||
[email protected] BIP 0/0/32 0.03 lx-amd64 d | ||
--------------------------------------------------------------------------------- | ||
[email protected] BIP 0/0/32 0.01 lx-amd64 d | ||
--------------------------------------------------------------------------------- | ||
[email protected] BIP 0/0/32 0.03 lx-amd64 d | ||
\end{verbatim} | ||
|
||
Note how all of the Speed nodes in the above list have a state of \textbf{d}, or disabled. | ||
|
||
Your job will run once the maintenance has been completed and the disabled nodes have been enabled. | ||
|
||
% ------------------------------------------------------------------------------ | ||
\subsubsection{Error in job submit request} | ||
|
||
It is possible that your job is pending because it requested resources that are not available within Speed. | ||
To verify why a pending job with job id 1234 is not running, execute \verb!`qstat -j 1234`! | ||
and review the messages in the \textbf{scheduling info:} section. |