[manual] modularize into small scheduler-specific files.
Will help with merge and migration of generic and common
parts with the scheduler-specific parts for now.
smokhov committed Oct 4, 2023
1 parent 2d703eb commit c8a7573
Showing 7 changed files with 925 additions and 900 deletions.
51 changes: 51 additions & 0 deletions doc/scheduler-directives.tex
@@ -0,0 +1,51 @@
% ------------------------------------------------------------------------------
\subsubsection{Directives}

Directives are comments included at the beginning of a job script that set the shell
and the options for the job scheduler.

The shebang directive is always the first line of a script. In your job script,
this directive sets which shell your script's commands will run in. On ``Speed'',
we recommend that your script use a shell from the \texttt{/encs/bin} directory.

To use the \texttt{tcsh} shell, start your script with: \verb|#!/encs/bin/tcsh|

For \texttt{bash}, start with: \verb|#!/encs/bin/bash|

Directives that start with \verb|#$| set the options for the cluster's
``Altair Grid Engine (AGE)'' scheduler. The script template, \file{template.sh},
provides the essentials:

\begin{verbatim}
#$ -N <jobname>
#$ -cwd
#$ -m bea
#$ -pe smp <corecount>
#$ -l h_vmem=<memory>G
\end{verbatim}

Replace \verb+<jobname>+ with the name that you want your cluster job to have.
The \option{-cwd} option makes the current working directory the ``job working directory'',
and your standard output file will appear there. The \option{-m bea} option provides e-mail
notifications (begin/end/abort). Replace \verb+<corecount>+ with the degree of
(multithreaded) parallelism (i.e., cores) you attach to your job (up to 32);
be sure to delete or comment out the \verb| #$ -pe smp | parameter if it
is not relevant. Replace \verb+<memory>+ with the value (in GB) that you want
your job's memory space to be (up to 500); all jobs MUST have a memory-space
assignment.
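
For illustration, a filled-in version of these directives for a hypothetical
8-core job named \texttt{myjob} with a 32~GB memory ceiling might look as follows
(the job name, core count, and memory value are placeholders, not recommendations):

\begin{verbatim}
# example values only -- adjust for your own job
#$ -N myjob
#$ -cwd
#$ -m bea
#$ -pe smp 8
#$ -l h_vmem=32G
\end{verbatim}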

If you are unsure about memory footprints, err on the side of assigning a generous
memory space to your job so that it does not get prematurely terminated
(the value given to \api{h\_vmem} is a hard memory ceiling). You can refine
\api{h\_vmem} values for future jobs by monitoring the size of a job's active
memory space on \texttt{speed-submit} with:

\begin{verbatim}
qstat -j <jobID> | grep maxvmem
\end{verbatim}

Memory-footprint values are also provided for completed jobs in the final
e-mail notification (as, ``Max vmem'').

\emph{Jobs that request a low-memory footprint are more likely to load on a busy
cluster.}
88 changes: 88 additions & 0 deletions doc/scheduler-env.tex
@@ -0,0 +1,88 @@
% ------------------------------------------------------------------------------
\subsubsection{Environment Set Up}
\label{sect:envsetup}

After creating an SSH connection to ``Speed'', you will need to source
the ``Altair Grid Engine (AGE)'' scheduler's settings file.
Sourcing the settings file will set the environment variables required to
execute scheduler commands.

Based on the UNIX shell type, choose one of the following commands to source
the settings file.

csh/\tool{tcsh}:
\begin{verbatim}
source /local/pkg/uge-8.6.3/root/default/common/settings.csh
\end{verbatim}

Bourne shell/\tool{bash}:
\begin{verbatim}
. /local/pkg/uge-8.6.3/root/default/common/settings.sh
\end{verbatim}

To set up the default ENCS bash shell, you must also execute the following
command:
\begin{verbatim}
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
\end{verbatim}

To verify that you have access to the scheduler commands, execute
\texttt{qstat -f -u "*"}. If an error is returned, try sourcing
the settings file again.
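
For example, a successful set-up will list the cluster's queues and their states
rather than returning an error:

\begin{verbatim}
qstat -f -u "*"
\end{verbatim}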

The next step is to copy a job template to your home directory and to set up your
cluster-specific storage. Execute the following command from within your
home directory. (To move to your home directory, type \texttt{cd} at the Linux
prompt and press \texttt{Enter}.)

\begin{verbatim}
cp /home/n/nul-uge/template.sh . && mkdir /speed-scratch/$USER
\end{verbatim}

\textbf{Tip:} Add the source command to your shell-startup script.

\textbf{Tip:} The default shell for GCS ENCS users is \tool{tcsh}.
If you would like to use \tool{bash}, please contact
\texttt{rt-ex-hpc AT encs.concordia.ca}.

For \textbf{new ENCS users}, and/or those who do not have a shell-startup script,
use one of the following commands, based on your shell type, to copy a startup script
from \texttt{nul-uge}'s home directory to your home directory. (To move to your home
directory, type \texttt{cd} at the Linux prompt and press \texttt{Enter}.)

csh/\tool{tcsh}:
\begin{verbatim}
cp /home/n/nul-uge/.tcshrc .
\end{verbatim}

Bourne shell/\tool{bash}:
\begin{verbatim}
cp /home/n/nul-uge/.bashrc .
\end{verbatim}

If you already have a shell-startup script, use a text editor, such as
\tool{vim} or \tool{emacs}, to add the source request to your existing
shell-startup file (e.g., the \file{.tcshrc} or \file{.bashrc} file in your home directory).

csh/\tool{tcsh}:
Sample \file{.tcshrc} file:
\begin{verbatim}
# Speed environment set up
if ($HOSTNAME == speed-submit.encs.concordia.ca) then
source /local/pkg/uge-8.6.3/root/default/common/settings.csh
endif
\end{verbatim}

Bourne shell/\tool{bash}:
Sample \file{.bashrc} file:
\begin{verbatim}
# Speed environment set up
if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
. /local/pkg/uge-8.6.3/root/default/common/settings.sh
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
fi
\end{verbatim}

Note that you will need to either log out and back in, or execute a new shell,
for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to be applied
(\textbf{important}).
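
For instance, assuming you use \tool{tcsh} and have just updated your \file{.tcshrc},
one way to start a new shell that picks up the changes is to replace the current one
(the analogous \tool{bash} command would be \texttt{exec bash}):

\begin{verbatim}
# start a fresh tcsh, which re-reads ~/.tcshrc
exec tcsh
\end{verbatim}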
203 changes: 203 additions & 0 deletions doc/scheduler-faq.tex
@@ -0,0 +1,203 @@
% ------------------------------------------------------------------------------
\section{Frequently Asked Questions}
\label{sect:faqs}

% ------------------------------------------------------------------------------
\subsection{Where do I learn about Linux?}

All Speed users are expected to have a basic understanding of Linux and its commonly used commands.

% ------------------------------------------------------------------------------
\subsubsection*{Software Carpentry}

Software Carpentry provides free resources to learn software, including a workshop on the Unix shell.
\url{https://software-carpentry.org/lessons/}

% ------------------------------------------------------------------------------
\subsubsection*{Udemy}

There are a number of Udemy courses, including free ones, that will assist
you in learning Linux. Active Concordia faculty, staff, and students have
access to Udemy courses; for example, \textbf{Linux Mastery: Master the Linux
Command Line in 11.5 Hours} is a good starting point for beginners. Visit
\url{https://www.concordia.ca/it/services/udemy.html} to learn how Concordians
may access Udemy.

% ------------------------------------------------------------------------------
\subsection{How to use the ``bash shell'' on Speed?}

This section describes how to use the ``bash shell'' on Speed. Review
\xs{sect:envsetup} to ensure that your bash environment is set up.

% ------------------------------------------------------------------------------
\subsubsection{How do I set bash as my login shell?}

In order to set your login shell to bash on Speed, your login shell on all GCS servers must be changed to bash.
To make this change, create a ticket with the Service Desk (or email help at concordia.ca) to request that bash become your default login shell for your ENCS user account on all GCS servers.

% ------------------------------------------------------------------------------
\subsubsection{How do I move into a bash shell on Speed?}

To move to the bash shell, type \textbf{bash} at the command prompt.
For example:
\begin{verbatim}
[speed-submit] [/home/a/a_user] > bash
bash-4.4$ echo $0
bash
\end{verbatim}

Note how the command prompt changed from \verb![speed-submit] [/home/a/a_user] >! to \verb!bash-4.4$! after entering the bash shell.

% ------------------------------------------------------------------------------
\subsubsection{How do I run scripts written in bash on Speed?}

To execute bash scripts on Speed:
\begin{enumerate}
\item
Ensure that the shebang of your bash job script is \verb!#!/encs/bin/bash!
\item
Use the qsub command to submit your job script to the scheduler.
\end{enumerate}

The Speed GitHub contains a sample \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{bash job script}.
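
For example, assuming the sample script has been copied into your current
directory as \file{bash.sh}, it could be submitted with:

\begin{verbatim}
qsub ./bash.sh
\end{verbatim}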

% ------------------------------------------------------------------------------
\subsection{How to resolve ``Disk quota exceeded'' errors?}

% ------------------------------------------------------------------------------
\subsubsection{Probable Cause}

The \texttt{``Disk quota exceeded''} error occurs when your application has run out of disk space to write to. On Speed, this error can be returned when:
\begin{enumerate}
\item
The \texttt{/tmp} directory on the Speed node your application is running on is full and cannot be written to.
\item
Your NFS-provided home is full and cannot be written to.
\end{enumerate}

% ------------------------------------------------------------------------------
\subsubsection{Possible Solutions}

\begin{enumerate}
\item
Use the \textbf{-cwd} job script option to set the directory from which the job
script was submitted as the \texttt{job working directory}. The
\texttt{job working directory} is the directory in which the job will write output files.
\item
Use of local disk space is generally recommended for I/O-intensive operations. However, as the size of \texttt{/tmp} on Speed nodes
is \texttt{1GB}, it can be necessary for scripts to store temporary data
elsewhere.
Review the documentation for each module called within your script to
determine how to set working directories for that application.
The basic steps for this solution are:
\begin{itemize}
\item
Review the documentation on how to set working directories for
each module called by the job script.
\item
Create a working directory in speed-scratch for output files.
For example, this command will create a subdirectory called \textbf{output}
in your \verb!speed-scratch! directory:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/output
\end{verbatim}
\item
To create a subdirectory for recovery files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/recovery
\end{verbatim}
\item
Update the job script to write output to the subdirectories you created in your \verb!speed-scratch! directory, e.g., \verb!/speed-scratch/$USER/output! (see the sketch after this list).
\end{itemize}
\end{enumerate}
In the above example, \verb!$USER! is an environment variable containing your ENCS username.
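
As a minimal sketch of that last step, a job script could redirect a program's
output into the \verb!output! subdirectory created above (here \texttt{myprogram}
and \texttt{result.log} are placeholder names, not actual files on Speed):

\begin{verbatim}
# placeholder program and file names -- adjust for your own job
./myprogram > /speed-scratch/$USER/output/result.log
\end{verbatim}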

% ------------------------------------------------------------------------------
\subsubsection{Example of setting working directories for \tool{COMSOL}}

\begin{itemize}
\item
Create directories for recovery, temporary, and configuration files.
For example, to create these directories for your GCS ENCS user account:
\begin{verbatim}
mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
\end{verbatim}
\item
Add the following command switches to the COMSOL command to use the
directories created above:
\begin{verbatim}
-recoverydir /speed-scratch/$USER/comsol/recovery
-tmpdir /speed-scratch/$USER/comsol/tmp
-configuration /speed-scratch/$USER/comsol/config
\end{verbatim}
\end{itemize}
In the above example, \verb!$USER! is an environment variable containing your ENCS username.

% ------------------------------------------------------------------------------
\subsubsection{Example of setting working directories for \tool{Python Modules}}

By default, when adding (installing) a Python module, the \texttt{/tmp} directory is used as the temporary repository for downloaded files.
The size of the \texttt{/tmp} directory on \verb!speed-submit! is too small for PyTorch.
To add such a Python module:
\begin{itemize}
\item
Create your own tmp directory in your \verb!speed-scratch! directory
\begin{verbatim}
mkdir /speed-scratch/$USER/tmp
\end{verbatim}
\item
Use the tmp directory you created
\begin{verbatim}
setenv TMPDIR /speed-scratch/$USER/tmp
\end{verbatim}
\item
Attempt the installation of PyTorch (a sketch follows this list)
\end{itemize}

In the above example, \verb!$USER! is an environment variable containing your ENCS username.
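
Note that \texttt{setenv} is \tool{csh}/\tool{tcsh} syntax. If you work in \tool{bash},
the equivalent assignment, followed by an illustrative (non-authoritative) PyTorch
installation attempt with \texttt{pip}, would look like this:

\begin{verbatim}
# bash equivalent of the setenv command above
export TMPDIR=/speed-scratch/$USER/tmp
# one possible way to attempt the PyTorch installation
pip install --user torch
\end{verbatim}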

% ------------------------------------------------------------------------------
\subsection{How do I check my job's status?}

When a job with a job ID of 1234 is running, the status of that job can be tracked using \verb!`qstat -j 1234`!.
Likewise, if the job is pending, the \verb!`qstat -j 1234`! command will report why the job is not scheduled or running.
Once the job has finished, or has been killed, the \textbf{qacct} command must be used to query the job's status, e.g., \verb!`qacct -j [jobid]`!.
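
For example, to query the accounting record of a finished job with ID 1234:

\begin{verbatim}
qacct -j 1234
\end{verbatim}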

% ------------------------------------------------------------------------------
\subsection{Why is my job pending when nodes are empty?}

% ------------------------------------------------------------------------------
\subsubsection{Disabled nodes}

It is possible that one or more of the Speed nodes are disabled. Nodes are disabled if they require maintenance.
To verify whether Speed nodes are disabled, request the current list of disabled nodes from \tool{qstat}.

\begin{verbatim}
qstat -f -qs d
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
[email protected] BIP 0/0/32 0.27 lx-amd64 d
---------------------------------------------------------------------------------
[email protected] BIP 0/0/32 0.01 lx-amd64 d
---------------------------------------------------------------------------------
[email protected] BIP 0/0/32 0.01 lx-amd64 d
---------------------------------------------------------------------------------
[email protected] BIP 0/0/32 0.02 lx-amd64 d
---------------------------------------------------------------------------------
[email protected] BIP 0/0/32 0.03 lx-amd64 d
---------------------------------------------------------------------------------
[email protected] BIP 0/0/32 0.01 lx-amd64 d
---------------------------------------------------------------------------------
[email protected] BIP 0/0/32 0.03 lx-amd64 d
\end{verbatim}

Note how all of the Speed nodes in the above list have a state of \textbf{d}, or disabled.

Your job will run once the maintenance has been completed and the disabled nodes have been enabled.

% ------------------------------------------------------------------------------
\subsubsection{Error in job submit request}

It is possible that your job is pending because it requested resources that are not available within Speed.
To verify why a pending job with job ID 1234 is not running, execute \verb!`qstat -j 1234`!
and review the messages in the \textbf{scheduling info:} section.