diff --git a/README.md b/README.md
index dff791f..9916b8e 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,10 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs
 * [`src/`](src/) -- sample job scripts
 * [`doc/`](doc/) -- user manual sources
 
+## Software List
+
+* [EL7 and EL9 Software List](software-list.md) on Speed
+
 ## Contributing and TODO
 
 * [Public issue tracker](https://github.com/NAG-DevOps/speed-hpc/issues)
@@ -34,6 +38,7 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs
 
 ### Conferences
 
+* Tariq Daradkeh, Gillian Roper, Carlos Alarcon Meza, and Serguei Mokhov. 2024. **HPC jobs classification and resource prediction to minimize job failures.** In International Conference on Computer Systems and Technologies 2024 (CompSysTech ’24), June 2024. ACM, New York, NY, USA. [DOI: 10.1145/3674912.3674914](https://doi.org/10.1145/3674912.3674914)
 * Serguei Mokhov, Jonathan Llewellyn, Carlos Alarcon Meza, Tariq Daradkeh, and Gillian Roper. 2023. **The use of Containers in OpenGL, ML and HPC for Teaching and Research Support.** In ACM SIGGRAPH 2023 Posters (SIGGRAPH '23). Association for Computing Machinery, New York, NY, USA, Article 49, 1–2. [DOI: 10.1145/3588028.3603676](https://doi.org/10.1145/3588028.3603676)
 
 ### Related Repositories
@@ -44,13 +49,19 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs
 * https://github.com/NAG-DevOps/openiss-reid-tfk
 * https://github.com/NAG-DevOps/kg-recommendation-framework
 
-### Technical
+### Educational
 
-* [Slurm Workload Manager](https://en.wikipedia.org/wiki/Slurm_Workload_Manager)
 * [Linux and other tutorials from Software Carpentry](https://software-carpentry.org/lessons/)
 * [Digital Research Alliance of Canada SLURM Examples](https://docs.alliancecan.ca/wiki/Running_jobs)
 * Concordia's subscription to [Udemy resources](https://www.concordia.ca/it/services/udemy.html)
 
+### Technical
+
+* [Slurm Workload Manager](https://en.wikipedia.org/wiki/Slurm_Workload_Manager)
+* [NVIDIA A100](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf)
+* [NVIDIA V100](https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf)
 * [NVIDIA Tesla P6](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/Tesla-P6-Product-Brief.pdf)
+* [NVIDIA RTX 6000 Ada Generation](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/rtx-6000/proviz-print-rtx6000-datasheet-web-2504660.pdf)
 * [AMD Tonga FirePro S7100X](https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#FirePro_Server_Series_(S000x/Sxx_000))
 
 ### Legacy
diff --git a/doc/Makefile b/doc/Makefile
index d73d851..997718b 100644
--- a/doc/Makefile
+++ b/doc/Makefile
@@ -24,7 +24,7 @@ all: $(DELIVERABLE).pdf
 #all: arxiv acm
 #all: arxiv
 
-$(DELIVERABLE).pdf: $(DELIVERABLE).tex $(DELIVERABLE).bib Makefile commands.tex
+$(DELIVERABLE).pdf: $(DELIVERABLE).tex $(DELIVERABLE).bib Makefile commands.tex software-list.tex
 	@echo "Compiling *.tex files..."
 	pdflatex $(PDFLATEXFLAGS) $(DELIVERABLE)
 	@echo "Compiling bibliography..."
@@ -53,6 +53,12 @@ $(DELIVERABLE)-arxiv.tex: to-arxiv.pl $(DELIVERABLE).tex
 	./to-arxiv.pl < $(DELIVERABLE).tex > $(DELIVERABLE)-arxiv.tex
 	perl -pi -e 's/\{content\}/\{content-arxiv\}/g' $(DELIVERABLE)-arxiv.tex
 
+software-list: software-list.tex ../software-list.md
+software-list.tex ../software-list.md: generate-software-list.sh
+	@echo "Generating software list. Don't forget to run make afterwards to recompile the manual."
+	./generate-software-list.sh
+	mv -f software-list.md ..
+
 acm: $(DELIVERABLE)-acm.pdf
 
 $(DELIVERABLE)-acm.pdf: $(DELIVERABLE)-acm.tex content-acm.tex Makefile
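For orientation, a minimal sketch of the intended workflow for the new `software-list` target; the target and file names come straight from the Makefile hunk above, while the shell session itself is illustrative:

    # Regenerate the software list, then recompile the manual (run from the repo root).
    cd doc
    make software-list   # runs generate-software-list.sh and moves software-list.md to the repo root
    make                 # recompiles the manual so the fresh software-list.tex is picked up

Note the design choice: the `.md` copy lives at the repository root (where the README links to it), while the `.tex` copy stays in `doc/` as a build input of the manual.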
diff --git a/doc/generate-software-list.sh b/doc/generate-software-list.sh
new file mode 100755
index 0000000..378b900
--- /dev/null
+++ b/doc/generate-software-list.sh
@@ -0,0 +1,79 @@
+#!/encs/bin/bash
+
+# Generates .tex and .md versions of the software list
+# Serguei Mokhov
+
+GENERATED_ON=$(date)
+OUTFILE="software-list"
+
+# Generate the LaTeX version first
+cat > "$OUTFILE.tex" << LATEX_HEADER
+% -----------------------------------------------------------------------------
+% $0
+\section{Software Installed On Speed}
+\label{sect:software-details}
+
+This section is generated by a script; it was last updated on \textit{$GENERATED_ON}.
+We have two major software trees: Scientific Linux 7 (EL7), which is
+outgoing, and AlmaLinux 9 (EL9). After the major synchronization of software
+packages is complete, we will stop maintaining the EL7 tree and
+will migrate the remaining nodes to EL9.
+
+Use \option{--constraint=el7} to select EL7 nodes for their software
+packages. Conversely, use \option{--constraint=el9} for the EL9-only
+software. These options are used as part of your job parameters,
+either in \api{\#SBATCH} directives or on the command line.
+
+\noindent
+\textbf{NOTE:} this list does not (yet) include packages installed directly on the OS.
+
+% -----------------------------------------------------------------------------
+\subsection{EL7}
+\label{sect:software-el7}
+
+Not all packages are intended for HPC, but the common tree is available
+on Speed as well as on the teaching labs' desktops.
+
+\scriptsize
+\begin{multicols}{3}
+\begin{itemize}
+LATEX_HEADER
+
+# One \item per package directory; HIDE entries are internal and excluded
+ls -1 /encs/ArchDep/x86_64.EL7/pkg/ \
+	| grep -Ev HIDE \
+	| sed 's/^/\\item \\verb|/g' \
+	| sed 's/$/|/g' \
+	>> "$OUTFILE.tex"
+
+cat >> "$OUTFILE.tex" << LATEX_EL9_HEADER
+\end{itemize}
+\end{multicols}
+\normalsize
+
+% -----------------------------------------------------------------------------
+\subsection{EL9}
+\label{sect:software-el9}
+
+\scriptsize
+\begin{multicols}{3}
+\begin{itemize}
+LATEX_EL9_HEADER
+
+ls -1 /encs/ArchDep/x86_64.EL9/pkg/ \
+	| grep -Ev HIDE \
+	| sed 's/^/\\item \\verb|/g' \
+	| sed 's/$/|/g' \
+	>> "$OUTFILE.tex"
+
+cat >> "$OUTFILE.tex" << LATEX_FOOTER
+\end{itemize}
+\end{multicols}
+\normalsize
+
+% EOF
+LATEX_FOOTER
+
+# Get the .md version of the same from LaTeX
+pandoc -s "$OUTFILE.tex" -o "$OUTFILE.md"
+
+# EOF
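As a sanity check of the `ls | grep | sed` pipeline the script relies on, here is what a single line of output looks like; the package directory name `foo-1.2` is invented for illustration:

    # Feed a hypothetical package directory name through the same pipeline:
    echo "foo-1.2" \
        | grep -Ev HIDE \
        | sed 's/^/\\item \\verb|/g' \
        | sed 's/$/|/g'
    # Prints: \item \verb|foo-1.2|
    # i.e., one LaTeX itemize entry per package directory, typeset verbatim.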
diff --git a/doc/images/speed-architecture-full.png b/doc/images/speed-architecture-full.png
new file mode 100644
index 0000000..8c3d24a
Binary files /dev/null and b/doc/images/speed-architecture-full.png differ
diff --git a/doc/scheduler-directives.tex b/doc/scheduler-directives.tex
index 099ab37..5f797e6 100644
--- a/doc/scheduler-directives.tex
+++ b/doc/scheduler-directives.tex
@@ -1,115 +1,96 @@
-% ------------------------------------------------------------------------------
-\subsubsection{Directives}
-\label{sect:directives}
+% 2.2.1 Directives
+% -------------------
+% TMP scheduler-specific section
 
 Directives are comments included at the beginning of a job script that set the shell
-and the options for the job scheduler.
+and the options for the job scheduler.
 %
 The shebang directive is always the first line of a script. In your job script,
 this directive sets which shell your script's commands will run in. On ``Speed'',
-we recommend that your script use a shell from the \texttt{/encs/bin} directory.
+we recommend that your script use a shell from the \texttt{/encs/bin} directory.\\
 To use the \texttt{tcsh} shell, start your script with \verb|#!/encs/bin/tcsh|.
-%
-For \texttt{bash}, start with \verb|#!/encs/bin/bash|.
-%
-Directives that start with \verb|#SBATCH|, set the options for the cluster's
-Slurm job scheduler. The script template, \texttt{template.sh},
-provides the essentials:
+For \texttt{bash}, start with \verb|#!/encs/bin/bash|.\\
+
+Directives that start with \verb|#SBATCH| set the options for the cluster's
+SLURM job scheduler. The following provides an example of some essential directives:
 
-%\begin{verbatim}
-%#$ -N <jobname>
-%#$ -cwd
-%#$ -m bea
-%#$ -pe smp <corecount>
-%#$ -l h_vmem=<memory>G
-%\end{verbatim}
+\small
 \begin{verbatim}
-#SBATCH --job-name=<jobname>         ## or -J. Give the job a name
-#SBATCH --mail-type=<type>           ## Set type of email notifications
-#SBATCH --chdir=<directory>          ## or -D, Set working directory where output files will go
-#SBATCH --nodes=1                    ## or -N, Node count required for the job
-#SBATCH --ntasks=1                   ## or -n, Number of tasks to be launched
-#SBATCH --cpus-per-task=<corecount>  ## or -c, Core count requested, e.g. 8 cores
-#SBATCH --mem=<memory>               ## Assign memory for this job, e.g., 32G memory per node
+    #SBATCH --job-name=<jobname>          ## or -J. Give the job a name
+    #SBATCH --mail-type=<type>            ## set type of email notifications
+    #SBATCH --chdir=<directory>           ## or -D, set working directory for the job
+    #SBATCH --nodes=1                     ## or -N, node count required for the job
+    #SBATCH --ntasks=1                    ## or -n, number of tasks to be launched
+    #SBATCH --cpus-per-task=<corecount>   ## or -c, core count requested, e.g. 8 cores
+    #SBATCH --mem=<memory>                ## assign memory for this job,
+                                          ## e.g., 32G memory per node
 \end{verbatim}
+\normalsize
 
-Replace the following to adjust the job script for your project(s)
-\begin{enumerate}
-  \item \verb+<jobname>+ with a job name for the job
-  \item \verb+<directory>+ with the fullpath to your job's working directory, e.g., where your code,
-source files and where the standard output files will be written to. By default, \verb+--chdir+
-sets the current directory as the job's working directory
-  \item \verb+<type>+ with the type of e-mail notifications you wish to receive. Valid options are: NONE, BEGIN, END, FAIL, REQUEUE, ALL
-  \item \verb+<corecount>+ with the degree of multithreaded parallelism (i.e., cores) allocated to your job. Up to 32 by default.
-  \item \verb+<memory>+ with the amount of memory, in GB, that you want to be allocated per node. Up to 500 depending on the node.
-  NOTE: All jobs MUST set a value for the \verb|--mem| option.
-\end{enumerate}
-
-Example with short option equivalents:
+\noindent Replace the following to adjust the job script for your project(s),
+as in the example script after this list:
+\begin{itemize}
+  \item \verb+<jobname>+ with a job name for the job. This name will be displayed in the job queue.
+  \item \verb+<directory>+ with the full path to your job's working directory, e.g., where your code
+    and source files are located and where the standard output files will be written.
+    By default, \verb+--chdir+ sets the current directory as the job's working directory.
+  \item \verb+<type>+ with the type of e-mail notifications you wish to receive.
+    Valid options are: NONE, BEGIN, END, FAIL, REQUEUE, ALL.
+  \item \verb+<corecount>+ with the degree of multithreaded parallelism (i.e., cores) allocated to your job. Up to 32 by default.
+  \item \verb+<memory>+ with the amount of memory, in GB, that you want to be allocated per node. Up to 500 depending on the node.\\
    \textbf{Note}: All jobs MUST set a value for the \option{--mem} option.
+\end{itemize}
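Putting the directives above together, a minimal job-script sketch; the job name, program name, and resource figures are illustrative placeholders, not site defaults:

    #!/encs/bin/bash
    #SBATCH --job-name=myjob        ## name shown in the queue (-J)
    #SBATCH --mail-type=ALL         ## email on BEGIN, END, FAIL, REQUEUE
    #SBATCH --chdir=./              ## working directory for the job (-D)
    #SBATCH --nodes=1               ## node count (-N)
    #SBATCH --ntasks=1              ## task count (-n)
    #SBATCH --cpus-per-task=8       ## core count (-c)
    #SBATCH --mem=32G               ## required: memory per node

    srun ./my-program               ## my-program stands in for your executable
    # Submit with: sbatch myjob.sh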
+\noindent Example with short option equivalents:
+\small
 \begin{verbatim}
-#SBATCH -J tmpdir          ## Job's name set to 'tmpdir'
-#SBATCH --mail-type=ALL    ## Receive all email type notifications
-#SBATCH -D ./              ## Use current directory as working directory
-#SBATCH -N 1               ## Node count required for the job
-#SBATCH -n 1               ## Number of tasks to be launched
-#SBATCH -c 8               ## Request 8 cores
-#SBATCH --mem=32G          ## Allocate 32G memory per node
+    #SBATCH -J myjob           ## Job's name set to 'myjob'
+    #SBATCH --mail-type=ALL    ## Receive all email type notifications
+    #SBATCH -D ./              ## Use current directory as working directory
+    #SBATCH -N 1               ## Node count required for the job
+    #SBATCH -n 1               ## Number of tasks to be launched
+    #SBATCH -c 8               ## Request 8 cores
+    #SBATCH --mem=32G          ## Allocate 32G memory per node
 \end{verbatim}
+\normalsize
 
-%
-If you are unsure about memory footprints, err on assigning a generous
+\noindent \textbf{Tip:} If you are unsure about memory footprints, err on assigning a generous
 memory space to your job, so that it does not get prematurely terminated.
-%(the value given to \api{h\_vmem} is a hard memory ceiling).
-You can refine
-%\api{h\_vmem}
-\option{--mem}
-values for future jobs by monitoring the size of a job's active
+You can refine \option{--mem} values for future jobs by monitoring the size of a job's active
 memory space on \texttt{speed-submit} with:
 
-%\begin{verbatim}
-%qstat -j <job-id> | grep maxvmem
-%\end{verbatim}
-
 \begin{verbatim}
-sacct -j <job-id>
-sstat -j <job-id>
+    sacct -j <job-id>
+    sstat -j <job-id>
 \end{verbatim}
 
-\noindent
-This can be customized to show specific columns:
+\noindent This can be customized to show specific columns:
 
 \begin{verbatim}
-sacct -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <job-id>
-sstat -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <job-id>
+    sacct -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <job-id>
+    sstat -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <job-id>
 \end{verbatim}
 
-Memory-footprint values are also provided for completed jobs in the final
-e-mail notification as ``maxvmsize''.
-%
+\noindent Memory-footprint and efficiency values (see \tool{seff}) are also provided for completed jobs in the final
+email notification as ``maxvmsize''.
 \emph{Jobs that request a low-memory footprint are more likely to load on a busy
-cluster.}
+cluster.}\\
 
-Other essential options are \option{--time}, or \verb|-t|, and \option{--account}, or \verb|-A|.
-%
+\noindent Other essential options are \option{--time}, or \option{-t}, and \option{--account}, or \option{-A}.
 \begin{itemize}
-\item
-\option{--time=
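To close the loop on refining \option{--mem}, a hypothetical monitoring session using the commands quoted in the hunk above; the job ID 12345 is a made-up placeholder, and seff is the efficiency tool the text mentions:

    sstat -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j 12345   # while the job is running
    sacct -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j 12345   # after it completes
    seff 12345    # summary of CPU and memory efficiency for the finished job
    # Use the reported MaxVMSize to set a tighter --mem on the next submission.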