diff --git a/doc/scheduler-job-examples.tex b/doc/scheduler-job-examples.tex index 402de13..070bd20 100644 --- a/doc/scheduler-job-examples.tex +++ b/doc/scheduler-job-examples.tex @@ -274,47 +274,44 @@ \subsubsection{Special Notes for Sending CUDA Jobs to the GPU Queues} \subsubsection{OpenISS Examples} \label{sect:openiss-examples} -These examples represent more comprehensive research-like jobs -for computer vision and other tasks with longer runtime (subject to the number of epochs and other parameters). -They derive from the actual research works of students and their theses and require the use of CUDA and GPUs. +These examples represent more comprehensive, research-oriented jobs for computer vision +and other tasks with longer runtime (subject to the number of epochs and other parameters). +They are derived from actual research conducted by students as part of their theses and require +the use of CUDA and GPUs. These examples are available as ``native'' jobs on Speed and as Singularity containers. \noindent Examples include: \paragraph{OpenISS and REID} \label{sect:openiss-reid} -A computer-vision-based person re-identification -(e.g., motion capture-based tracking for stage performance) part of the OpenISS -project by Haotao Lai~\cite{lai-haotao-mcthesis19} using TensorFlow and Keras. +A computer-vision-based person re-identification example +(e.g., motion capture-based tracking for stage performance) is part of the OpenISS +project by Haotao Lai~\cite{lai-haotao-mcthesis19}, implemented using TensorFlow and Keras. The script is available here: \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/openiss-reid-speed.sh}{openiss-reid-speed.sh}. -The fork of the original repo~\cite{openiss-reid-tfk} adjusted to run on Speed is available here: +A fork of the original repository~\cite{openiss-reid-tfk}, adapted to run on Speed, is available at \href{https://github.com/NAG-DevOps/openiss-reid-tfk}{openiss-reid-tfk}. 
-Detailed instructions on how to run it on Speed are in the README: +Detailed instructions on running this on Speed can be found in the README: \url{https://github.com/NAG-DevOps/speed-hpc/tree/master/src#openiss-reid-tfk} \paragraph{OpenISS and YOLOv3} \label{sect:openiss-yolov3} -The related code using YOLOv3 framework is in the -the fork of the original repo~\cite{openiss-yolov3} adjusted -to to run on Speed is available here: \href{https://github.com/NAG-DevOps/openiss-yolov3}{openiss-yolov3}.\\ +This example uses the YOLOv3 framework; a fork of the original code~\cite{openiss-yolov3}, adjusted to +run on Speed, is available here: \href{https://github.com/NAG-DevOps/openiss-yolov3}{openiss-yolov3}.\\ -\noindent Example job scripts can run on both CPUs and GPUs, as well as interactively using TensorFlow: +\noindent Example job scripts are provided for both batch (non-interactive) and interactive modes using TensorFlow: \begin{itemize} + \item Non-interactive (batch) mode: + \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/openiss-yolo-speed.sh} + {openiss-yolo-speed.sh} \item Interactive mode: \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/openiss-yolo-interactive.sh} {openiss-yolo-interactive.sh} - \item CPU-based job: - \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/openiss-yolo-cpu.sh} - {openiss-yolo-cpu.sh} - \item GPU-based job: - \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/openiss-yolo-gpu.sh} - {openiss-yolo-gpu.sh} \end{itemize} -\noindent Detailed instructions on how to run these on Speed are in the README: +\noindent Detailed instructions on running these scripts on Speed can be found in the README: \url{https://github.com/NAG-DevOps/speed-hpc/tree/master/src#openiss-yolov3} % -------------- 2.16 Singularity Containers ------------------ diff --git a/src/README.md b/src/README.md index 8135787..24a9726 100644 --- a/src/README.md +++ b/src/README.md @@ -4,8 +4,8 @@ This directory has example job scripts and
some tips and tricks how to run certcain things. + ## TOC - - [Sample Jobs](#sample-jobs) - [Creating Environments and Compiling Code on Speed](#creating-environments-and-compiling-code-on-speed) * [Correct Procedure](#correct-procedure) @@ -25,19 +25,18 @@ run certcain things. * [Diviner Tools](#diviner-tools) * [OpenFoam](#openfoam-multinode) * [OpenISS-yolov3](#openiss-yolov3) - + [Speed Login Configuration ](#speed-login-configuration) - + [Speed Setup and Development Environment Preperation](#speed-setup-and-development-environment-preperation) - + [Run Interactive Script ](#run-interactive-script) - + [Run Non-interactive Script ](#run-non-interactive-script) - + [Performance comparison](#performance-comparison) + + [Prerequisites](#prerequisites-openiss-yolov3) + + [Configuration and Execution](#configuration-and-execution-openiss-yolov3) + - [Run Non-interactive Script](#run-non-interactive-openiss-yolov3) + - [Run Interactive Script](#run-interactive-openiss-yolov3) + + [Performance Comparison](#performance-comparison-openiss-yolov3) * [OpenISS-reid-tfk](#openiss-reid-tfk) - + [Environment](#environment) + + [Prerequisites](#prerequisites-openiss-reid) + [Configuration and execution](#configuration-and-execution) * [CUDA](#cuda) + [Special Notes for sending CUDA jobs to the GPU Partition (`pg`)](#special-notes-for-sending-cuda-jobs-to-the-gpu-partition-pg) + [Jupyter notebook example: Jupyter-Pytorch-CUDA](#jupyter-example-gpu-pytorch) * [Python Modules](#python-modules) - @@ -262,122 +261,98 @@ This example is taken from OpenFoam tutorials section: $FOAM_TUTORIALS/incompres method scotch; 6. Exit the salloc session, go to the cavity directory and run the script: `sbatch --mem=10Gb -pps --constraint=el9 openfoam-multinode.sh` - + ## OpenISS-yolov3 -This is a case study example on image classification, for more details please visit [openiss-yolov3](https://github.com/NAG-DevOps/openiss-yolov3).
+This is a case study example on image classification, for more details please visit [OpenISS keras-yolo3](https://github.com/NAG-DevOps/openiss-yolov3). - -### Speed Login Configuration -1. As an interactive option is supported that show live video, you will need to enable ssh login with -X support. Please check this [link](https://www.concordia.ca/ginacody/aits/support/faq/xserver.html) to do that. -2. If you didn't know how to login to speed and prepare the working environment please check the manual in the follwing [link](https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/speed-manual.pdf) section 2. + +### Prerequisites -After you logged in to speed change your working directory to `/speed-scratch/$USER` directory. -``` -cd /speed-scratch/$USER/ -``` + #### Images and Videos + Images and videos can be from any source, but a sample video and images are provided in `video` and `image` folders in the [OpenISS-YOLOv3 Github repository](https://github.com/NAG-DevOps/openiss-yolov3). - -### Speed Setup and Development Environment Preperation + #### YOLOv3 Weights + The YOLOv3 weights can be downloaded from [YOLO website](http://pjreddie.com/darknet/yolo/). However the script provided includes a command to `wget` the weights from the link above. -The pre-requisites to prepare the virtual development environment using anaconda is explained in [speed manual](https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/speed-manual.pdf) section 3, please check that for more inforamtion. -1. Make sure you are in speed-scratch directory. Then Download OpenISS yolo3 project from [Github](https://github.com/NAG-DevOps/openiss-yolov3) to your speed-scratch proper diectory. -``` -cd /speed-scratch/$USER/ -git clone --depth=1 https://github.com/NAG-DevOps/openiss-yolov3.git -``` -2. Starting by loading anaconda module -``` -module load anaconda3/2023.03/default -``` -3. Switch to the project directoy. Create anaconda virtual environment, and configure development librires. 
The name of the environment can by any name here as an example named YOLO. Activate the conda environment YOLOInteractive. -``` -cd /speed-scratch/$USER/openiss-yolov3 -conda create -p /speed-scratch/$USER/YOLO -conda activate /speed-scratch/$USER/YOLO -``` -4. Install all required libraries you need and upgrade pip to install `opencv-contrib-python` library + #### Environment Setup + To set up the virtual development environment, refer to section 2.11 of the Speed manual [Creating Virtual Environments](https://nag-devops.github.io/speed-hpc/#anaconda) for detailed information. -``` -conda install python=3.5 -conda install Keras=2.1.5 -conda install Pillow -conda install matplotlib -conda install -c menpo opencv -pip install --upgrade pip -pip install opencv-contrib-python -``` + +### Configuration and execution +- Log in to Speed and navigate to your `speed-scratch` directory: -5. Validate conda environemnt and installed packages using following commands. Make sure the version of python and keras are same as requred. -``` -conda info --env -conda list -``` -if you need to delete the created virtual environment -``` -conda deactivate -conda env remove -p /speed-scratch/$USER/YOLO -``` + ssh $USER@speed.encs.concordia.ca + cd /speed-scratch/$USER/ - -### Run Interactive Script + **Note**: To see a live video in an interactive session, enable X11 forwarding. Linux supports X11 natively; to run an X server on other platforms: + - Windows: use MobaXterm or PuTTY + - macOS: use XQuartz with its xterm -File `openiss-yolo-interactive.sh` is the speed script to run video example to run it you follow these steps: -1. Run interactive job we need to keep `ssh -X` option enabled and `Xming` server in your windows is working (MobaXterm provides an alternative; on macOS use XQuartz). -2. The `sbatch` is not the proper command since we have to keep direct ssh connection to the computational node, so `salloc` will be used. -3. Enter `salloc` in the `speed-submit`.
The `salloc` will find an approriate computational node then it will allow you to have direct `ssh -X` login to that node. Make sure you are in the right directory and activate conda environment again. -``` -salloc --x11=first -t 60 -n 16 --mem=40G -p pg -cd /speed-scratch/$USER/openiss-yolov3 -conda activate /speed-scratch/$USER/YOLO -``` -4. Before you run the script you need to add permission access to the project files, then start run the script `./openiss-yolo-interactive.sh` -``` -chmod u+x *.sh -./openiss-yolo-interactive.sh -``` -5. A pop up window will show a classifed live video. + For more information refer to [How to Launch X11 applications](https://www.concordia.ca/ginacody/aits/support/faq/xserver.html) -Please note that since we have limited number of nodes with GPU support `salloc` the interactive sessions are time-limited to max 24h. +- Clone the [OpenISS-YOLOv3 Github repository](https://github.com/NAG-DevOps/openiss-yolov3) - -### Run Non-interactive Script + git clone --depth=1 https://github.com/NAG-DevOps/openiss-yolov3.git + cd /speed-scratch/$USER/openiss-yolov3 -Before you run the script you need to add permission access to the project files using `chmod` command. -``` -chmod u+x *.sh -``` -To run the script you will use `sbatch`, you can run the task on CPU or GPU compute nodes as follwoing: -1. For CPU nodes use `openiss-yolo-cpu.sh` file -``` -sbatch ./openiss-yolo-cpu.sh -``` + +#### Run Non-interactive Script + - Download and run `openiss-yolo-speed.sh` script from [Speed-HPC Github repository](https://github.com/NAG-DevOps/speed-hpc/tree/master/src). -2. For GPU nodes use `openiss-yolo-gpu.sh` file with option -p to specify a GPU partition (`pg`) for submission. -``` -sbatch -p pg ./openiss-yolo-gpu.sh -``` + sbatch ./openiss-yolo-speed.sh -For Tiny YOLOv3, just do in a similar way, just specify model path and anchor path with `--model model_file` and `--anchors anchor_file`. 
+The script performs the following: + - Configures job resources and paths for Conda environments. + - Creates or activates the Conda environment, and installs required packages if necessary. + - Downloads YOLOv3 weights. + - Converts the Darknet YOLO model to Keras format. + - Runs YOLO inference on a sample video. + - Deactivates the Conda environment and exits. - -### Performance comparison + +#### Run Interactive Script + **Note**: To run an interactive job, connect to Speed with `ssh -X`. + - Request resources with the `salloc` command: -Time is in minutes, run Yolo with different hardware configurations GPU types V100 and Tesla P6. Please note that there is an issue to run Yolo project on more than one GPU in case of teasla P6. The project use keras.utils library calling `multi_gpu_model()` function, which cause hardware faluts and force to restart the server. GPU name for V100 (gpu32), for P6 (gpu16) you can find that in scripts shell. + salloc --x11=first --mem=60G -n 32 --gpus=1 -p pt + - Download and run the `openiss-yolo-interactive.sh` script from the [Speed-HPC GitHub repository](https://github.com/NAG-DevOps/speed-hpc/tree/master/src). Make the project scripts executable first. + + chmod u+x *.sh + ./openiss-yolo-interactive.sh + + - A pop-up window will show the classified live video. + +The script does the following: + - Creates or activates the Conda environment and installs the required packages if necessary + - Downloads the YOLOv3 weights + - Converts the Darknet YOLO model into a Keras model using `convert.py` + - Runs YOLO inference on a sample video in interactive mode + +**Note**: If you need to delete the created virtual environment: + + conda deactivate + conda env remove -p /speed-scratch/$USER/envs/yolo_env + + For Tiny YOLOv3, run it the same way, but specify the model path and anchor path with `--model model_file` and `--anchors anchor_file`.
+ + +### Performance comparison + +Times are in minutes, measured by running YOLO on different hardware configurations with the GPU types V100 and Tesla P6. Please note that there is an issue running the YOLO project on more than one GPU on the Tesla P6: the project uses the `multi_gpu_model()` function from the keras.utils library, which causes hardware faults and forces a server restart. The GPU name for V100 is gpu32 and for P6 is gpu16; you can find these in the shell scripts. | 1GPU-P6 | 1GPU-V100 | 2GPU-V100 | 32CPU | | --------------|-------------- |-------------- |----------------| | 22.45 | 17.15 | 23.33 | 60.42 | | 22.15 | 17.54 | 23.08 | 60.18 | | 22.18 | 17.18 | 23.13 | 60.47 | - ## OpenISS Person Re-Identification Baseline The following are the steps required to run the *OpenISS Person Re-Identification Baseline* Project (https://github.com/NAG-DevOps/openiss-reid-tfk) on the *Speed* cluster. This implementatoin is based on tensorflow and keras - + ### Prerequisites #### Dataset diff --git a/src/openiss-yolo-cpu.sh b/src/openiss-yolo-cpu.sh deleted file mode 100755 index 593f944..0000000 --- a/src/openiss-yolo-cpu.sh +++ /dev/null @@ -1,36 +0,0 @@ -#!/encs/bin/tcsh - -# Give job a name -#SBATCH -J oi-yolo-batch-cpu - -# Set output directory to current -#SBATCH --chdir=./ - -# Send an email when the job starts, finishes or if it is aborted.
-#SBATCH --mail-type=ALL - -# Request GPU -# #SBATCH --gpus=2 - -# Request CPU with maximum memoy size = 80GB -#SBATCH --mem=80G - -# Request CPU slots -#SBATCH -n 16 - -#sleep 30 - -# Specify the output file name -#SBATCH -o openiss-yolo-batch-cpu.log - -module load anaconda3/2023.03/default - -conda activate /speed-scratch/$USER/YOLO - -# Image example -#srun python yolo_video.py --model model_data/yolo.h5 --classes model_data/coco_classes.txt --image --gpu_num 2 - -# Video example -srun python yolo_video.py --input video/v1.avi --output video/001.avi #--gpu_num 2 - -conda deactivate diff --git a/src/openiss-yolo-gpu.sh b/src/openiss-yolo-gpu.sh deleted file mode 100755 index 0445d9c..0000000 --- a/src/openiss-yolo-gpu.sh +++ /dev/null @@ -1,38 +0,0 @@ -#!/encs/bin/tcsh - -# Give job a name -#SBATCH -J oi-yolo-gpu - -# Set output directory to current -#SBATCH --chdir=./ - -# Send an email when the job starts, finishes or if it is aborted. -#SBATCH --mail-type=ALL - -# Request GPU -#SBATCH --gpus=2 - -# Request CPU with maximum memoy size = 40GB -#SBATCH --mem=40G - -# Request CPU slots -#SBATCH -n 16 - -#sleep 30 - -# Specify the output file name -#SBATCH -o openiss-yolo-batch-gpu.log - -module load anaconda3/2023.03/default - -conda activate /speed-scratch/$USER/YOLO - -# Image example -#srun python yolo_video.py --model model_data/yolo.h5 --classes model_data/coco_classes.txt --image --gpu_num 2 - -# Video example -srun python yolo_video.py --input video/v1.avi --output video/002.avi --gpu_num 2 - -conda deactivate - -# EOF diff --git a/src/openiss-yolo-interactive.sh b/src/openiss-yolo-interactive.sh index 9e2534c..4ce9026 100755 --- a/src/openiss-yolo-interactive.sh +++ b/src/openiss-yolo-interactive.sh @@ -1,14 +1,90 @@ #!/encs/bin/tcsh -## since it is salloc no need to configure cluster setting because -## it would choose the proper computational node +# Load required modules +module load anaconda3/2023.03/default +module load cuda/9.2/default -conda 
activate /speed-scratch/$USER/YOLO +# Define environment name and path +set ENV_NAME = "yolo_env" +set ENV_DIR = "/speed-scratch/$USER/envs" +set ENV_PATH = "$ENV_DIR/$ENV_NAME" +set TMP_DIR = "/speed-scratch/$USER/envs/tmp" +set PKGS_DIR = "/speed-scratch/$USER/envs/pkgs" -# Image example -#python yolo_video.py --model model_data/yolo.h5 --classes model_data/coco_classes.txt --image +mkdir -p $ENV_DIR +mkdir -p $TMP_DIR +mkdir -p $PKGS_DIR -# Video example -srun python yolo_video.py --input video/v1.avi --output video/003.avi --interactive +setenv TMP $TMP_DIR +setenv TMPDIR $TMP_DIR +setenv CONDA_PKGS_DIRS $PKGS_DIR + +# Check if the environment exists +conda env list | grep "$ENV_NAME" +if ($status == 0) then + echo "Environment $ENV_NAME already exists. Activating it..." + echo "======================================================" + conda activate "$ENV_PATH" + + if ($status != 0) then + echo "Error: Failed to activate Conda environment." + exit 1 + endif +else + echo "Creating Conda environment $ENV_NAME at $ENV_PATH..." + echo "====================================================" + conda create -y -p "$ENV_PATH" + + echo "Activating environment $ENV_NAME..." + echo "===================================" + conda activate "$ENV_PATH" + + if ($status != 0) then + echo "Error: Failed to activate Conda environment." + exit 1 + endif + + echo "Installing required packages..." + echo "===============================" + conda install -y -c conda-forge python=3.5.6 + conda install -y Keras=2.1.5 + conda install -y pillow matplotlib h5py + pip install --upgrade pip + pip install opencv-python==4.1.2.30 + pip install opencv-contrib-python==4.1.2.30 +endif + +echo "Conda environment summary..." +echo "============================" +conda info --envs +conda list + +# Download YOLOv3 weights +if (! -e yolov3.weights || -z yolov3.weights) then + echo "Downloading YOLOv3 weights..."
+ echo "=============================" + wget https://pjreddie.com/media/files/yolov3.weights +else + echo "YOLOv3 weights already exist. Skipping download..." + echo "==================================================" +endif + +sleep 30 + +# Convert the Darknet YOLO model to a Keras model +if (! -e model_data/yolo.h5 || -z model_data/yolo.h5) then + echo "Keras model NOT found. Converting Darknet YOLO model to Keras format..." + echo "=======================================================================" + srun python convert.py yolov3.cfg yolov3.weights model_data/yolo.h5 +else + echo "Keras model already exists. Skipping conversion..." + echo "==================================================" +endif + +# Run YOLO video processing - video example (interactive) +echo "Running interactive YOLO video processing..." +echo "============================================" +srun python yolo_video.py --input video/v1.avi --output video/002.avi --interactive conda deactivate +exit \ No newline at end of file diff --git a/src/openiss-yolo-speed.sh b/src/openiss-yolo-speed.sh new file mode 100644 index 0000000..463cd34 --- /dev/null +++ b/src/openiss-yolo-speed.sh @@ -0,0 +1,103 @@ +#!/encs/bin/tcsh + +#SBATCH --job-name openiss-yolo +#SBATCH --mail-type=ALL +#SBATCH --chdir=./ +#SBATCH -o output-%A.log + +# Request Resources +#SBATCH --mem=60G +#SBATCH -n 32 +#SBATCH --gpus=1 +#SBATCH -p pt + +# Load required modules +module load anaconda3/2023.03/default +module load cuda/9.2/default + +# Define environment name and path +set ENV_NAME = "yolo_env" +set ENV_DIR = "/speed-scratch/$USER/envs" +set ENV_PATH = "$ENV_DIR/$ENV_NAME" +set TMP_DIR = "/speed-scratch/$USER/envs/tmp" +set PKGS_DIR = "/speed-scratch/$USER/envs/pkgs" + +mkdir -p $ENV_DIR +mkdir -p $TMP_DIR +mkdir -p $PKGS_DIR + +setenv TMP $TMP_DIR +setenv TMPDIR $TMP_DIR +setenv CONDA_PKGS_DIRS $PKGS_DIR + +# Check if the environment exists +conda env list | grep "$ENV_NAME" +if ($status == 0) then + echo 
"Environment $ENV_NAME already exists. Activating it..." + echo "======================================================" + conda activate "$ENV_PATH" + + if ($status != 0) then + echo "Error: Failed to activate Conda environment." + exit 1 + endif +else + echo "Creating Conda environment $ENV_NAME at $ENV_PATH..." + echo "====================================================" + conda create -y -p "$ENV_PATH" python=3.5 keras=2.1.5 -c conda-forge + + echo "Activating environment $ENV_NAME..." + echo "===================================" + conda activate "$ENV_PATH" + + if ($status != 0) then + echo "Error: Failed to activate Conda environment." + exit 1 + endif + + echo "Installing required packages..." + echo "===============================" + pip install --upgrade pip + pip install pillow matplotlib h5py + #pip install tensorflow-gpu==1.8 + pip install opencv-python==4.1.2.30 + pip install opencv-contrib-python==4.1.2.30 +endif + +echo "Conda environemnt summary..." +echo "============================" +conda info --envs +conda list + +# Download YOLOv3 weights +if (! -e yolov3.weights || -z yolov3.weights) then + echo "Downloading YOLOv3 weights..." + echo "=============================" + wget https://pjreddie.com/media/files/yolov3.weights +else + echo "YOLOv3 weights already exist. Skipping download..." + echo "==================================================" +endif + +sleep 30 + +# Convert the Darknet YOLO model to a Keras model +if (! -e model_data/yolo.h5 || -z model_data/yolo.h5) then + echo "Keras model NOT found. Converting Darknet YOLO model to Keras format..." + echo "=======================================================================" + srun python convert.py yolov3.cfg yolov3.weights model_data/yolo.h5 +else + echo "Keras model already exists. Skipping conversion..." 
+ echo "==================================================" +endif + +# Run YOLO video processing - image example +#srun python yolo_video.py --model model_data/yolo.h5 --classes model_data/coco_classes.txt --image + +# Run YOLO video processing - video example (non-interactive) +echo "Running non-interactive YOLO video processing..." +echo "================================================" +srun python yolo_video.py --input video/v1.avi --output video/001.avi #--gpu_num 1 + +conda deactivate +exit \ No newline at end of file
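Both job scripts skip the weight download and the Darknet-to-Keras conversion when the output file already exists and is non-empty. Below is a minimal sketch of that guard, written in POSIX `sh` rather than the `tcsh` used by the scripts so that it can be tried outside the cluster. The helper names `needs_build` and `run_step` and the file `demo.weights` are hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): succeeds when the artifact
# is missing or empty, mirroring the tcsh test `! -e FILE || -z FILE`
# used in the job scripts. `-s` is true only for an existing, non-empty file.
needs_build() {
    [ ! -s "$1" ]
}

# Hypothetical wrapper: run a command only when its output artifact is
# missing or empty; otherwise report that the step is skipped.
run_step() {
    artifact=$1
    shift
    if needs_build "$artifact"; then
        echo "Building $artifact..."
        "$@"
    else
        echo "$artifact already exists. Skipping..."
    fi
}

# Demonstration in a throwaway directory, so no real weights are downloaded.
# The first call creates the file; the second call finds it and skips.
workdir=$(mktemp -d)
run_step "$workdir/demo.weights" sh -c "echo data > '$workdir/demo.weights'"
run_step "$workdir/demo.weights" sh -c "echo data > '$workdir/demo.weights'"
rm -rf "$workdir"
```

In the job scripts the same pattern would wrap the `wget` and `convert.py` steps, so re-submitting a job after a partial run only repeats the steps whose outputs are missing.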