Update CIDA_computing_resources.Rmd and CIDA_BIOS_Cluster.Rmd (#38)
* Working on CSPH Biostats HPC article.

* Fixed image display issue

* Actually fixed image issue

* Added more HPC article content

* Editing and spelling for HPC article, removed in-progress section stubs.

* Updated CIDA Computing Resources to remove SLCE and add Biostats HPC.

* Edited and updated wording about computing resources.

* Corrected spelling error in CIDA_computing_resourced.Rmd

* Modified wording of HIPAA/PHI policy
Andrew0Hill authored Dec 10, 2024
1 parent 9b1d4c5 commit 2558856
Showing 11 changed files with 282 additions and 24 deletions.
Binary file added vignettes/Biostats_HPC_diagram.png
262 changes: 262 additions & 0 deletions vignettes/CIDA_BIOS_Cluster.Rmd
@@ -0,0 +1,262 @@
---
title: "Computing on the CSPH Biostats Cluster"
author: "Research Tools Committee"
date: "Last Updated: `r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Computing on the CSPH Biostats Cluster}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Definitions
- <a name="def_hpc"></a>**HPC** - High-performance Computing
- <a name="def_node"></a>**Node** - A single computer in the cluster's network. Most HPC clusters have a *head node* and one or more *compute nodes*
- <a name="def_headnode"></a>**Head Node** - A node within the cluster which serves as the user's access/login point to the cluster. Depending on cluster configuration, the head node may also be responsible for scheduling and distributing jobs to the compute nodes.
- <a name="def_compnode"></a>**Compute Node** - A node within the cluster which is designated for running user-submitting jobs. Compute nodes usually offer large amounts of computational resources (CPU cores, RAM, etc).
- <a name="def_clust"></a>**Cluster** - A group of networked computers, usually running cluster-management software to coordinate resource sharing among multiple users.
- <a name="def_job"></a>**Job** - A computational task executed on the cluster. A submitted job will be scheduled and executed on one of the compute nodes.

## Introduction to HPC (High-Performance Computing)

An HPC cluster is a group of networked high-performance computers (*nodes*).

A typical cluster will have multiple *compute nodes* which can perform heavy computation, as well as a *head node* which serves as the user's access point to the cluster, and may also be responsible for scheduling jobs among the *compute nodes*.
The *compute nodes* in a cluster typically have compute resources (CPU cores, RAM, disk space) which far exceed those of a typical laptop or desktop computer.

Because HPC clusters are intended to serve a group of people (e.g., a biostatistics department) rather than a single user, they use the concept of [jobs](#def_job) to allow multiple users to share the cluster's resources effectively.

When a user is ready to run something (an analysis, a processing pipeline, etc.) on the cluster, they submit a new *job*. The cluster then schedules and runs the job as soon as compute resources are available.

```{r, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("cluster_basic_diagram.png")
```

The figure above shows an example cluster with three users. Each user connects to the Head Node to submit their jobs.

User 1 has submitted two jobs, which are both running on Compute Node 1. User 2 has submitted a single job, which is running on Compute Node 2. User 3 has submitted a job which requires a larger amount of compute resources (CPU cores, RAM, etc.). This job runs on Compute Node 2 to provide the user with the resources they requested.

HPC clusters are useful for:

1. Performing analyses which take a long time to run (e.g., a large-scale analysis which takes hours to complete).
2. Performing analyses which are too resource-intensive (require too much RAM, disk space, etc.) to run on a typical computer.

## CSPH Biostats HPC Cluster

The CSPH Biostats cluster consists of four nodes. The `csphbiostats.ucdenver.pvt` node serves as both the head node and one of the compute nodes (i.e., submitted jobs may also run on this node), and the `cidalappc[1-3].ucdenver.pvt` nodes serve as compute nodes.

```{r, echo=FALSE, out.width="100%", fig.align="center"}
knitr::include_graphics("Biostats_HPC_diagram.png")
```

The CIDA/Biostats cluster uses the [SLURM](https://slurm.schedmd.com) system for job scheduling and resource management.
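
Once you are logged in (see the next section), you can check which nodes exist and whether they are busy using SLURM's `sinfo` command. The exact output depends on the cluster's current state and configuration:

```
# List the cluster's partitions, nodes, and their current state
sinfo

# Show per-node details (CPU count, memory, state)
sinfo -N -l
```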

## Accessing the CSPH Biostats Cluster

To access the CSPH Biostats cluster, first submit a support ticket on the [SOM IS web page](https://medschool.cuanschutz.edu/informationservices) requesting:

1. Access to the 'CSPH/CIDA Biostats Cluster'.
2. (optional) A directory under `/biostats_share` (e.g., `/biostats_share/<your_username>`).

Once approved, an account will be created for you on the server.

To log in to the CSPH Biostats Cluster, you can use SSH from the command line or an SSH client of your choice.

```{r, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("biostats_hpc_ssh.png")
```

If you are connecting from the command line (like the above example), run:

```
ssh <your_username>@csphbiostats.ucdenver.pvt
```

where `<your_username>` is your CU system username (i.e., your username for UCDAccess, Outlook, etc.).

On login, you will be prompted for a password, which is your CU system password (i.e., your password for UCDAccess, Outlook, etc.).

Once you've successfully logged in, you should see a prompt like the final line in the screenshot, showing that you are logged in to `csphbiostats.ucdenver.pvt`.

To make future logins more convenient, you can configure an SSH config profile and key pair, which enables password-less login.
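
A minimal setup might look like the following; the host alias `biostats` and the `ed25519` key type are illustrative choices rather than required settings:

```
# Generate a key pair if you don't already have one (accept the default file location)
ssh-keygen -t ed25519

# Copy your public key to the cluster; you will be prompted for your password one last time
ssh-copy-id <your_username>@csphbiostats.ucdenver.pvt
```

You can then add an entry to `~/.ssh/config` so that `ssh biostats` connects directly:

```
Host biostats
    HostName csphbiostats.ucdenver.pvt
    User <your_username>
```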

## Submitting a Job on the CSPH Biostats Cluster

In the sections below, I will describe a few different ways of submitting a job on the cluster, along with their potential use cases.

### Using `sbatch`
The most common way of submitting a job on the cluster is to use the `sbatch` command.

To submit a job using `sbatch`, you should first create a batch script which will list the commands to be executed as part of your job.

The example below shows a simple batch script, `my_batch.sh`, which executes a single R script.

```
#!/bin/bash
#SBATCH --job-name=my_batch
#SBATCH --output=my_batch.log
#SBATCH --error=my_batch.err
Rscript my_analysis.R
```

The first line:

```#!/bin/bash```

is required, and is used to determine how your script will be executed. In this case, the script will be executed using `bash`.

The next few lines will be parsed by SLURM to set parameters/options for your batch job.

```
#SBATCH --job-name=my_batch
#SBATCH --output=my_batch.log
#SBATCH --error=my_batch.err
```

In order:

1. `--job-name` - Sets a name for your job.
2. `--output` - Sets the output log file for your job. Anything your script writes to standard output will be sent to this file.
3. `--error` - Sets the error log file for your job. Anything your script writes to standard error will be sent to this file.

Any line beginning with `#SBATCH` that appears before the first command in your script will be parsed by SLURM.

Although `#SBATCH` lines are not required, I recommend at least providing the `--output` and `--error` options; otherwise, your output and error streams will both be directed to the default file `slurm-<job_id>.out`.
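
If your job needs more than the default resources, you can request them with additional `#SBATCH` lines. The values below are purely illustrative; the limits available to you depend on the cluster's configuration:

```
#SBATCH --cpus-per-task=4   # Request 4 CPU cores for the job
#SBATCH --mem=16G           # Request 16 GB of RAM
#SBATCH --time=24:00:00     # Set a maximum runtime of 24 hours
```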

We can submit this job by running:

```
sbatch my_batch.sh
```

```{r, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("cluster_sbatch.png")
```

When we execute the `sbatch` command, SLURM will assign the job an ID and schedule the job to execute. In this case, our job is assigned ID `6179` (first arrow in the figure).

By using the `squeue` command, we can obtain the status of all jobs currently scheduled across the cluster (including those submitted by other users).

Using the assigned job ID, we can use the `JOBID` column to identify our job in the queue (second arrow in the figure).

The `squeue` output also provides some useful information about our job:

- The `ST` column tells us that our job is in the running (`R`) state.
- The `TIME` column tells us that our job has been running for 9 seconds.
- The `NODELIST(REASON)` column tells us that our job is executing on the `csphbiostats.ucdenver.pvt` node.


#### When to use `sbatch`
Using `sbatch` is most beneficial for long-running scripts or analyses. After the job has been submitted using `sbatch`, your job will execute on the cluster until completion. You can continue to work on the cluster (submitting other jobs, etc) or log out without affecting any of your running jobs.

You can monitor the progress of your job by checking your log (`--output`) and error (`--error`) files to see any outputs/messages printed by your script.

Additionally, you can use the `squeue` command to obtain other information about your job, including its current state (`ST`), elapsed runtime (`TIME`), and the node it is running on (`NODELIST`).
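
For example, assuming the job ID `6179` from the earlier screenshot and the `my_batch.log` file from the script above:

```
# Show only your own jobs in the queue
squeue -u <your_username>

# Follow your job's output log as it is written
tail -f my_batch.log

# Cancel the job by its ID if something goes wrong
scancel 6179
```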

### Using `srun`

Submitting a job using SLURM's `srun` command will schedule and run the job as soon as possible. When the job begins to run, you will see the output of the job in your terminal as it executes.

As with running a script locally on the command line (e.g., `./my_script.sh`), you will not be able to execute other commands in your terminal window until the submitted job completes.

The example below shows the execution of a simple shell script using `srun`.

The command:

```
srun --exclude=csphbiostats.ucdenver.pvt ./my_script.sh
```

will schedule a new job to run the script `./my_script.sh`.

The `--exclude` flag tells SLURM to schedule the job on any node except the node(s) listed. In this case, we exclude the `csphbiostats.ucdenver.pvt` node to ensure our job runs on one of the `cidalappc[1-3].ucdenver.pvt` nodes (and the first line of the script output shows our job ran on `cidalappc01`).

```{r, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("cluster_srun.png")
```
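
Here, `my_script.sh` can be any executable script. A hypothetical placeholder that simply reports which node it ran on might look like:

```
#!/bin/bash
# my_script.sh -- hypothetical placeholder that reports where and when it ran
echo "Running on node: $(hostname)"
echo "Finished at: $(date)"
```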

#### Interactive Jobs
`srun` is also useful for running *interactive* jobs. An interactive job opens a terminal session on a compute node, allowing you to run commands and scripts interactively.

Interactive jobs are useful for running quick commands or scripts, or anytime you'd like to directly monitor the output of your script/command.

The command:

```
srun --pty --exclude=csphbiostats.ucdenver.pvt /bin/bash -i
```

will schedule a new interactive job on a compute node.

This command is similar to the previous `srun` command we used but has a few key differences:

1. The addition of the `--pty` flag tells SLURM that this is an interactive job.
2. The command `/bin/bash -i` will execute an interactive shell on the compute node.

After submitting the job, we see the prompt string has changed from:

```[hillandr@csphbiostats job_example]```

to

```[hillandr@cidalappc01 job_example]```.

This indicates that we are now running an interactive job on the compute node `cidalappc01`.

At this point, we can execute any commands or scripts normally.

To end the interactive job, type `exit`.

```{r, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("interactive_job.png")
```
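
If your interactive session needs more than the default resources, you can request them when starting the job. The values below are illustrative:

```
srun --pty --cpus-per-task=8 --mem=32G --exclude=csphbiostats.ucdenver.pvt /bin/bash -i
```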

#### When to use `srun`

`srun` is useful for running interactive jobs with fast-running commands/scripts, or for debugging issues with a larger script you will eventually submit using `sbatch`.

I don't recommend using `srun` for long-running jobs: if you disconnect from the cluster while the `srun` command is still executing, the job may be cancelled.

## Uploading Data to the CSPH Biostats Cluster

The most convenient way to upload data to and download data from the cluster is SFTP. There are multiple ways to use SFTP, including the `sftp` command on the command line (Mac/Unix-like systems only).
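
A minimal command-line session might look like the following; the file names and the `/biostats_share` path are placeholders:

```
sftp <your_username>@csphbiostats.ucdenver.pvt
sftp> put local_data.csv /biostats_share/<your_username>/
sftp> get /biostats_share/<your_username>/results.rds
sftp> exit
```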

On Mac, the [Cyberduck](https://cyberduck.io) application is a free and intuitive GUI SFTP client. On Windows systems, [WinSCP](https://winscp.net/eng/index.php) is another popular choice.

The screenshots below show an example of connecting to `csphbiostats.ucdenver.pvt` using Cyberduck on a Mac.

First, open Cyberduck and click the 'Open Connection' button on the top bar.

```{r, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("cyberduck_1.png")
```

Next, ensure that the 'SFTP' option is selected in the dropdown, then input your SSH credentials.

```{r, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("cyberduck_2.png")
```

If successful, you should see a file browser interface showing your home directory on `csphbiostats.ucdenver.pvt`.

You can use the interface to navigate and download existing files. You can also drag-and-drop files from your local machine to upload them to the cluster.

```{r, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("cyberduck_3.png")
```

### Networked Storage

The CSPH Biostats cluster uses a shared filesystem for many important directories, including `/home` and `/biostats_share`. This means that any files (scripts, data, etc) you upload to one of these directories on `csphbiostats.ucdenver.pvt` will be accessible from any node in the cluster.

Similarly, any output files generated from a job running on a compute node will also be accessible from `csphbiostats.ucdenver.pvt`, making it easy to download the results of a job.
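
For example, a batch script can reference shared paths directly, and its outputs will be visible on the head node no matter which compute node ran the job. The paths below are placeholders:

```
#!/bin/bash
#SBATCH --job-name=shared_example
#SBATCH --output=/biostats_share/<your_username>/logs/shared_example.log
#SBATCH --error=/biostats_share/<your_username>/logs/shared_example.err

# Both the input script and its outputs live on the shared filesystem
Rscript /biostats_share/<your_username>/scripts/my_analysis.R
```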

### Other SLURM Resources

The SLURM system has an excellent [website](https://slurm.schedmd.com/documentation.html), with documentation for each command. I recommend reading the documentation for at least the [sbatch](https://slurm.schedmd.com/sbatch.html) command, as it contains information about many configuration options not listed in this document.

44 changes: 20 additions & 24 deletions vignettes/CIDA_computing_resources.Rmd
@@ -1,7 +1,7 @@
---
title: "CIDA Computing Resources"
author: "Research Tools Committee"
date: '2023-02-7'
date: '2024-12-06'
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{CIDA Computing Resources}
@@ -23,39 +23,35 @@ Who does this apply to: All members of CIDA (Professor, RA, RI, Senior RI, PRA,


### Computing Resources Available to CIDA Members
This table lists the computing resources available to CIDA members. Please note that none of these computing resources are HIPPA compliant and therefore no PHI can be stored on these machines. (please see the data storage guideline for more information).
SLCE machines are not HIPAA compliant
This table lists the computing resources available to CIDA members. Please note that each compute resource has differing policies about storage of PHI/HIPAA data. Please see the data storage guideline for more information.

* Backups are available via Wasabi: https://wasabi.com/hot-cloud-storage/.



| Computing Resource | Description |
| --- | --- |
| Personal Computer/Laptop | Free for CIDA members|
| SLCE | Free for CIDA members|
| Alpine | Almost Free for CIDA members (if requested resources <= default allocation)|
| Commercial computing platforms such as AWS, IBM, ... | Cost depends on the type of requested resouces |
| Computing Resource | Description | HIPAA/PHI Data |
| --- | --- | --- |
| Personal Computer/Laptop | Free for CIDA members| Limited |
| CIDA/Biostats HPC | Free for CIDA members | Limited |
| Alpine | Almost Free for CIDA members (if requested resources <= default allocation)| No |
| Commercial computing platforms such as AWS, IBM, ... | Cost depends on the type of requested resources | No |

#### Personal Computer\

CIDA provides all its members with a PC or Macintosh based on their preference.


#### CIDA SLCE \
SLCE is the local computing resource funded by CIDA (BERDxxxx) and consists of 5 workstations to address the computing resources required for vriety of tasks.

#### CIDA/Biostats HPC Cluster\

The CIDA/Biostats HPC Cluster is an HPC cluster running the SLURM cluster management software.

This cluster is useful for computations which require large amounts of computing resources or involve long-running tasks.

| |SLCE # 1|SLCE # 2| SLCE # 3| SLCE # 4| SLCE # 5|
| --- | --- |---| --- |---| --- |
|CPU | Intel Xeon Gold 6152 2.1G X (2) CPU, 44 cores|Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (2 cores)|TBD|TBD|TBD
|Disk storage|50TB|4TB|4TB|2TB|2TB|
|Memory Size | 1TB | 128 GB| 128 GB| 128 GB| 128 GB|
|OS |CentOS 7.x |Linux|Windows|Windows|Windows|
|Software|R, R Studio, MatLab, Python, and Java|Fiji,Python, R, RStudio|R, RStudio, SAS, PASS, Git, Anaconda, Stata|R, RStudio, SAS, PASS, Git, Anaconda, Stata|R, RStudio, SAS, PASS, Git, Anaconda, Stata|
Instructions for accessing and using the CIDA/Biostats HPC are available [here](CIDA_BIOS_Cluster.html).

| |csphbiostats|cidalappc01|cidalappc02|cidalappc03|
| --- | --- |---| --- |---|
|CPU |2x Intel Xeon Gold 6152 22-core CPU | 2x AMD EPYC 7H12 64-Core CPU | 2x AMD EPYC 7H12 64-Core CPU | 2x AMD EPYC 7H12 64-Core CPU |
|Disk storage| 50TB (Shared across all nodes) | " | " | "
|Memory Size | 1TB | 1TB | 512GB | 1TB |
|OS | Rocky Linux 9.5 (RHEL) | " | " | " |
|Software| R, Python, RStudio Server, Jupyter Lab | " | " | " |


#### Alpine\
@@ -64,7 +60,7 @@ Alpine is the University of Colorado Boulder Research Computing's third-genera
Alpine can be securely accessed anywhere, anytime using OpenOnDemand or ssh connectivity to the CURC system. Step-by-step instruction to access Alpine is available at: https://curc.readthedocs.io/en/latest/clusters/alpine/quick-start.html


|| Alphine ||
|| Alpine ||
|---| :---: |---:|
|Processor |General compute nodes|GPU |
| Nodes | 64 |11 |
Binary file added vignettes/biostats_hpc_ssh.png
Binary file added vignettes/cluster_basic_diagram.png
Binary file added vignettes/cluster_sbatch.png
Binary file added vignettes/cluster_srun.png
Binary file added vignettes/cyberduck_1.png
Binary file added vignettes/cyberduck_2.png
Binary file added vignettes/cyberduck_3.png
Binary file added vignettes/interactive_job.png
