getting-started branch (#66)
* Added a Getting Started section.

* fix relative paths, add terminal page to nav

* small formatting fixes

* fixes to formatting for software loading page

* small fixes to formatting in remaining docs

* Improved navigation for getting_started by enabling navigation.indexes and restructuring the file hierarchy.

* Fixed formatting issues that were flagged by codacy.

* fix relative paths

* fix remaining codacy issues with references in code fences

* Used pymdownx.snippets to enable reuse of md content in different nav categories.

* Fixed formatting flagged by codacy.

* Fixed more formatting flagged by codacy.

* Fixed more formatting flagged by codacy.

* fix relative paths

* add snippets to not-in-nav section of config

* include jupyter-ondemand.md under web-portals

* attempt to fix codacy error about atx headings

* actual attempt at fixing atx headers

* pull unnecessary heading

---------

Co-authored-by: Kim Wong <[email protected]>
Co-authored-by: Comeani <[email protected]>
3 people authored Aug 27, 2024
1 parent c3d6124 commit 7b894fb
Showing 31 changed files with 1,714 additions and 837 deletions.
Binary file added docs/_assets/img/web-portals/MobaXterm.png
Binary file added docs/_assets/img/web-portals/iTerm2.png
24 changes: 24 additions & 0 deletions docs/getting-started/getting-started-step1-account.md
@@ -0,0 +1,24 @@
---
hide:
- toc
---

# Step 1 - Getting an Account

Access to the CRC Ecosystem requires a CRC account and an accompanying resource allocation.
All active Pitt faculty, instructors, and center directors can request a resource allocation using the web form in
our [**service catalog**](https://crc.pitt.edu/service-request-forms).

The CRC Ecosystem is hosted at the Pitt data center and is firewalled within PittNet. You will first need to establish
a [**VPN**](https://services.pitt.edu/TDClient/33/Portal/KB/ArticleDet?ID=293) connection in order to gain access.

A schematic of this part of the process is highlighted below.

![GETTING-STARTED-MAP](../_assets/img/getting-started/getting-started-step-1.png)

<ins>**Definitions**</ins>

* **Resource allocation** -- an allotment of computing time and/or data storage quota
* **Client** -- this is your computer or internet-connected device
* **PittNet** -- the internal University network
* **Firewall** -- a network security device that monitors and filters incoming and outgoing network traffic based on an organization's previously established security policies
78 changes: 78 additions & 0 deletions docs/getting-started/index.md
@@ -0,0 +1,78 @@
---
hide:
- toc
---

# Big Picture Overview

The University of Pittsburgh provides its research community access to high performance computing and data storage
resources. These systems are maintained and supported through the Center for Research Computing (CRC) and Pitt IT.
To get started, you will need a CRC account, which you will use to log in to the Access Portals and interact with
the CRC Ecosystem.

A schematic of the process is depicted below.

![GETTING-STARTED-MAP](../_assets/img/getting-started/getting-started-map.png)

<ins>**Definitions**</ins>

* **Client** -- this is your computer or internet-connected device
* **Access Portal** -- one of several remote servers used to submit jobs to the high performance computing clusters or to perform
data management operations
* **CRC Ecosystem** -- the total footprint of the CRC infrastructure, including high performance computing
clusters, data storage systems, networking equipment, and software

<ins>**Available Resources**</ins>

<link rel="stylesheet" href="https://cdn.datatables.net/1.13.4/css/jquery.dataTables.min.css">
<table class="display cell-border" id="aTable">
<thead>
<tr>
<td>Cluster Acronym</td>
<td>Full Form of Acronym</td>
<td>Description of Use Cases</td>
</tr>
</thead>
<tbody>
<tr>
<td>mpi</td>
<td>Message Passing Interface</td>
<td>For tightly coupled parallel codes that use the Message Passing Interface APIs for distributing computation
across multiple nodes, each with its own memory space</td>
</tr>
<tr>
<td>htc</td>
<td>High Throughput Computing</td>
<td>For genomics and other health sciences-related workflows that can run on a single node</td>
</tr>
<tr>
<td>smp</td>
<td>Shared Memory Processing</td>
<td>For jobs that can run on a single node where the CPU cores share a common memory space</td>
</tr>
<tr>
<td>gpu</td>
<td>Graphics Processing Unit</td>
<td>For AI/ML applications and physics-based simulation codes that have been written to take advantage of accelerated
computing on GPU cores</td>
</tr>
</tbody>
</table>
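
As a quick illustration of how these cluster names are used in practice, the sketch below targets a specific cluster with Slurm's `-M`/`--clusters` flag when submitting a job; `<job_script>` is a placeholder for your own submission script (job submission is covered in detail in Step 3).

```bash
# Submit a job script to the smp cluster
sbatch -M smp <job_script>

# The same flag selects any of the other clusters, e.g. gpu
sbatch -M gpu <job_script>
```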

<script type="text/javascript" src="https://code.jquery.com/jquery-3.7.0.min.js"></script>
<script type="text/javascript" src="https://cdn.datatables.net/1.13.4/js/jquery.dataTables.min.js"></script>

<script type="text/javascript">
$(document).ready(function() {
// Render the cluster table as a plain, static table:
// no paging, searching, sorting, length-change, or info controls.
$('#aTable').DataTable({
    "paging": false,
    "lengthChange": false,
    "info": false,
    "autoWidth": false,
    "searching": false,
    "ordering": false
});
});
</script>
1 change: 1 addition & 0 deletions docs/getting-started/jupyter-hub.md
@@ -0,0 +1 @@
--8<-- "jupyter-hub.md"
1 change: 1 addition & 0 deletions docs/getting-started/open-ondemand.md
@@ -0,0 +1 @@
--8<-- "open-ondemand.md"
32 changes: 32 additions & 0 deletions docs/getting-started/step2/index.md
@@ -0,0 +1,32 @@
---
hide:
- toc
---

# Step 2 - Logging in to Access Portals

Once you have established a VPN to PittNet, you can access the CRC advanced computing and storage resources via
several portals, including

* [**SSH connection using a terminal**](../terminal.md) (see the example following this list)
* [**Linux Desktop webportal**](../viz.md)
* [**Open OnDemand webportal**](../open-ondemand.md)
* [**JupyterHub webportal**](../jupyter-hub.md)
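
For the SSH option, connecting is a single command run from your local terminal while connected to the Pitt VPN. This is a minimal sketch; the angle-bracket values are placeholders for your Pitt username and the CRC login hostname given on the [terminal page](../terminal.md).

```bash
# Open an SSH session to a CRC login node (replace the placeholders)
ssh <PittUsername>@<crc-login-hostname>
```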

A schematic of this part of the process is highlighted below.

![GETTING-STARTED-MAP](../../_assets/img/getting-started/getting-started-step-2.png)

## Guidance on appropriate usage of access portals

Many users are logged into the CRC login nodes at any given time. These are the gateways everyone uses to perform
interactive work such as editing code, submitting jobs, and checking job status.

Executing processing scripts or other resource-intensive commands on these nodes can cause substantial slowdowns for other users.
For this reason, it is important to make sure that this kind of work is done either in an interactive session on a node
from one of the clusters or as a batch job submission.
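
A minimal sketch of requesting such an interactive session with Slurm's `srun` is shown below; the cluster, partition, and time limit are assumptions that you should adapt to your allocation.

```bash
# Request a one-hour interactive shell on a compute node of the smp cluster
srun -M smp --partition=<partition> --ntasks=1 --cpus-per-task=1 --time=01:00:00 --pty bash
```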

Resource-intensive processes found to be running on the login nodes may be killed at any time.

<ins>**The CRC team reserves the right to revoke cluster access of any user who repeatedly causes slowdowns on the login
nodes with processes that can otherwise be run on the compute nodes.**</ins>
140 changes: 140 additions & 0 deletions docs/getting-started/step3/getting-started-step3-manage-jobs.md
@@ -0,0 +1,140 @@
---
hide:
- toc
---

# How to Manage Computing Jobs

??? abstract "Skip to Table of Commands"
| Command | Description|
| :------------------------------- | :--------------------------------------------------------- |
| `sbatch <job_script>` | Submit `<job_script>` to the Slurm scheduler |
| `squeue -M <cluster> -u $USER` | Display my queued jobs that were submitted to `<cluster>` |
| `scontrol -M <cluster> show job <JobID>` | Display details about job `<JobID>` on `<cluster>` |
| `scancel -M <cluster> <JobID>` | Cancel job `<JobID>` that was submitted to `<cluster>` |

Now that you have crafted a job submission script, how do you submit it and manage the job? We will use the Amber example, which can be found in

```bash
/ihome/crc/getting_started/mocvnhlysm_1L40S.1C
```

to drive the discussion. The command for submitting a job script is `sbatch <job_script>`, (1) where `<job_script>` is a text file containing
Slurm directives and commands that will be executed from top to bottom. It does not matter whether the job submission script ends with a `.slurm`
extension; our recommendation is to adopt a naming convention that makes it simple to spot the job submission script among all your files. To submit
the Amber job to Slurm, execute `sbatch amber.slurm` on the command line: (2)
{ .annotate }

1. Throughout the examples, we use the conventional syntax `<variable>` to represent a placeholder for an expected value that the user
will provide.
2. ![content_tabs](../../_assets/img/help-annotation/mkdocs_example_tabs.png)

!!! example "sbatch &lt;job_script>"

=== "command"
```commandline
sbatch amber.slurm
```

=== "output"
```bash
[[email protected] mocvnhlysm_1L40S.1C]$sbatch amber.slurm
Submitted batch job 956929 on cluster gpu
[[email protected] mocvnhlysm_1L40S.1C]$
```

!!! note
Every job submission is assigned a Job ID. In this example, the Job ID is 956929.
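
For reference, a job submission script is simply a shell script whose `#SBATCH` comment lines carry the Slurm directives; everything after them runs from top to bottom. The sketch below is a generic minimal example, not the contents of the Amber script above, and the cluster, module, and executable names are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=my_job        # name shown in squeue output
#SBATCH --clusters=smp           # target cluster: smp, htc, mpi, or gpu
#SBATCH --nodes=1                # number of nodes
#SBATCH --ntasks-per-node=1      # tasks per node
#SBATCH --cpus-per-task=1        # CPU cores per task
#SBATCH --time=01:00:00          # wall-time limit (HH:MM:SS)

module purge
module load <module_name>        # load the software your job needs

srun <your_executable>           # the actual computation
```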

How do you get a summary of the status of your submitted jobs? The command is `squeue -M <cluster> -u $USER`, where the value for the
`<cluster>` variable can be a single cluster or a comma-separated list of clusters drawn from `smp`, `htc`, `mpi`, and `gpu`. The value `all`
for the `-M` flag will output jobs for all of the clusters. If you leave out the `-u $USER` option, `squeue` will output the status of
all jobs on the selected cluster(s).

!!! example "squeue -M &lt;cluster> -u $USER"

=== "command"
```commandline
squeue -M gpu -u $USER
```

=== "output"
```bash
[[email protected] mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
CLUSTER: gpu
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
956929 l40s gpus-1 kimwong R 0:24 1 gpu-n63
[[email protected] mocvnhlysm_1L40S.1C]$
```
!!! note
The output shows that job 956929 on the l40s partition of the gpu cluster has been running (ST=R, i.e., the job state
is Running) for 24 seconds on node gpu-n63.

To obtain detailed information about a submitted job, you can use the `scontrol` command with the `JobID`:

!!! example "scontrol -M &lt;cluster> show job &lt;JobID>""

=== "command"
```commandline
scontrol -M gpu show job 956929
```

=== "output"
```bash
[[email protected] mocvnhlysm_1L40S.1C]$scontrol -M gpu show job 956929
JobId=956929 JobName=gpus-1
UserId=kimwong(15083) GroupId=sam(16036) MCS_label=N/A
Priority=14128 Nice=0 Account=sam QOS=gpu-l40s-s
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:01:23 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2024-08-14T15:14:09 EligibleTime=2024-08-14T15:14:09
AccrueTime=2024-08-14T15:14:09
StartTime=2024-08-14T15:14:09 EndTime=2024-08-15T15:14:09 Deadline=N/A
PreemptEligibleTime=2024-08-14T15:14:09 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-14T15:14:09 Scheduler=Main
Partition=l40s AllocNode:Sid=login4:29378
ReqNodeList=(null) ExcNodeList=(null)
NodeList=gpu-n63
BatchHost=gpu-n63
NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=125G,node=1,billing=8,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/amber.slurm
WorkDir=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C
StdErr=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/gpus-1.out
StdIn=/dev/null
StdOut=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/gpus-1.out
Power=
CpusPerTres=gpu:16
TresPerNode=gres:gpu:1
[[email protected] mocvnhlysm_1L40S.1C]$
```

Lastly, if you have submitted a job and realize that you made a mistake in the submission file, you can
use the `scancel` command to cancel the job identified by its `JobID`:

!!! example "scancel -M &lt;cluster> &lt;JobID>""

=== "command"
```commandline
scancel -M gpu 956929
```

=== "output"
```bash
[[email protected] mocvnhlysm_1L40S.1C]$scancel -M gpu 956929
[[email protected] mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
CLUSTER: gpu
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
956929 l40s gpus-1 kimwong CG 2:47 1 gpu-n63
[[email protected] mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
CLUSTER: gpu
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[[email protected] mocvnhlysm_1L40S.1C]$
```