* Added a Getting Started section.
* fix relative paths, add terminal page to nav
* small formatting fixes
* fixes to formatting for software loading page
* small fixes to formatting in remaining docs
* Improved navigation for getting_started by enabling navigation.indexes and restructuring file hierarchy.
* Fixed formatting that was flagged by codacy.
* fix relative paths
* fix remaining codacy issues with references in code fences
* Used pymdownx.snippets to enable reuse of md content in different nav categories.
* Fixed formatting flagged by codacy.
* Fixed more formatting flagged by codacy.
* Fixed more formatting flagged by codacy.
* fix relative paths
* add snippets to not-in-nav section of config
* include jupyter-ondemand.md under web-portals
* attempt to fix codacy error about atx headings
* actual attempt at fixing atx headers
* pull unnecessary heading

---------

Co-authored-by: Kim Wong <[email protected]>
Co-authored-by: Comeani <[email protected]>
Commit 7b894fb (1 parent: c3d6124)
Showing 31 changed files with 1,714 additions and 837 deletions.
@@ -0,0 +1,24 @@
---
hide:
  - toc
---

# Step 1 - Getting an Account

Access to the CRC Ecosystem requires a CRC account and an accompanying resource allocation.
All active Pitt faculty, instructors, or center directors can request a resource allocation using the webform in
our [**service catalog**](https://crc.pitt.edu/service-request-forms).

The CRC Ecosystem is hosted at the Pitt data center and is firewalled within PittNet. You will first need to establish
a [**VPN**](https://services.pitt.edu/TDClient/33/Portal/KB/ArticleDet?ID=293) connection in order to gain access.

A schematic of this part of the process is highlighted below.

![GETTING-STARTED-MAP](../_assets/img/getting-started/getting-started-step-1.png)

<ins>**Definitions**</ins>

* **Resource allocation** -- an allotment of computing time and/or data storage quota
* **Client** -- your computer or internet-connected device
* **PittNet** -- the internal University network
* **Firewall** -- a network security device that monitors and filters incoming and outgoing network traffic based on an organization's previously established security policies
@@ -0,0 +1,78 @@
---
hide:
  - toc
---

# Big Picture Overview

The University of Pittsburgh provides its research community access to high performance computing and data storage
resources. These systems are maintained and supported through the Center for Research Computing (CRC) and Pitt IT.
To get started, you will need a CRC account, which you will use to log in to Access Portals to interact with
the CRC Ecosystem.

A schematic of the process is depicted below.

![GETTING-STARTED-MAP](../_assets/img/getting-started/getting-started-map.png)

<ins>**Definitions**</ins>

* **Client** -- your computer or internet-connected device
* **Access Portal** -- one of several remote servers used to submit jobs to the high performance computing clusters or to perform
  data management operations
* **CRC Ecosystem** -- the total footprint of the CRC infrastructure, including high performance computing
  clusters, data storage systems, networking equipment, and software

<ins>**Available Resources**</ins>

<link rel="stylesheet" href="https://cdn.datatables.net/1.13.4/css/jquery.dataTables.min.css">
<table class="display cell-border" id="aTable">
  <thead>
    <tr>
      <td>Cluster Acronym</td>
      <td>Full Form of Acronym</td>
      <td>Description of Use Cases</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>mpi</td>
      <td>Message Passing Interface</td>
      <td>For tightly coupled parallel codes that use the Message Passing Interface APIs for distributing computation
      across multiple nodes, each with its own memory space</td>
    </tr>
    <tr>
      <td>htc</td>
      <td>High Throughput Computing</td>
      <td>For genomics and other health sciences-related workflows that can run on a single node</td>
    </tr>
    <tr>
      <td>smp</td>
      <td>Shared Memory Processing</td>
      <td>For jobs that can run on a single node where the CPU cores share a common memory space</td>
    </tr>
    <tr>
      <td>gpu</td>
      <td>Graphics Processing Unit</td>
      <td>For AI/ML applications and physics-based simulation codes written to take advantage of accelerated
      computing on GPU cores</td>
    </tr>
  </tbody>
</table>

<script type="text/javascript" src="https://code.jquery.com/jquery-3.7.0.min.js"></script>
<script type="text/javascript" src="https://cdn.datatables.net/1.13.4/js/jquery.dataTables.min.js"></script>

<script type="text/javascript">
  // Render the cluster table as a static DataTable: no paging, search box,
  // sorting, or info footer (the table is small and purely descriptive).
  $(document).ready(function() {
    $('#aTable').DataTable({
      "paging": false,
      "lengthChange": false,
      "searching": false,
      "info": false,
      "autoWidth": false,
      "ordering": false
    });
  });
</script>
@@ -0,0 +1 @@
--8<-- "jupyter-hub.md" |
@@ -0,0 +1 @@
--8<-- "open-ondemand.md" |
@@ -0,0 +1,32 @@
---
hide:
  - toc
---

# Step 2 - Login to Access Portals

Once you have established a VPN connection to PittNet, you can access the CRC advanced computing and storage resources via
several portals, including

* [**SSH connection using a terminal**](../terminal.md) (see the example below)
* [**Linux Desktop webportal**](../viz.md)
* [**Open OnDemand webportal**](../open-ondemand.md)
* [**JupyterHub webportal**](../jupyter-hub.md)
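
For example, an SSH connection from a terminal takes the form sketched below. Note the host shown is an assumption: `h2p.crc.pitt.edu` is used here as the cluster login host, and `<PittUsername>` is a placeholder for your Pitt account; see the terminal page linked above for specifics.

```commandline
ssh <PittUsername>@h2p.crc.pitt.edu
```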

A schematic of this part of the process is highlighted below.

![GETTING-STARTED-MAP](../../_assets/img/getting-started/getting-started-step-2.png)

## Guidance on appropriate usage of access portals

Many users are logged into the CRC login nodes at any given time. These are the gateways everyone uses to perform interactive
work like editing code, submitting and checking the status of jobs, etc.

Executing processing scripts or commands on these nodes can cause substantial slowdowns for the rest of the users.
For this reason, it is important to make sure that this kind of work is done either in an interactive session on a node
from one of the clusters, or as a batch job submission (see the sketch below).
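
As a sketch of the interactive route, a session can be requested with standard Slurm `srun`. The cluster, partition, and limits below are illustrative values only; check the cluster documentation for real ones.

```commandline
srun -M smp --partition=smp --ntasks=1 --time=01:00:00 --pty bash
```

Once the session starts, your shell runs on a compute node, so resource-intensive work no longer competes with other users on the login node.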

Resource-intensive processes found to be running on the login nodes may be killed at any time.

<ins>**The CRC team reserves the right to revoke cluster access of any user who repeatedly causes slowdowns on the login
nodes with processes that can otherwise be run on the compute nodes.**</ins>
docs/getting-started/step3/getting-started-step3-manage-jobs.md (140 additions, 0 deletions)
@@ -0,0 +1,140 @@
---
hide:
  - toc
---

# How to Manage Computing Jobs

??? abstract "Skip to Table of Commands"
    | Command                                   | Description                                               |
    | :---------------------------------------- | :-------------------------------------------------------- |
    | `sbatch <job_script>`                     | Submit `<job_script>` to the Slurm scheduler               |
    | `squeue -M <cluster> -u $USER`            | Display my queued jobs that were submitted to `<cluster>`  |
    | `scontrol -M <cluster> show job <JobID>`  | Display details about job `<JobID>` on `<cluster>`         |
    | `scancel -M <cluster> <JobID>`            | Cancel job `<JobID>` that was submitted to `<cluster>`     |

Now that you have crafted a job submission script, how do you submit it and manage the job? We will use the Amber example, which can be found in

```bash
/ihome/crc/getting_started/mocvnhlysm_1L40S.1C
```

to drive the discussion. The command for submitting a job script is `sbatch <job_script>`, (1) where `<job_script>` is a text file containing
Slurm directives and commands that will be executed from top to bottom. It does not matter whether the job submission script ends with a `.slurm`
extension; our recommendation is to adopt a naming convention that makes it simple to spot the job submission script among all your files. To submit
the Amber job to Slurm, execute `sbatch amber.slurm` on the command line: (2)
{ .annotate }

1. Throughout the examples, we use the conventional syntax `<variable>` to represent a placeholder for an expected value that the user
   will provide.
2. ![content_tabs](../../_assets/img/help-annotation/mkdocs_example_tabs.png)

!!! example "sbatch <job_script>"

    === "command"
        ```commandline
        sbatch amber.slurm
        ```

    === "output"
        ```bash
        [kimwong@login4 mocvnhlysm_1L40S.1C]$sbatch amber.slurm
        Submitted batch job 956929 on cluster gpu
        [kimwong@login4 mocvnhlysm_1L40S.1C]$
        ```

!!! note
    Every job submission has an assigned Job ID associated with it. In this example, the Job ID is 956929.
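
For reference, a job submission script generally has the shape sketched below. The directive values and module name are illustrative placeholders, not the contents of the actual `amber.slurm`.

```bash
#!/bin/bash
#SBATCH --job-name=my-job      # name shown in squeue output
#SBATCH --clusters=gpu         # target cluster: smp, htc, mpi, or gpu
#SBATCH --partition=l40s       # partition within that cluster
#SBATCH --nodes=1              # number of nodes to allocate
#SBATCH --ntasks=1             # number of tasks to launch
#SBATCH --time=01:00:00        # walltime limit (HH:MM:SS)

# Commands below run on the allocated compute node, top to bottom.
module load my-software        # hypothetical module name; load what your job needs
./my_program                   # replace with the command(s) your job runs
```

As an aside, standard Slurm `sbatch` also accepts a `--parsable` flag that prints just the Job ID (on multi-cluster setups it may be followed by a semicolon and the cluster name), which makes the ID easy to capture in scripts, e.g. `JOBID=$(sbatch --parsable amber.slurm)`.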

How do you get a summary of the status of your submitted jobs? The command is `squeue -M <cluster> -u $USER`, where the value for the
`<cluster>` variable can be any comma-separated combination of clusters, including `smp`, `htc`, `mpi`, and `gpu`. The value `all`
for the `-M` flag will output jobs for all the clusters. If you leave out the `-u $USER` option, `squeue` will output the status of
all jobs on the cluster(s).

!!! example "squeue -M <cluster> -u $USER"

    === "command"
        ```commandline
        squeue -M gpu -u $USER
        ```

    === "output"
        ```bash
        [kimwong@login4 mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
        CLUSTER: gpu
                     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                    956929      l40s   gpus-1  kimwong  R       0:24      1 gpu-n63
        [kimwong@login4 mocvnhlysm_1L40S.1C]$
        ```

!!! note
    The output shows that job 956929 on the l40s partition of the gpu cluster has been in the Running state (ST=R)
    for 24 seconds on node gpu-n63.
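
If you want to monitor the queue while a job is pending or running, one lightweight option is the standard `watch` utility, which reruns `squeue` at a fixed interval (every 30 seconds in this sketch):

```commandline
watch -n 30 squeue -M gpu -u $USER
```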

To obtain detailed information about a submitted job, you can use the `scontrol` command with the `JobID`:

!!! example "scontrol -M <cluster> show job <JobID>"

    === "command"
        ```commandline
        scontrol -M gpu show job 956929
        ```

    === "output"
        ```bash
        [kimwong@login4 mocvnhlysm_1L40S.1C]$scontrol -M gpu show job 956929
        JobId=956929 JobName=gpus-1
        UserId=kimwong(15083) GroupId=sam(16036) MCS_label=N/A
        Priority=14128 Nice=0 Account=sam QOS=gpu-l40s-s
        JobState=RUNNING Reason=None Dependency=(null)
        Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
        RunTime=00:01:23 TimeLimit=1-00:00:00 TimeMin=N/A
        SubmitTime=2024-08-14T15:14:09 EligibleTime=2024-08-14T15:14:09
        AccrueTime=2024-08-14T15:14:09
        StartTime=2024-08-14T15:14:09 EndTime=2024-08-15T15:14:09 Deadline=N/A
        PreemptEligibleTime=2024-08-14T15:14:09 PreemptTime=None
        SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-14T15:14:09 Scheduler=Main
        Partition=l40s AllocNode:Sid=login4:29378
        ReqNodeList=(null) ExcNodeList=(null)
        NodeList=gpu-n63
        BatchHost=gpu-n63
        NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
        TRES=cpu=16,mem=125G,node=1,billing=8,gres/gpu=1
        Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
        MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
        Features=(null) DelayBoot=00:00:00
        OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
        Command=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/amber.slurm
        WorkDir=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C
        StdErr=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/gpus-1.out
        StdIn=/dev/null
        StdOut=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/gpus-1.out
        Power=
        CpusPerTres=gpu:16
        TresPerNode=gres:gpu:1
        [kimwong@login4 mocvnhlysm_1L40S.1C]$
        ```

Lastly, if you have submitted a job and realize that you made a mistake in the submission file, you can
use the `scancel` command to delete the job identified by its `JobID`:

!!! example "scancel -M <cluster> <JobID>"

    === "command"
        ```commandline
        scancel -M gpu 956929
        ```

    === "output"
        ```bash
        [kimwong@login4 mocvnhlysm_1L40S.1C]$scancel -M gpu 956929
        [kimwong@login4 mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
        CLUSTER: gpu
                     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                    956929      l40s   gpus-1  kimwong CG       2:47      1 gpu-n63
        [kimwong@login4 mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
        CLUSTER: gpu
                     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        [kimwong@login4 mocvnhlysm_1L40S.1C]$
        ```

!!! note
    Immediately after `scancel`, the job may briefly appear with ST=CG (completing) while Slurm cleans it up; the
    second `squeue` shows that it has left the queue.