* Added a Getting Started section.
* fix relative paths, add terminal page to nav
* small formatting fixes
* fixes to formatting for software loading page
* small fixes to formatting in remaining docs
* Improved navigation for getting_started by enabling navigation.indexes and restructuring file hierarchy.
* Fixed formatting that was flagged by codacy.
* fix relative paths
* fix remaining codacy issues with references in code fences
* Used pymdownx.snippets to enable reuse of md content in different nav categories.
* Fixed formatting flagged by codacy.
* Fixed more formatting flagged by codacy.
* Fixed more formatting flagged by codacy.
* fix relative paths
* add snippets to not-in-nav section of config
* include jupyter-ondemand.md under web-portals
* attempt to fix codacy error about atx headings
* actual attempt at fixing atx headers
* pull unnecessary heading

---------

Co-authored-by: Kim Wong <[email protected]>
Co-authored-by: Comeani <[email protected]>
Commit 7b894fb (1 parent: c3d6124)
Showing 31 changed files with 1,714 additions and 837 deletions.
@@ -0,0 +1,24 @@
---
hide:
  - toc
---

# Step 1 - Getting an Account

Access to the CRC Ecosystem requires a CRC account and an accompanying resource allocation.
All active Pitt faculty, instructors, or center directors can request a resource allocation using the webform in
our [**service catalog**](https://crc.pitt.edu/service-request-forms).

The CRC Ecosystem is hosted at the Pitt data center and is firewalled within PittNet. You will first need to establish
a [**VPN**](https://services.pitt.edu/TDClient/33/Portal/KB/ArticleDet?ID=293) connection in order to gain access.

A schematic of this part of the process is highlighted below.

![GETTING-STARTED-MAP](../_assets/img/getting-started/getting-started-step-1.png)

<ins>**Definitions**</ins>

* **Resource allocation** -- an allotment of computing time and/or data storage quota
* **Client** -- your computer or internet-connected device
* **PittNet** -- the internal University network
* **Firewall** -- a network security device that monitors and filters incoming and outgoing network traffic based on an organization's previously established security policies
@@ -0,0 +1,78 @@
---
hide:
  - toc
---

# Big Picture Overview

The University of Pittsburgh provides its research community access to high performance computing and data storage
resources. These systems are maintained and supported through the Center for Research Computing (CRC) and Pitt IT.
To get started, you will need a CRC account, which you will use to log in to Access Portals to interact with
the CRC Ecosystem.

A schematic of the process is depicted below.

![GETTING-STARTED-MAP](../_assets/img/getting-started/getting-started-map.png)

<ins>**Definitions**</ins>

* **Client** -- your computer or internet-connected device
* **Access Portal** -- one of several remote servers used to submit jobs to the high performance computing clusters or to perform
  data management operations
* **CRC Ecosystem** -- the total footprint of the CRC infrastructure, including high performance computing
  clusters, data storage systems, networking equipment, and software

<ins>**Available Resources**</ins>

<link rel="stylesheet" href="https://cdn.datatables.net/1.13.4/css/jquery.dataTables.min.css">
<table class="display cell-border" id="aTable">
  <thead>
    <tr>
      <td>Cluster Acronym</td>
      <td>Full Form of Acronym</td>
      <td>Description of Use Cases</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>mpi</td>
      <td>Message Passing Interface</td>
      <td>For tightly coupled parallel codes that use the Message Passing Interface APIs for distributing computation
      across multiple nodes, each with its own memory space</td>
    </tr>
    <tr>
      <td>htc</td>
      <td>High Throughput Computing</td>
      <td>For genomics and other health sciences-related workflows that can run on a single node</td>
    </tr>
    <tr>
      <td>smp</td>
      <td>Shared Memory Processing</td>
      <td>For jobs that can run on a single node where the CPU cores share a common memory space</td>
    </tr>
    <tr>
      <td>gpu</td>
      <td>Graphics Processing Unit</td>
      <td>For AI/ML applications and physics-based simulation codes written to take advantage of accelerated
      computing on GPU cores</td>
    </tr>
  </tbody>
</table>

<script type="text/javascript" src="https://code.jquery.com/jquery-3.7.0.min.js"></script>
<script type="text/javascript" src="https://cdn.datatables.net/1.13.4/js/jquery.dataTables.min.js"></script>

<script type="text/javascript">
  // Render the cluster table as a static DataTable: no paging, search box,
  // sorting, or info footer (the table is small and purely descriptive).
  $(document).ready(function() {
    $('#aTable').DataTable({
      "paging": false,
      "lengthChange": false,
      "searching": false,
      "info": false,
      "autoWidth": false,
      "ordering": false
    });
  });
</script>
@@ -0,0 +1 @@
--8<-- "jupyter-hub.md" |
@@ -0,0 +1 @@
--8<-- "open-ondemand.md" |
@@ -0,0 +1,32 @@
---
hide:
  - toc
---

# Step 2 - Login to Access Portals

Once you have established a VPN connection to PittNet, you can access the CRC advanced computing and storage resources via
several portals, including

* [**SSH connection using a terminal**](../terminal.md) (see the example below)
* [**Linux Desktop webportal**](../viz.md)
* [**Open OnDemand webportal**](../open-ondemand.md)
* [**JupyterHub webportal**](../jupyter-hub.md)
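
For example, an SSH connection from a terminal takes the form sketched below. Note the host shown is an assumption: `h2p.crc.pitt.edu` is used here as the cluster login host, and `<PittUsername>` is a placeholder for your Pitt account; see the terminal page linked above for specifics.

```commandline
ssh <PittUsername>@h2p.crc.pitt.edu
```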

A schematic of this part of the process is highlighted below.

![GETTING-STARTED-MAP](../../_assets/img/getting-started/getting-started-step-2.png)

## Guidance on appropriate usage of access portals

Many users are logged into the CRC login nodes at any given time. These are the gateways everyone uses to perform interactive
work like editing code, submitting and checking the status of jobs, etc.

Executing processing scripts or commands on these nodes can cause substantial slowdowns for the rest of the users.
For this reason, it is important to make sure that this kind of work is done either in an interactive session on a node
from one of the clusters, or as a batch job submission (see the sketch below).
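
As a sketch of the interactive route, a session can be requested with standard Slurm `srun`. The cluster, partition, and limits below are illustrative values only; check the cluster documentation for real ones.

```commandline
srun -M smp --partition=smp --ntasks=1 --time=01:00:00 --pty bash
```

Once the session starts, your shell runs on a compute node, so resource-intensive work no longer competes with other users on the login node.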

Resource-intensive processes found to be running on the login nodes may be killed at any time.

<ins>**The CRC team reserves the right to revoke cluster access of any user who repeatedly causes slowdowns on the login
nodes with processes that can otherwise be run on the compute nodes.**</ins>
docs/getting-started/step3/getting-started-step3-manage-jobs.md (140 additions, 0 deletions)
@@ -0,0 +1,140 @@
---
hide:
  - toc
---

# How to Manage Computing Jobs

??? abstract "Skip to Table of Commands"
    | Command                                   | Description                                               |
    | :---------------------------------------- | :-------------------------------------------------------- |
    | `sbatch <job_script>`                     | Submit `<job_script>` to the Slurm scheduler               |
    | `squeue -M <cluster> -u $USER`            | Display my queued jobs that were submitted to `<cluster>`  |
    | `scontrol -M <cluster> show job <JobID>`  | Display details about job `<JobID>` on `<cluster>`         |
    | `scancel -M <cluster> <JobID>`            | Cancel job `<JobID>` that was submitted to `<cluster>`     |

Now that you have crafted a job submission script, how do you submit it and manage the job? We will use the Amber example, which can be found in

```bash
/ihome/crc/getting_started/mocvnhlysm_1L40S.1C
```

to drive the discussion. The command for submitting a job script is `sbatch <job_script>`, (1) where `<job_script>` is a text file containing
Slurm directives and commands that will be executed from top to bottom. It does not matter whether the job submission script ends with a `.slurm`
extension; our recommendation is to adopt a naming convention that makes it simple to spot the job submission script among all your files. To submit
the Amber job to Slurm, execute `sbatch amber.slurm` on the command line: (2)
{ .annotate }

1. Throughout the examples, we use the conventional syntax `<variable>` to represent a placeholder for an expected value that the user
   will provide.
2. ![content_tabs](../../_assets/img/help-annotation/mkdocs_example_tabs.png)

!!! example "sbatch <job_script>"

    === "command"
        ```commandline
        sbatch amber.slurm
        ```

    === "output"
        ```bash
        [kimwong@login4 mocvnhlysm_1L40S.1C]$sbatch amber.slurm
        Submitted batch job 956929 on cluster gpu
        [kimwong@login4 mocvnhlysm_1L40S.1C]$
        ```

!!! note
    Every job submission has an assigned Job ID associated with it. In this example, the Job ID is 956929.
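
For reference, a job submission script generally has the shape sketched below. The directive values and module name are illustrative placeholders, not the contents of the actual `amber.slurm`.

```bash
#!/bin/bash
#SBATCH --job-name=my-job      # name shown in squeue output
#SBATCH --clusters=gpu         # target cluster: smp, htc, mpi, or gpu
#SBATCH --partition=l40s       # partition within that cluster
#SBATCH --nodes=1              # number of nodes to allocate
#SBATCH --ntasks=1             # number of tasks to launch
#SBATCH --time=01:00:00        # walltime limit (HH:MM:SS)

# Commands below run on the allocated compute node, top to bottom.
module load my-software        # hypothetical module name; load what your job needs
./my_program                   # replace with the command(s) your job runs
```

As an aside, standard Slurm `sbatch` also accepts a `--parsable` flag that prints just the Job ID (on multi-cluster setups it may be followed by a semicolon and the cluster name), which makes the ID easy to capture in scripts, e.g. `JOBID=$(sbatch --parsable amber.slurm)`.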

How do you get a summary of the status of your submitted jobs? The command is `squeue -M <cluster> -u $USER`, where the value for the
`<cluster>` variable can be any comma-separated combination of clusters, including `smp`, `htc`, `mpi`, and `gpu`. The value `all`
for the `-M` flag will output jobs for all the clusters. If you leave out the `-u $USER` option, `squeue` will output the status of
all jobs on the cluster(s).

!!! example "squeue -M <cluster> -u $USER"

    === "command"
        ```commandline
        squeue -M gpu -u $USER
        ```

    === "output"
        ```bash
        [kimwong@login4 mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
        CLUSTER: gpu
                     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                    956929      l40s   gpus-1  kimwong  R       0:24      1 gpu-n63
        [kimwong@login4 mocvnhlysm_1L40S.1C]$
        ```

!!! note
    The output shows that job 956929 on the l40s partition of the gpu cluster has been in the Running state (ST=R)
    for 24 seconds on node gpu-n63.
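
If you want to monitor the queue while a job is pending or running, one lightweight option is the standard `watch` utility, which reruns `squeue` at a fixed interval (every 30 seconds in this sketch):

```commandline
watch -n 30 squeue -M gpu -u $USER
```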

To obtain detailed information about a submitted job, you can use the `scontrol` command with the `JobID`:

!!! example "scontrol -M <cluster> show job <JobID>"

    === "command"
        ```commandline
        scontrol -M gpu show job 956929
        ```

    === "output"
        ```bash
        [kimwong@login4 mocvnhlysm_1L40S.1C]$scontrol -M gpu show job 956929
        JobId=956929 JobName=gpus-1
        UserId=kimwong(15083) GroupId=sam(16036) MCS_label=N/A
        Priority=14128 Nice=0 Account=sam QOS=gpu-l40s-s
        JobState=RUNNING Reason=None Dependency=(null)
        Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
        RunTime=00:01:23 TimeLimit=1-00:00:00 TimeMin=N/A
        SubmitTime=2024-08-14T15:14:09 EligibleTime=2024-08-14T15:14:09
        AccrueTime=2024-08-14T15:14:09
        StartTime=2024-08-14T15:14:09 EndTime=2024-08-15T15:14:09 Deadline=N/A
        PreemptEligibleTime=2024-08-14T15:14:09 PreemptTime=None
        SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-14T15:14:09 Scheduler=Main
        Partition=l40s AllocNode:Sid=login4:29378
        ReqNodeList=(null) ExcNodeList=(null)
        NodeList=gpu-n63
        BatchHost=gpu-n63
        NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
        TRES=cpu=16,mem=125G,node=1,billing=8,gres/gpu=1
        Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
        MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
        Features=(null) DelayBoot=00:00:00
        OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
        Command=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/amber.slurm
        WorkDir=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C
        StdErr=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/gpus-1.out
        StdIn=/dev/null
        StdOut=/ihome/crc/how_to_run/amber/ZZZ_test_amber24/mocvnhlysm_1L40S.1C/gpus-1.out
        Power=
        CpusPerTres=gpu:16
        TresPerNode=gres:gpu:1
        [kimwong@login4 mocvnhlysm_1L40S.1C]$
        ```

Lastly, if you have submitted a job and realize that you made a mistake in the submission file, you can
use the `scancel` command to delete the job identified by its `JobID`:

!!! example "scancel -M <cluster> <JobID>"

    === "command"
        ```commandline
        scancel -M gpu 956929
        ```

    === "output"
        ```bash
        [kimwong@login4 mocvnhlysm_1L40S.1C]$scancel -M gpu 956929
        [kimwong@login4 mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
        CLUSTER: gpu
                     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                    956929      l40s   gpus-1  kimwong CG       2:47      1 gpu-n63
        [kimwong@login4 mocvnhlysm_1L40S.1C]$squeue -M gpu -u $USER
        CLUSTER: gpu
                     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        [kimwong@login4 mocvnhlysm_1L40S.1C]$
        ```

!!! note
    Immediately after `scancel`, the job may briefly appear with ST=CG (completing) while Slurm cleans it up; the
    second `squeue` shows that it has left the queue.