Commit

Add Slurm support.
joaander committed May 7, 2024
1 parent dbd1aa3 commit 6968dfc
Showing 65 changed files with 4,857 additions and 878 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/test.yaml
@@ -106,7 +106,8 @@ jobs:
chmod a+x "$HOME/.cargo/bin/mdbook-linkcheck"
- name: Add linkcheck configuration
run: |
echo -e "[output.linkcheck]\nfollow-web-links=true" >> doc/book.toml
# echo -e "[output.linkcheck]\nfollow-web-links=true" >> doc/book.toml #TODO: enable web-link checks after row is public
echo -e "[output.linkcheck]\nfollow-web-links=false" >> doc/book.toml
cat doc/book.toml
- name: Build documentation
run: mdbook build doc
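After the "Add linkcheck configuration" step above runs, `doc/book.toml` ends with a fragment like this (a sketch; only the keys the `echo` writes are shown):

```toml
# Appended to doc/book.toml by the CI step above.
[output.linkcheck]
# Keep web-link checking disabled until the repository is public.
follow-web-links = false
```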
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -27,3 +27,10 @@ repos:
rev: v1.6.27
hooks:
- id: actionlint
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: 'v0.3.4'
hooks:
- id: ruff-format
- id: ruff

# TODO: add fix-license-header
49 changes: 49 additions & 0 deletions .ruff.toml
@@ -0,0 +1,49 @@
target-version = "py312"
line-length = 100

[lint]

extend-select = [
"A",
"B",
"D",
"E501",
"EM",
"I",
"ICN",
"ISC",
"N",
"NPY",
"PL",
"PT",
"RET",
"RUF",
"UP",
"W",
]

ignore = [
"N806", "N803", # Allow occasional use of uppercase variable and argument names (e.g. N).
"D107", # Do not document __init__ separately from the class.
"PLR09", # Allow "too many" statements/arguments/etc...
"N816", # Allow mixed case names like kT.
"RUF012", # Do not use typing hints.
]

[lint.pydocstyle]
convention = "google"

[lint.flake8-import-conventions]
# Prefer no import aliases
aliases = {}
# Always import hoomd and gsd without 'from'
banned-from = ["hoomd", "gsd"]

# Ban standard import conventions and force common packages to be imported by their actual name.
[lint.flake8-import-conventions.banned-aliases]
"numpy" = ["np"]
"pandas" = ["pd"]
"matplotlib" = ["mpl"]

[format]
quote-style = "single"
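The import conventions above enforce a plain `import package` style. A small Python sketch of code that satisfies them, using a stdlib module as a stand-in for the packages the config actually names:

```python
# Style the [lint.flake8-import-conventions] settings above enforce:
# import by the real module name, with no alias and no 'from' form.
# The config applies this to hoomd, gsd, numpy, pandas, and matplotlib;
# math stands in here so the example runs anywhere.
import math  # allowed

# Rejected under the analogous rules:
#   import numpy as np       (banned alias: "numpy" = ["np"])
#   from hoomd import md     (banned by banned-from = ["hoomd", "gsd"])

print(math.gcd(12, 18))  # use the full module name at call sites
```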
10 changes: 10 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
@@ -8,6 +8,7 @@ clap = { version = "4.5.3", features = ["derive", "env"] }
clap-verbosity-flag = "2.2.0"
console = "0.15.8"
env_logger = "0.11.3"
home = "0.5.9"
human_format = "1.1.0"
indicatif = "0.17.8"
indicatif-log-bridge = "0.2.2"
50 changes: 31 additions & 19 deletions DESIGN.md
@@ -51,11 +51,12 @@ Row is yet another workflow engine that automates the process of executing **act

Ideas:
* List scheduler jobs and show useful information.
* Cancel scheduler jobs specific to actions and/or directories.
* Command to uncomplete an action for a set of directories. This would remove the product files and
update the cache.
* Option for `scan` to clear the cache. This would allow users to discover
changed action names, changed products, and manually uncompleted actions.
* Require the use of a launcher when requesting more than one process?
* Some method to clear any cache (maybe this instead of uncomplete?). This would allow
users to discover changed action names, changed products, manually uncompleted
actions, and deal with corrupt cache files.

## Overview

@@ -69,11 +70,12 @@ dispatches calls to the library.

* `row`
* `cli` - Top level command line commands.
* `cluster` - Read the cluster configuration file, determine the active cluster, and
make the settings available.
* `launcher` - TODO: Separate from cluster/scheduler? Or embedded within?
* `project` - Combine the workflow, state, and scheduler into one object and provide methods
that work with the project as a whole.
* `cluster` - Read the `clusters.toml` configuration file, determine the active
cluster, and make the settings available.
* `launcher` - Read the `launchers.toml` configuration file, provide code to construct
commands with launcher prefixes.
* `project` - Combine the workflow, state, and scheduler into one object and provide
methods that work with the project as a whole.
* `scheduler` - Generic scheduler interface and implementations for shell and SLURM.
* `state` - Row's internal state of the workspace.
* `workflow` - Type defining workflow and utility functions to find and read it.
@@ -91,7 +93,7 @@ executed, Row checks in the current working directory (and recursively in parent
- The **workspace**
- path
- A static **value file**
- cluster specific
- cluster-specific `submit_options`
- account
- options
- setup script
@@ -106,7 +108,7 @@ executed, Row checks in the current working directory (and recursively in parent
- threads_per_process
- gpus_per_process
- walltime (either per_submission or per_directory)
- Cluster specific
- Cluster- and action-specific `submit_options`
- options
- setup
- partition
@@ -140,8 +142,8 @@ Row maintains the state of the workflow in several files:
* Cached copies of the user-provided static value file.
* `completed.postcard`
* Completion status for each **action**.
* `TODO: determine filename`
* The last submitted scheduler job ID for each **action**.
* `submitted.postcard`
* The last submitted job ID, referenced by action, directory, and cluster.

When Row updates the state, it collects the completion staging files and updates the entries in the
state accordingly. It also checks the scheduler for all known job IDs and removes any job IDs that
@@ -170,9 +172,6 @@ resource usage and asks for confirmation before submitting the **job(s)** to the
submitting, Row updates the **state** with a record of which scheduler **job** IDs were submitted
for each **action**/**directory** combination.

When run in an interactive terminal, show a progress bar when there is more than one job to submit.
TODO: Also print the job IDs submitted? Or only the job IDs and no progress bar?

Provide a --dry-run option that shows the user what **job** script(s) would be submitted.

End the remaining submission sequence on an error return value from the scheduler. Save the cache
@@ -191,17 +190,30 @@ generated by user input.
The group defining options **include** and **sort_by** use JSON pointer
syntax. This allows users to select any element of their value when defining groups.
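For example, group options using JSON pointers might look like the following sketch (field names and operator syntax are illustrative, drawn from the notes above rather than a final schema):

```toml
# Hypothetical group options; "/temperature" is a JSON pointer into
# each directory's static value file.
sort_by = ["/temperature"]
include = [["/pressure", "==", 1.0]]
```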

### Launcher configuration

Launchers define prefixes that go in front of commands. These prefixes (e.g.
`OMP_NUM_THREADS`, `srun`) take arguments when the user requests certain resources.
**Row** provides built-in support for OpenMP and MPI on the built-in clusters. Users
can override these and provide new launchers in `launchers.toml`.

Each launcher optionally emits an `executable` and arguments for:
* total number of processes
* threads per process
* GPUs per process

Each argument is emitted only when the launcher defines it and the user requests the
relevant resource.
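A minimal sketch of that assembly logic (function and key names are hypothetical; the real implementation lives in the Rust `launcher` module):

```python
def build_prefix(launcher, processes=None, threads_per_process=None,
                 gpus_per_process=None):
    """Assemble a launcher prefix from the resources the user requested.

    An argument is emitted only when the launcher defines it *and* the
    user requested that resource; the executable itself is optional.
    """
    parts = []
    if launcher.get('executable'):
        parts.append(launcher['executable'])
    for key, value in (('processes', processes),
                       ('threads_per_process', threads_per_process),
                       ('gpus_per_process', gpus_per_process)):
        flag = launcher.get(key)
        if flag is not None and value is not None:
            parts.append(f'{flag}{value}')
    return ' '.join(parts)


# A hypothetical srun-style launcher definition (flag spellings illustrative).
srun = {'executable': 'srun',
        'processes': '--ntasks=',
        'gpus_per_process': '--gpus-per-task='}
print(build_prefix(srun, processes=8, gpus_per_process=1))
```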

### Cluster configuration

Row provides default configurations for many national HPC systems. Users can override these defaults
or define new systems in `$HOME/.config/row/clusters.toml`.

A single cluster defines:
* name
* launcher
* TODO: determine how to define launchers
* autodetect
* TODO: determine how to autodetect clusters
* identify: one of
* by_environment: [string, string]
* always: bool
* scheduler
* partition (listed in priority order)
* name
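The fields above might combine into a `clusters.toml` entry roughly like this sketch (key names and values are illustrative, not the final schema):

```toml
[[cluster]]
name = "greatlakes"
scheduler = "slurm"
# Identify the active cluster via an environment variable.
identify = { by_environment = ["CLUSTER_NAME", "greatlakes"] }

# Partitions are listed in priority order.
[[cluster.partition]]
name = "standard"

[[cluster.partition]]
name = "gpu"
```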
10 changes: 6 additions & 4 deletions README.md
@@ -7,6 +7,8 @@ actions have been submitted on which directories so that you don't submit the sa
twice. Once a job completes, subsequent actions become eligible, allowing you to process
your entire workflow to completion over many submissions.

The name is "row" as in "row, row, row your boat".

Notable features:
* Support both arbitrary directories and [signac](https://signac.io) workspaces.
* Execute actions via arbitrary shell commands.
@@ -16,9 +18,11 @@ Notable features:
* Execute groups in serial or parallel.
* Schedule CPU and GPU resources.
* Automatically determine the partition based on the batch job size.
* Includes configure for many national and university HPC systems.
* Built-in configurations for many national and university HPC systems.
* Add custom cluster definitions for your resources.

TODO: better demo script to get output for README and row show documentation examples.

For example:
```bash
> row show status
@@ -29,7 +33,7 @@ two 0 200 800 1000 8K GPU-hours

```bash
> row show directories --value "/value"
Directory Status Job /value
Directory Status Job ID /value
dir1 submitted 1432876 0.9
dir2 submitted 1432876 0.8
dir3 submitted 1432876 0.7
@@ -41,5 +45,3 @@ dir6 completed 0.3

**Row** is a spiritual successor to
[signac-flow](https://docs.signac.io/projects/flow/en/latest/).

The name is "row" as in "row, row, row your boat".
25 changes: 18 additions & 7 deletions doc/src/SUMMARY.md
@@ -9,34 +9,45 @@
- [Hello, workflow!](guide/tutorial/hello.md)
- [Managing multiple actions](guide/tutorial/multiple.md)
- [Grouping directories](guide/tutorial/group.md)
- [Submitting jobs to a scheduler]()
- [Best practices for actions]()
- [Submitting jobs manually](guide/tutorial/scheduler.md)
- [Requesting resources with row](guide/tutorial/resources.md)
- [Submitting jobs with row](guide/tutorial/submit.md)
- [Using row with Python and signac](guide/python/index.md)
- [Working with signac projects](guide/python/signac.md)
- [Writing action commands in Python](guide/python/actions.md)
- [Concepts](guide/concepts/index.md)
- [Best practices](guide/concepts/best-practices.md)
- [Process parallelism](guide/concepts/process-parallelism.md)
- [Thread parallelism](guide/concepts/thread-parallelism.md)
- [Directory status](guide/concepts/status.md)
- [The row cache](guide/concepts/cache.md)
- [JSON pointers](guide/concepts/json-pointers.md)
- [The row cache](guide/concepts/cache.md)
# Reference

- [row](row/index.md)
- [init](row/init.md)
- [show status](row/show-status.md)
- [show directories](row/show-directories.md)
- [submit](row/submit.md)
- [show](row/show/index.md)
- [show status](row/show/status.md)
- [show directories](row/show/directories.md)
- [show cluster](row/show/cluster.md)
- [show launchers](row/show/launchers.md)
- [scan](row/scan.md)
- [uncomplete](row/uncomplete.md)

- [`workflow.toml`](workflow/index.md)
- [workspace](workflow/workspace.md)
- [cluster](workflow/cluster.md)
- [submit_options](workflow/submit-options.md)
- [action](workflow/action/index.md)
- [group](workflow/action/group.md)
- [resources](workflow/action/resources.md)
- [cluster](workflow/action/cluster.md)
- [submit_options](workflow/action/submit-options.md)
- [`clusters.toml`](clusters/index.md)
- [cluster](clusters/cluster.md)
- [Built-in clusters](clusters/built-in.md)
- [`launchers.toml`](launchers/index.md)
- [Launcher configuration](launchers/launcher.md)
- [Built-in launchers](launchers/built-in.md)
- [Environment variables](env.md)

# Appendix
49 changes: 48 additions & 1 deletion doc/src/clusters/built-in.md
@@ -1,3 +1,50 @@
# Built-in clusters

TODO: Write this document.
**Row** includes built-in support for the following clusters.

## Anvil (Purdue)

[Anvil documentation](https://www.rcac.purdue.edu/knowledge/anvil).

**Row** automatically selects from the following partitions:
* `shared`
* `wholenode`
* `gpu`

Other partitions may be selected manually.

There is no need to set `--mem-per-*` options on Anvil as the cluster automatically
chooses the largest amount of memory available per core by default.

## Delta (NCSA)

[Delta documentation](https://docs.ncsa.illinois.edu/systems/delta).

**Row** automatically selects from the following partitions:
* `cpu`
* `gpuA100x4`

Other partitions may be selected manually.

Delta jobs default to a small amount of memory per core. **Row** inserts `--mem-per-cpu`
or `--mem-per-gpu` to select the maximum amount of memory possible that allows full-node
jobs and does not incur extra charges.
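As an illustration of that behavior, the header of a generated CPU job script might include lines like these (flag values are hypothetical, not Delta's actual limits):

```shell
#SBATCH --partition=cpu
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=2000M   # inserted by row to avoid the small default
```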

## Great Lakes (University of Michigan)

[Great Lakes documentation](https://arc.umich.edu/greatlakes/).

**Row** automatically selects from the following partitions:
* `standard`
* `gpu_mig40,gpu`
* `gpu`

Other partitions may be selected manually.

Great Lakes jobs default to a small amount of memory per core. **Row** inserts
`--mem-per-cpu` or `--mem-per-gpu` to select the maximum amount of memory possible that
allows full-node jobs and does not incur extra charges.

> Note: The `gpu_mig40,gpu` partition is selected only when there is one GPU per job.
> This is a combination of 2 partitions which decreases queue wait time due to the
> larger number of nodes that can run your job.