Commit

Add Slurm support.
joaander committed May 7, 2024
1 parent dbd1aa3 commit 6968dfc
Showing 65 changed files with 4,857 additions and 878 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/test.yaml
@@ -106,7 +106,8 @@ jobs:
chmod a+x "$HOME/.cargo/bin/mdbook-linkcheck"
- name: Add linkcheck configuration
run: |
echo -e "[output.linkcheck]\nfollow-web-links=true" >> doc/book.toml
# echo -e "[output.linkcheck]\nfollow-web-links=true" >> doc/book.toml #TODO: enable web-link checks after row is public
echo -e "[output.linkcheck]\nfollow-web-links=false" >> doc/book.toml
cat doc/book.toml
- name: Build documentation
run: mdbook build doc
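After the "Add linkcheck configuration" step above runs, `doc/book.toml` ends with a fragment like this (a sketch; only the keys the `echo` writes are shown):

```toml
# Appended to doc/book.toml by the CI step above.
[output.linkcheck]
# Keep web-link checking disabled until the repository is public.
follow-web-links = false
```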
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -27,3 +27,10 @@ repos:
rev: v1.6.27
hooks:
- id: actionlint
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: 'v0.3.4'
hooks:
- id: ruff-format
- id: ruff

# TODO: add fix-license-header
49 changes: 49 additions & 0 deletions .ruff.toml
@@ -0,0 +1,49 @@
target-version = "py312"
line-length = 100

[lint]

extend-select = [
"A",
"B",
"D",
"E501",
"EM",
"I",
"ICN",
"ISC",
"N",
"NPY",
"PL",
"PT",
"RET",
"RUF",
"UP",
"W",
]

ignore = [
"N806", "N803", # Allow occasional use of uppercase variable and argument names (e.g. N).
"D107", # Do not document __init__ separately from the class.
"PLR09", # Allow "too many" statements/arguments/etc...
"N816", # Allow mixed case names like kT.
"RUF012", # Do not use typing hints.
]

[lint.pydocstyle]
convention = "google"

[lint.flake8-import-conventions]
# Prefer no import aliases
aliases = {}
# Always import hoomd and gsd without 'from'
banned-from = ["hoomd", "gsd"]

# Ban standard import conventions and force common packages to be imported by their actual name.
[lint.flake8-import-conventions.banned-aliases]
"numpy" = ["np"]
"pandas" = ["pd"]
"matplotlib" = ["mpl"]

[format]
quote-style = "single"
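The import conventions above enforce a plain `import package` style. A small Python sketch of code that satisfies them, using a stdlib module as a stand-in for the packages the config actually names:

```python
# Style the [lint.flake8-import-conventions] settings above enforce:
# import by the real module name, with no alias and no 'from' form.
# The config applies this to hoomd, gsd, numpy, pandas, and matplotlib;
# math stands in here so the example runs anywhere.
import math  # allowed

# Rejected under the analogous rules:
#   import numpy as np       (banned alias: "numpy" = ["np"])
#   from hoomd import md     (banned by banned-from = ["hoomd", "gsd"])

print(math.gcd(12, 18))  # use the full module name at call sites
```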
10 changes: 10 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
@@ -8,6 +8,7 @@ clap = { version = "4.5.3", features = ["derive", "env"] }
clap-verbosity-flag = "2.2.0"
console = "0.15.8"
env_logger = "0.11.3"
home = "0.5.9"
human_format = "1.1.0"
indicatif = "0.17.8"
indicatif-log-bridge = "0.2.2"
50 changes: 31 additions & 19 deletions DESIGN.md
@@ -51,11 +51,12 @@ Row is yet another workflow engine that automates the process of executing **act

Ideas:
* List scheduler jobs and show useful information.
* Cancel scheduler jobs specific to actions and/or directories.
* Command to uncomplete an action for a set of directories. This would remove the product files and
update the cache.
* Option for `scan` to clear the cache. This would allow users to discover
changed action names, changed products, and manually uncompleted actions.
* Require the use of a launcher when requesting more than one process?
* Some method to clear any cache (maybe this instead of uncomplete?). This would allow
users to discover changed action names, changed products, manually uncompleted
actions, and deal with corrupt cache files.

## Overview

@@ -69,11 +70,12 @@ dispatches calls to the library.

* `row`
* `cli` - Top level command line commands.
* `cluster` - Read the cluster configuration file, determine the active cluster, and
make the settings available.
* `launcher` - TODO: Separate from cluster/scheduler? Or embedded within?
* `project` - Combine the workflow, state, and scheduler into one object and provide methods
that work with the project as a whole.
* `cluster` - Read the `clusters.toml` configuration file, determine the active
cluster, and make the settings available.
* `launcher` - Read the `launchers.toml` configuration file, provide code to construct
commands with launcher prefixes.
* `project` - Combine the workflow, state, and scheduler into one object and provide
methods that work with the project as a whole.
* `scheduler` - Generic scheduler interface and implementations for shell and SLURM.
* `state` - Row's internal state of the workspace.
* `workflow` - Type defining workflow and utility functions to find and read it.
@@ -91,7 +93,7 @@ executed, Row checks in the current working directory (and recursively in parent
- The **workspace**
- path
- A static **value file**
- cluster specific
- cluster-specific `submit_options`
- account
- options
- setup script
@@ -106,7 +108,7 @@ executed, Row checks in the current working directory (and recursively in parent
- threads_per_process
- gpus_per_process
- walltime (either per_submission or per_directory)
- Cluster specific
- Cluster- and action-specific `submit_options`
- options
- setup
- partition
@@ -140,8 +142,8 @@ Row maintains the state of the workflow in several files:
* Cached copies of the user-provided static value file.
* `completed.postcard`
* Completion status for each **action**.
* `TODO: determine filename`
* The last submitted scheduler job ID for each **action**.
* `submitted.postcard`
* The last submitted job ID, referenced by action, directory, and cluster.

When Row updates the state, it collects the completion staging files and updates the entries in the
state accordingly. It also checks the scheduler for all known job IDs and removes any job IDs that
@@ -170,9 +172,6 @@ resource usage and asks for confirmation before submitting the **job(s)** to the
submitting, Row updates the **state** with a record of which scheduler **job** IDs were submitted
for each **action**/**directory** combination.

When run in an interactive terminal, show a progress bar when there is more than one job to submit.
TODO: Also print the job IDs submitted? Or only the job IDs and no progress bar?

Provide a --dry-run option that shows the user what **job** script(s) would be submitted.

End the remaining submission sequence on an error return value from the scheduler. Save the cache
@@ -191,17 +190,30 @@ generated by user input.
The group defining options **include** and **sort_by** use JSON pointer
syntax. This allows users to select any element of their value when defining groups.
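For example, group options using JSON pointers might look like the following sketch (field names and operator syntax are illustrative, drawn from the notes above rather than a final schema):

```toml
# Hypothetical group options; "/temperature" is a JSON pointer into
# each directory's static value file.
sort_by = ["/temperature"]
include = [["/pressure", "==", 1.0]]
```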

### Launcher configuration

Launchers define prefixes that go in front of commands. These prefixes (e.g.
`OMP_NUM_THREADS`, `srun`) take arguments when the user requests certain resources.
**Row** provides built-in support for OpenMP and MPI on the built-in clusters. Users
can override these and provide new launchers in `launchers.toml`.

Each launcher optionally emits an `executable` and arguments for:
* total number of processes
* threads per process
* GPUs per process

Each argument is emitted only when the launcher defines it and the user requests the
relevant resource.
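A minimal sketch of that assembly logic (function and key names are hypothetical; the real implementation lives in the Rust `launcher` module):

```python
def build_prefix(launcher, processes=None, threads_per_process=None,
                 gpus_per_process=None):
    """Assemble a launcher prefix from the resources the user requested.

    An argument is emitted only when the launcher defines it *and* the
    user requested that resource; the executable itself is optional.
    """
    parts = []
    if launcher.get('executable'):
        parts.append(launcher['executable'])
    for key, value in (('processes', processes),
                       ('threads_per_process', threads_per_process),
                       ('gpus_per_process', gpus_per_process)):
        flag = launcher.get(key)
        if flag is not None and value is not None:
            parts.append(f'{flag}{value}')
    return ' '.join(parts)


# A hypothetical srun-style launcher definition (flag spellings illustrative).
srun = {'executable': 'srun',
        'processes': '--ntasks=',
        'gpus_per_process': '--gpus-per-task='}
print(build_prefix(srun, processes=8, gpus_per_process=1))
```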

### Cluster configuration

Row provides default configurations for many national HPC systems. Users can override these defaults
or define new systems in `$HOME/.config/row/clusters.toml`.

A single cluster defines:
* name
* launcher
* TODO: determine how to define launchers
* autodetect
* TODO: determine how to autodetect clusters
* identify: one of
* by_environment: [string, string]
* always: bool
* scheduler
* partition (listed in priority order)
* name
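The fields above might combine into a `clusters.toml` entry roughly like this sketch (key names and values are illustrative, not the final schema):

```toml
[[cluster]]
name = "greatlakes"
scheduler = "slurm"
# Identify the active cluster via an environment variable.
identify = { by_environment = ["CLUSTER_NAME", "greatlakes"] }

# Partitions are listed in priority order.
[[cluster.partition]]
name = "standard"

[[cluster.partition]]
name = "gpu"
```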
10 changes: 6 additions & 4 deletions README.md
@@ -7,6 +7,8 @@ actions have been submitted on which directories so that you don't submit the sa
twice. Once a job completes, subsequent actions become eligible, allowing you to process
your entire workflow to completion over many submissions.

The name is "row" as in "row, row, row your boat".

Notable features:
* Support both arbitrary directories and [signac](https://signac.io) workspaces.
* Execute actions via arbitrary shell commands.
@@ -16,9 +18,11 @@ Notable features:
* Execute groups in serial or parallel.
* Schedule CPU and GPU resources.
* Automatically determine the partition based on the batch job size.
* Includes configure for many national and university HPC systems.
* Built-in configurations for many national and university HPC systems.
* Add custom cluster definitions for your resources.

TODO: better demo script to get output for README and row show documentation examples.

For example:
```bash
> row show status
@@ -29,7 +33,7 @@ two 0 200 800 1000 8K GPU-hours

```bash
> row show directories --value "/value"
Directory Status Job /value
Directory Status Job ID /value
dir1 submitted 1432876 0.9
dir2 submitted 1432876 0.8
dir3 submitted 1432876 0.7
@@ -41,5 +45,3 @@ dir6 completed 0.3

**Row** is a spiritual successor to
[signac-flow](https://docs.signac.io/projects/flow/en/latest/).

The name is "row" as in "row, row, row your boat".
25 changes: 18 additions & 7 deletions doc/src/SUMMARY.md
@@ -9,34 +9,45 @@
- [Hello, workflow!](guide/tutorial/hello.md)
- [Managing multiple actions](guide/tutorial/multiple.md)
- [Grouping directories](guide/tutorial/group.md)
- [Submitting jobs to a scheduler]()
- [Best practices for actions]()
- [Submitting jobs manually](guide/tutorial/scheduler.md)
- [Requesting resources with row](guide/tutorial/resources.md)
- [Submitting jobs with row](guide/tutorial/submit.md)
- [Using row with Python and signac](guide/python/index.md)
- [Working with signac projects](guide/python/signac.md)
- [Writing action commands in Python](guide/python/actions.md)
- [Concepts](guide/concepts/index.md)
- [Best practices](guide/concepts/best-practices.md)
- [Process parallelism](guide/concepts/process-parallelism.md)
- [Thread parallelism](guide/concepts/thread-parallelism.md)
- [Directory status](guide/concepts/status.md)
- [The row cache](guide/concepts/cache.md)
- [JSON pointers](guide/concepts/json-pointers.md)
- [The row cache](guide/concepts/cache.md)
# Reference

- [row](row/index.md)
- [init](row/init.md)
- [show status](row/show-status.md)
- [show directories](row/show-directories.md)
- [submit](row/submit.md)
- [show](row/show/index.md)
- [show status](row/show/status.md)
- [show directories](row/show/directories.md)
- [show cluster](row/show/cluster.md)
- [show launchers](row/show/launchers.md)
- [scan](row/scan.md)
- [uncomplete](row/uncomplete.md)

- [`workflow.toml`](workflow/index.md)
- [workspace](workflow/workspace.md)
- [cluster](workflow/cluster.md)
- [submit_options](workflow/submit-options.md)
- [action](workflow/action/index.md)
- [group](workflow/action/group.md)
- [resources](workflow/action/resources.md)
- [cluster](workflow/action/cluster.md)
- [submit_options](workflow/action/submit-options.md)
- [`clusters.toml`](clusters/index.md)
- [cluster](clusters/cluster.md)
- [Built-in clusters](clusters/built-in.md)
- [`launchers.toml`](launchers/index.md)
- [Launcher configuration](launchers/launcher.md)
- [Built-in launchers](launchers/built-in.md)
- [Environment variables](env.md)

# Appendix
49 changes: 48 additions & 1 deletion doc/src/clusters/built-in.md
@@ -1,3 +1,50 @@
# Built-in clusters

TODO: Write this document.
**Row** includes built-in support for the following clusters.

## Anvil (Purdue)

[Anvil documentation](https://www.rcac.purdue.edu/knowledge/anvil).

**Row** automatically selects from the following partitions:
* `shared`
* `wholenode`
* `gpu`

Other partitions may be selected manually.

There is no need to set `--mem-per-*` options on Anvil as the cluster automatically
chooses the largest amount of memory available per core by default.

## Delta (NCSA)

[Delta documentation](https://docs.ncsa.illinois.edu/systems/delta).

**Row** automatically selects from the following partitions:
* `cpu`
* `gpuA100x4`

Other partitions may be selected manually.

Delta jobs default to a small amount of memory per core. **Row** inserts `--mem-per-cpu`
or `--mem-per-gpu` to select the maximum amount of memory possible that allows full-node
jobs and does not incur extra charges.
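As an illustration of that behavior, the header of a generated CPU job script might include lines like these (flag values are hypothetical, not Delta's actual limits):

```shell
#SBATCH --partition=cpu
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=2000M   # inserted by row to avoid the small default
```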

## Great Lakes (University of Michigan)

[Great Lakes documentation](https://arc.umich.edu/greatlakes/).

**Row** automatically selects from the following partitions:
* `standard`
* `gpu_mig40,gpu`
* `gpu`

Other partitions may be selected manually.

Great Lakes jobs default to a small amount of memory per core. **Row** inserts
`--mem-per-cpu` or `--mem-per-gpu` to select the maximum amount of memory possible that
allows full-node jobs and does not incur extra charges.

> Note: The `gpu_mig40,gpu` partition is selected only when there is one GPU per job.
> This is a combination of 2 partitions which decreases queue wait time due to the
> larger number of nodes that can run your job.