Add new job submission limits to policy page (#59)
* add job submission limits to policy page

* Add exact values for MaxArraySize
Comeani authored Jun 18, 2024
1 parent 2e4fb01 commit 73c4425
Showing 1 changed file with 33 additions and 4 deletions.
37 changes: 33 additions & 4 deletions docs/policies/job-scheduling-policy.md
@@ -166,6 +166,35 @@ that has been marked as a dependency.</td>
</script>

### Reasons related to exceeding a usage limit:

#### JobArrayTaskLimit, QOSMaxJobsPerUserLimit and QOSMaxJobsPerAccountLimit
One or more of your jobs have exceeded limits in place on the number of jobs you can have in the queue
**that are actively accruing priority**. Jobs with this status will remain in the queue, but will not begin accruing
priority until other jobs from the submitting user have completed.

In most cases the **per-account limit is 500 jobs**, and the **per-user limit is 100 jobs**. You can use
`sacctmgr show qos format=Name%20,MaxJobsPA,MaxJobsPU,MaxSubmitJobsPA,MaxSubmitJobsPU,MaxTresPA%20` to view the limits
for any given QOS.
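
To see which of your queued jobs are currently held back by one of these limits, the pending reason can be listed with `squeue`; a minimal sketch (the format string is just one way to display it):
```
# List job ID, state (%T), and pending reason (%r) for your own jobs
squeue -u $USER --format="%.12i %.10T %.30r"
```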

The maximum job array size is 100 on the SMP, MPI, and HTC clusters (the GPU cluster is configured with a higher value). The array size limits are defined at the cluster configuration level:
```
[nlc60@login1 ~] : for cluster in smp mpi gpu htc; do echo $cluster; scontrol -M $cluster show config | grep MaxArraySize; done
smp
MaxArraySize = 100
mpi
MaxArraySize = 100
gpu
MaxArraySize = 1001
htc
MaxArraySize = 100
```
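
For example, a job array on one of these clusters has to keep its task indices below `MaxArraySize`; a minimal sketch of a 100-task submission script (the job name and task command are placeholders):
```
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --clusters=smp
# 100 tasks; the highest index must stay below MaxArraySize
#SBATCH --array=0-99

# Each task picks its own piece of work from the array index
echo "Running task ${SLURM_ARRAY_TASK_ID}"
```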

These limits exist to prevent users who batch submit large quantities of jobs in a loop or job array from having all of
their jobs at a higher priority than one-off submissions simply due to having submitted them all at once.

There is a hard limit of 1000 on the total number of jobs you can have submitted at once (including tasks within a job array). This
separate limit exists to prevent any one user from overwhelming the workload manager with a single, very large request for resources.
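
If a workload needs more tasks than a single array allows, it has to be split across several submissions; a hedged sketch of breaking 500 tasks into five 100-task arrays (the script name and `TASK_OFFSET` variable are illustrative, not part of any CRC tooling):
```
# Submit 500 tasks as five 100-task arrays, passing a different offset to each
for offset in 0 100 200 300 400; do
    sbatch --array=0-99 --export=ALL,TASK_OFFSET=${offset} my_tasks.slurm
done
```
Inside the batch script, the effective task number would then be `SLURM_ARRAY_TASK_ID + TASK_OFFSET`; the per-user and per-account accrual limits above still apply to the queued jobs.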

#### MaxMemoryPerAccount
The job exceeds the current within-group memory quota. The maximum quota available depends on the cluster and partition.
The table below gives the maximum memory (in GB) for each QOS in the clusters/partitions where it is defined.
@@ -237,13 +266,13 @@ your completed jobs:
Memory Efficiency: 14.29% of 900.00 GB
```
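
A per-job efficiency report like the one above can be generated for any finished job; assuming the standard Slurm `seff` utility is available on the login nodes, for example:
```
# Summarize CPU and memory efficiency for a completed job (replace the job ID)
seff 1234567
```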

#### AssocGrpBillingMinutes or AssocGrpBillingRunMinutesLimit
There are a few possible reasons for this:
- Your group's Allocation ("service units") usage has surpassed the limit specified in your active resource Allocation,
or your active Allocations have expired. You can double-check this with `crc-usage`.
[Please submit a new Resource Allocation Request following our guidelines](https://crc.pitt.edu/Pitt-CRC-Allocation-Proposal-Guidelines).
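
The group-level limits behind this reason are stored in the Slurm accounting database; a hedged sketch of inspecting them (replace `mygroup` with your own account name):
```
# Show the group billing-minute limits attached to your account
sacctmgr show assoc where account=mygroup format=Account,User,GrpTRESMins%30,GrpTRESRunMins%30
```
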
#### MaxTRESPerAccount, MaxCpuPerAccount, or MaxGRESPerAccount
In the table below, the group-based CPU limits (GPU limits for the gpu cluster) are presented for each QOS walltime length.
If your group requests more CPUs/GPUs than shown in this table, you will be forced to wait until your group's jobs finish.