More flexible Job listing/killing for Slurm #173
Conversation
Can you call squeue without the clusters argument?
No, squeue without the clusters argument will always return no jobs (the partition argument is optional and only used if a specific partition is actually set; there is a default partition but no default cluster). There are no duplicated job ids as far as I know. I can double check, but until now job.id has always been a unique identifier across all clusters.
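For concreteness, a hedged illustration of that behavior using standard squeue flags; "mpp2" is a made-up cluster name and the exact command batchtools issues may differ:

```r
# Hedged illustration, not the exact batchtools call: on a multi-cluster
# Slurm installation, listing jobs typically requires --clusters (-M).
ids <- system2("squeue",
               c("-h", "--format=%i", "-u", Sys.getenv("USER"),
                 "--clusters=mpp2"),
               stdout = TRUE)
# Without --clusters the same call returns nothing on such a site.
```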
How about this approach: […]
Ah wait, this does not work for …
@mllg why don't you simply expose the "args" from listJobs in the constructor, with your settings as the default? Then users can override this flexibly. Isn't that the normal trick? And it changes nothing for anybody else or for the internal code.
This here: […]
Just expose this as args.listjobsqueued (or whatever), with the string as a default?
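If I read the suggestion right, a minimal sketch could look like this (all names here, including args.listjobsqueued and the default flag string, are hypothetical, not the actual batchtools API):

```r
# Sketch: expose the squeue arguments in the constructor, defaulting to the
# currently hard-coded string, so users can override them per site.
makeClusterFunctionsSlurmSketch <- function(
    args.listjobsqueued = c("-h", "--format=%i", "-u", Sys.getenv("USER"),
                            "-t", "PD")) {
  listJobsQueued <- function(reg) {
    system2("squeue", args.listjobsqueued, stdout = TRUE)
  }
  # ... submitJob, killJob, listJobsRunning etc. stay as they are
  list(listJobsQueued = listJobsQueued)
}
```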
Can you please comment if …
I don't think this helps. I hope we don't need the clusters argument anymore once we get that to work. I think I'll take this rather ugly fix here and keep it as clusterFunctionsSlurmLRZ (or whatever) in the config repository for our project on the LRZ, since all cluster users link against my batchtools.conf file anyway. The perfect solution for us would be for clusters and partitions to be resources that can be set on the job level (which is already possible, I think), with the listing/killing calls taking the values from there.
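For what it's worth, the submission side of that already works; a hedged example (the resource names depend on what the template file reads, and "serial"/"serial_std" are made-up values):

```r
library(batchtools)
# Assumes the slurm template contains lines like
#   #SBATCH --clusters=<%= resources$clusters %>
#   #SBATCH --partition=<%= resources$partition %>
submitJobs(ids = 1:10,
           resources = list(clusters = "serial", partition = "serial_std"))
```

The missing half is exactly what this issue is about: listing and killing do not see those per-job values.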
#180?
I could do the same thing for partitions, but I really don't know what I'm doing. 😕
This does not solve the original cluster/partition issue, but @berndbischl's suggestion of exposing the arguments would solve a problem I am encountering where all of my SLURM jobs show up as expired until they are done. My computing cluster has its own version of squeue (see here), but it only recognizes …
OK, it looks like …
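With the exposed-arguments sketch above, a site with a nonstandard squeue could pass only the flags its wrapper understands, e.g.:

```r
# Hypothetical, building on the sketch above: keep only the flags the local
# squeue replacement actually recognizes.
cf <- makeClusterFunctionsSlurmSketch(
  args.listjobsqueued = c("--noheader", "--format=%i",
                          "-u", Sys.getenv("USER")))
```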
We frequently change clusters/partitions on our HPC, and setting the clusters in makeClusterFunctionSlurm() is not really practical (multiple users link to the same template/config file).
So we handle the clusters/partitions via resources. To make listing/killing of the jobs possible, we need to set the squeue arguments in the functions accordingly.
This is not really nice, but I can't think of another solution. As far as I know, you can't change the clusters argument of makeClusterFunctionSlurm after or while creating the registry.
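A rough sketch of how listing/killing could take the values from the resources (not working batchtools code; it assumes the cluster functions somehow get access to the resources a job was submitted with, here called "res"):

```r
# Build the Slurm selector arguments from a job's resources list instead of
# from constructor-time settings; c() silently drops the NULL branches.
slurmSelectorArgs <- function(res) {
  c(if (!is.null(res$clusters))  sprintf("--clusters=%s", res$clusters),
    if (!is.null(res$partition)) sprintf("--partition=%s", res$partition))
}

killJobSketch <- function(batch.id, res) {
  system2("scancel", c(slurmSelectorArgs(res), batch.id))
}
```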