Clustermq multiprocess for main jobs and ssh for worker jobs #198
Comments
Maybe […]. Just thinking out loud.
This is a rare edge case, and trying to build something directly into […]. For your use case, it seems like what we really want is heterogeneous transient workers, a problem best suited for the […].
For sure this is an edge case. I didn't think `targets` should change to accommodate this, but I was hoping to discuss whether it's possible with existing infrastructure. This actually works, although it's a bit hacky: I had to set the `clustermq.scheduler` option right before each `tar_make_clustermq()` call.

```r
library(targets)
tar_script({
  tmpl <- list(
    job_name = "name",
    partition = "partition",
    node = "node"
  )
  tar_option_set(
    resources = tmpl,
    deployment = "worker",
    storage = "main",
    retrieval = "main"
  )
  tar_pipeline(
    tar_target(id, 1:5),
    tar_target(
      foo,
      Sys.sleep(id),
      pattern = map(id)
    ),
    tar_target(
      bar,
      Sys.sleep(id),
      pattern = map(id)
    )
  )
})

# Run certain targets as multiprocess
withr::with_options(
  list(clustermq.scheduler = "multiprocess"),
  tar_make_clustermq(
    names = foo,
    workers = 2L,
    callr_function = NULL
  )
)

# Run others on the HPC via SSH
withr::with_options(
  list(clustermq.scheduler = "ssh", clustermq.ssh.host = "USER@DOMAIN"),
  tar_make_clustermq(
    names = bar,
    workers = 2L,
    callr_function = NULL
  )
)
```
Seems like that should work (even with the default […]).
Just thought of a super simple way to get around this in […]:

```r
# _targets.R:
library(targets)
future::plan(future::sequential)
tar_pipeline(
  tar_target(x, seq_len(4)),
  tar_target(
    y,
    Sys.sleep(30),
    pattern = map(x),
    resources = list(plan = future.callr::callr)
  )
)
```

```r
# R console:
library(targets)
tar_make_future(workers = 4)
```

The catch here is that it won't actually help you with […].
Not sure why multisession futures don't work this way, though. I will post details in another thread.
Cool, thanks for sharing. I haven't really ever taken the time to develop a good mental model for […]. Related set of questions:

Obviously my rare use-case fits in, but I am wondering whether it could be more generally useful.
Admittedly, I only use it for parallel processing on non-lazy transient workers. It does go a lot deeper in terms of the abstraction and asynchronicity.
I'm not sure what you mean.
I wouldn't count on global options with the […].
Hi Will,
I haven't thought this through well enough to really assess its feasibility, but I wanted to scribble my thoughts down and get your input. Not sure if we need to loop @mschubert in or not.
As you know, for one of my current projects I am using the "mostly local, sometimes remote" approach: my project lives on my local machine, but some computationally intensive tasks are selectively sent to the HPC via SSH thanks to `clustermq`. This works great. However, when using `options(clustermq.scheduler = "ssh")`, you have only two options: run jobs locally and sequentially in the `"main"` process, or send them via `ssh`. The majority of the tasks run in the `"main"` R process and are forced to run sequentially, all for the ability to send a few select jobs to the HPC.

So, long story short, I am wondering if it would somehow be possible to use `"multiprocess"` for jobs with `deployment = "main"` and `"ssh"` for targets with `deployment = "worker"`. I know this is a convoluted use-case, but I am actually constrained to using this workflow for this particular project and was just wondering if something like that could possibly work.

Reasons why I don't just run everything via `ssh`:

- Some of the tasks are trivial and quick, and the overhead of sending them to the HPC over sockets is unnecessary.
- Some of the targets rely on the local NFS for access to files which cannot be moved to the cluster or cloud.

Reasons why I don't just run everything locally:

- Memory constraints on my local machine.
- There are fewer computationally intensive tasks, but they sometimes take days to run.
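For context, `clustermq` chooses its backend from global R options rather than from function arguments, which is what makes the per-call scheduler switching discussed in this thread possible at all. A minimal sketch of the two option sets involved (the host string is a placeholder, not a real address):

```r
# clustermq reads its backend from global options, so the same pipeline
# code can be pointed at different backends between calls.
options(clustermq.scheduler = "multiprocess")  # local parallel workers

# ...and later, for work that should go to the HPC over SSH:
options(clustermq.scheduler = "ssh")
options(clustermq.ssh.host = "user@host")      # placeholder host

# Inspect what is currently set:
getOption("clustermq.scheduler")
```

Wrapping these in `withr::with_options()`, as in the example above, scopes each setting to a single `tar_make_clustermq()` call instead of mutating global state.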