Clustermq multiprocess for main jobs and ssh for worker jobs #198

Closed
mattwarkentin opened this issue Oct 19, 2020 · 8 comments

@mattwarkentin
Contributor

mattwarkentin commented Oct 19, 2020

Hi Will,

I haven't thought this through well enough to really assess its feasibility, but I wanted to scribble my thoughts down and get your input. Not sure if we need to loop @mschubert in or not.

As you know, for one of my current projects I am using the "mostly local, sometimes remote" approach - my project lives on my local machine, but some computationally intensive tasks are selectively sent to the HPC via SSH thanks to clustermq. This works great.

However, when using options(clustermq.scheduler = "ssh"), you have only two options: run jobs locally and sequentially in the "main" process, or send them via ssh. The majority of the tasks run in the "main" R process and are forced to run sequentially, all for the ability to send a few select jobs to the HPC.

So long story short, I am wondering if it would somehow be possible to use "multiprocess" for targets with deployment = "main" and "ssh" for targets with deployment = "worker". I know this is a convoluted use-case, but I am actually constrained to using this workflow for this particular project and was just wondering if something like that could possibly work.

Reasons why I don't just run everything via ssh:

  1. Some of the tasks are trivial and quick, and the overhead of sending them to the HPC over sockets is unnecessary

  2. Some of the targets rely on the local NFS for access to files which cannot be moved to the cluster or cloud

Reasons why I don't just run everything locally:

  1. Memory constraints on my local machine

  2. There are fewer computationally intensive tasks, but they take days to run sometimes

@mattwarkentin
Contributor Author

Maybe withr could help? Run main tasks with one set of clustermq options, and worker tasks with another...

Just thinking out loud

@wlandau
Member

wlandau commented Oct 19, 2020

This is a rare edge case, and trying to build something directly into targets would break a lot of existing infrastructure. So just like #196 and #193, targets itself is not going to change to accommodate this.

For your use case, it seems like what we really want is heterogeneous transient workers, a problem best suited for the future and clustermq packages. There has been some recent discussion about https://github.com/HenrikBengtsson/future.clustermq, which Henrik recently made public, but it is still not ready for serious use. That might be a way to handle this.

@wlandau wlandau closed this as completed Oct 19, 2020
@mattwarkentin
Contributor Author

mattwarkentin commented Oct 19, 2020

For sure this is an edge case. I didn't think targets should change to accommodate this, but I was hoping to discuss whether it's possible with existing infrastructure. This actually works, but it's a bit hacky. I had to set callr_function = NULL or else it errored out; not exactly sure why.

library(targets)

tar_script({
  tmpl <- list(
    job_name = "name",
    partition = "partition",
    node = "node"
  )
  
  tar_option_set(
    resources = tmpl,
    deployment = "worker",
    storage = "main",
    retrieval = "main"
  )
  
  tar_pipeline(
    tar_target(id, 1:5),
    tar_target(
      foo,
      Sys.sleep(id),
      pattern = map(id)
    ),
    tar_target(
      bar,
      Sys.sleep(id),
      pattern = map(id)
    )
  )
})

# Run certain targets as multiprocess
withr::with_options(
  list(clustermq.scheduler = "multiprocess"),
  tar_make_clustermq(
    names = foo,
    workers = 2L,
    callr_function = NULL
  )
)

# Run others as HPC via SSH
withr::with_options(
  list(clustermq.scheduler = "ssh", clustermq.ssh.host = "USER@DOMAIN"),
  tar_make_clustermq(
    names = bar,
    workers = 2L,
    callr_function = NULL
  )
)

@wlandau
Member

wlandau commented Oct 19, 2020

Seems like that should work (even with the default callr function). I have done something similar with a pipeline in which different groups of targets have different sets of performance tradeoffs. An alternative may be

@wlandau
Member

wlandau commented Oct 19, 2020

Just thought of a super simple way to get around this in targets: allow target-specific future::plan()s through the resources argument of tar_target(). Relies on futureverse/future#263 (comment) (thanks @HenrikBengtsson). Now implemented in 1702325 and 05a8472. Try it out:

# _targets.R:
library(targets)
future::plan(future::sequential)
tar_pipeline(
  tar_target(x, seq_len(4)),
  tar_target(
    y,
    Sys.sleep(30),
    pattern = map(x),
    resources = list(plan = future.callr::callr)
  )
)
# R console:
library(targets)
tar_make_future(workers = 4)

The catch here is that it won't actually help you with clustermq until https://github.com/HenrikBengtsson/future.clustermq is ready to use. And even then, different transient workers will have different common data, which could add overhead. Right now, the only way to get on a cluster with future is with future.batchtools, which tends to be slow in my experience.

@wlandau
Member

wlandau commented Oct 19, 2020

Not sure why multisession futures don't work this way though. I will post details to another thread.

@mattwarkentin
Contributor Author

mattwarkentin commented Oct 19, 2020

Cool, thanks for sharing. I haven't really ever taken the time to develop a good mental model for future, which is why I default to using clustermq - I just really like its API and I'm familiar with it.

Related set of questions:

  • Is there any compelling reason to possibly want to have the "main" or "worker" processes run in R sessions with special options declarations?
    • In other words, is there any possible desire/reason for tar_option_set(withr_options = ...) or tar_target(withr_options = ...) or something like that...

Obviously my rare use-case fits in, but wondering about whether it could be more generally useful.

  • Is there any option inheritance going on when the "main" and "worker" processes spawn from the...calling process (for lack of a better term)? Or are the child processes only inheriting options defined in .Rprofile?

@wlandau
Member

wlandau commented Oct 19, 2020

I haven't really ever taken the time to develop a good mental model for future, which is why I default to using clustermq - I just really like its API and I'm familiar with it.

Admittedly, I only use it for parallel processing on non-lazy transient workers. It does go a lot deeper in terms of the abstraction and asynchronicity.

Is there any compelling reason to possibly want to have the "main" or "worker" processes run in R sessions with special options declarations?

I'm not sure what you mean. tar_option_set() sets the defaults, and the arguments to tar_target() override them on a target-by-target basis. So you could set tar_option_set(deployment = "main") for the majority of targets and then call tar_target(deployment = "worker") for a small number of HPC targets. This is useful because, as you have observed, for some targets there is no point in using HPC at all.
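As a rough illustration of the default-plus-override pattern just described, here is a minimal sketch of a _targets.R file. The target names quick_summary and heavy_fit are hypothetical placeholders, not part of the original pipeline. The sketch also assumes a recent version of targets, where the script ends with a list of targets; the version used elsewhere in this thread wrapped them in tar_pipeline() instead.

```r
# _targets.R (sketch, not the pipeline from this thread)
library(targets)

# Default for every target: run in the main process.
tar_option_set(deployment = "main")

pipeline <- list(
  tar_target(id, 1:5),
  # Hypothetical quick target: inherits deployment = "main".
  tar_target(quick_summary, sum(id)),
  # Hypothetical heavy target: overrides the default and goes to a worker.
  tar_target(
    heavy_fit,
    Sys.sleep(id),
    pattern = map(id),
    deployment = "worker"
  )
)

# The script's last value must be the list of targets.
pipeline
```

With a file like this, tar_make_clustermq() would send only heavy_fit to workers, using whatever clustermq scheduler is configured, while the other targets run in the main process.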

Is there any option inheritance going on when the "main" and "worker" processes spawn from the...calling process (for lack of a better term)? Or are the child processes only inheriting options defined in .Rprofile?

I wouldn't count on global options with the options() function carrying over to HPC workers unless you set them in a local .Rprofile at the project root.
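To see why options() values should not be relied on across processes, here is a small local sketch in base R. It is only an analogy for HPC workers: it starts a fresh Rscript process on the same machine, and the option name my.demo.option is made up for the demonstration. The --vanilla flag keeps the child from reading any .Rprofile, so the only way it could see the option is by inheriting it from the parent session.

```r
# Set an option in the current (parent) session.
# "my.demo.option" is a made-up name used only for this demonstration.
options(my.demo.option = "set in parent")

# Locate the Rscript binary that belongs to this R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# Ask a fresh R process whether it can see the option.
# --vanilla skips .Rprofile and saved workspaces.
child_code <- 'cat(is.null(getOption("my.demo.option")))'
child_value <- system2(
  rscript,
  c("--vanilla", "-e", shQuote(child_code)),
  stdout = TRUE
)

# The parent sees the option; the child reports that it is NULL.
print(getOption("my.demo.option"))
print(child_value)
```

The child process starts with a clean set of options, which is why a project-level .Rprofile is the more dependable place for settings that workers need.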
