Cannot find symbols exported to node by parallel::clusterExport #339
Comments
It's because the worker's global environment is intentionally wiped before (and after if
There's a documented "misfeature" that allows you to disable this (see argument
Having said that, what you're asking for suggests that there might be room for a way to control the default, initial state of futures, e.g. global variables, options, and env vars that are always set. I'll add it to the list of feature requests.
Sticky globals

BTW, you can also do:

library(future)
cl <- makeClusterPSOCK(2)
plan(cluster, workers = cl)
# Export "sticky" globals to all workers
test1 <- rnorm(10000)
my_globals <- list(test1 = test1)
parallel::clusterExport(cl, "my_globals")
dummy <- parallel::clusterEvalQ(cl, { attach(my_globals, name="my_globals"); rm(my_globals); })

With this, you'll see:

> s %<-% search()
> s
[1] ".GlobalEnv" "my_globals" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "toolbox:default" "package:methods"
[10] "Autoloads" "package:base"
> y %<-% sum(test1)
> y
[1] -211.5222
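
The difference between a plain clusterExport() and the attach() trick can be reproduced with the parallel package alone. The sketch below is only an illustration: the rm(list = ls(...)) call is an assumed stand-in for the wiping that future performs on a worker, not future's actual code, and the names big and sticky are made up for the example.

```r
library(parallel)

cl <- makeCluster(2)

# A plain export lands in each worker's global environment
big <- c(1, 2, 3)
clusterExport(cl, "big")

# A "sticky" export: ship a list, attach it to the search path, drop the list
sticky <- list(test1 = c(10, 20, 30))
clusterExport(cl, "sticky")
invisible(clusterEvalQ(cl, {
  attach(sticky, name = "sticky")
  rm(sticky)
}))

# Assumed stand-in for the wipe of the worker's global environment
invisible(clusterEvalQ(cl, {
  rm(list = ls(envir = globalenv()), envir = globalenv())
}))

# Check what each worker can still see
has_big   <- unlist(clusterEvalQ(cl, exists("big")))
has_test1 <- unlist(clusterEvalQ(cl, exists("test1")))
total     <- unlist(clusterEvalQ(cl, sum(test1)))

stopCluster(cl)

print(has_big)    # FALSE on both workers: wiped with the global environment
print(has_test1)  # TRUE on both workers: lives on the search path instead
```

Because the attached values sit on the search path rather than in the global environment, they are out of reach of anything that only clears the global environment.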
@renkun-ken, let's revisit sticky globals. There's actually a non-exported rudimentary prototype of this in future 1.17.0. The following illustrates how it can be used right now:

library(future)
## Set up PSOCK workers with sticky globals
cl <- makeClusterPSOCK(2)
test1 <- rnorm(n=10000)
future:::clusterExportSticky(cl, "test1")
plan(cluster, workers=cl)
a <- 42
f <- future({
  sum(a * test1)
}, globals=structure(TRUE, ignore="test1"))
v <- value(f)
print(v)
## [1] 6255.971

Note that
To convince ourselves that options(future.globals.maxSize=0.9*object.size(test1)) such that there will be an error if

options(future.globals.maxSize=0.9*object.size(test1))
f <- future({
  a*sum(test1)
}, globals = structure(TRUE, ignore="test1"))
v <- value(f)
print(v)
## [1] 6255.971

still works, while the following throws an error as expected:

f <- future({
  a*sum(test1)
})
## Error in getGlobalsAndPackages(expr, envir = envir, persistent = persistent, :
## The total size of the 2 globals that need to be exported for the future expression
## ('{; a * sum(test1); }') is 78.23 KiB. This exceeds the maximum allowed size of
## 70.35 KiB (option 'future.globals.maxSize'). There are two globals: 'test1'
## (78.17 KiB of class 'numeric') and 'a' (56 bytes of class 'numeric').

Please see if this provides the minimal basics that you need. The big challenge will be to avoid having to specify

worker <- cl[1] ## the worker allocated to the current future
## All identified globals
names <- c("a", "test1")
globals <- mget(names)
## The checksums of globals in the main R session
checksums <- vapply(globals, FUN = digest::digest, FUN.VALUE = NA_character_)
## Compare to the checksums of sticky globals on the worker
skip <- parallel::clusterCall(worker, fun = function(checksums) {
  env <- as.environment("future:sticky_globals")
  skip <- logical(length = length(checksums))
  names(skip) <- names(checksums)
  for (name in names(checksums)) {
    if (!exists(name, envir = env, inherits = FALSE)) next
    obj <- get(name, envir = env, inherits = FALSE)
    checksum <- digest::digest(obj)
    skip[name] <- (checksum == checksums[[name]])
  }
  skip
}, checksums = checksums)[[1]]

such that we get:

print(skip)
    a test1
FALSE  TRUE
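
One way such a skip vector could then be used is to drop the globals that the worker already holds before anything is exported. This is a minimal sketch with small stand-in values; the variable names are illustrative, and it is not a description of how future handles globals internally.

```r
# Globals identified for the future expression (small stand-ins for illustration)
globals <- list(a = 42, test1 = c(1.5, 2.5, 3.5))

# Result of the checksum comparison, with the same values as printed above
skip <- c(a = FALSE, test1 = TRUE)

# Export only what the worker does not already hold as a sticky global
to_export <- globals[!skip[names(globals)]]
print(names(to_export))
## [1] "a"
```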
In my case, there are tens of
After all, the very reason I need sticky globals is that there are many big objects in the global environment that should not be touched in any form at all (e.g. export, digest). Some objects are exported once and for all (to be sticky) precisely because I need low overhead before running futures.
Therefore, the minimal API that I think would work for me is this: I should be able to export a list of objects to the cluster before running any future, and those exported objects should persist across each run of futures. Futures could then run with minimal overhead (in my case, without detecting any globals) while having direct access to the globals that already exist. The overall purpose in my use case is to reduce as much overhead as possible before running futures.
I see. So, then there might be a need for sticky globals that are of class "trust-me-no-need-to-run-checksum". Such sticky globals will only be checked for their existence by name but not by checksum.

BTW, what I didn't show in the above mockup is that one could of course cache the checksums on the worker side, i.e. they only need to be calculated once. Of course, if a new sticky global with the same name is exported, then it'll have to be checksum:ed again.

Also, with mutable objects such as data.table:s, there is a risk that the sticky global is changed on the worker end. This would invalidate any checksums. What is worse, it might no longer be the same object as intended. Point is, there are lots of things that can go wrong here and my concern is that one might end up with different results when running with

Finally, ideally there would be a checksum field in the internal
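
The two ideas above, worker-side checksum caching and a "trust me" mode that checks by name only, can be sketched in a single R session using plain environments. Everything here is hypothetical: checksum_of() is a base-R stand-in for digest::digest(), and sticky_env, cache, set_sticky() and can_skip() are names invented for this illustration, not part of future.

```r
# Stand-in for digest::digest(): md5 of the serialized object (base R only)
checksum_of <- function(obj) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(serialize(obj, connection = NULL), path)
  unname(tools::md5sum(path))
}

sticky_env <- new.env()  # plays the role of the worker's sticky-globals environment
cache      <- new.env()  # checksums, computed once per (re-)export

set_sticky <- function(name, value) {
  assign(name, value, envir = sticky_env)
  assign(name, checksum_of(value), envir = cache)
}

# checksum = NULL means "trust me": only existence by name is checked
can_skip <- function(name, checksum = NULL) {
  if (!exists(name, envir = sticky_env, inherits = FALSE)) return(FALSE)
  if (is.null(checksum)) return(TRUE)
  identical(get(name, envir = cache, inherits = FALSE), checksum)
}

test1 <- c(1, 2, 3)
set_sticky("test1", test1)

r_match    <- can_skip("test1", checksum_of(test1))       # TRUE: same content
r_mismatch <- can_skip("test1", checksum_of(c(1, 2, 4)))  # FALSE: different content
r_trust    <- can_skip("test1")                           # TRUE: checked by name only
r_missing  <- can_skip("not_exported")                    # FALSE: never exported

# A worker-side change that bypasses set_sticky() leaves a stale cached checksum
assign("test1", c(9, 9, 9), envir = sticky_env)
stale <- can_skip("test1", checksum_of(test1))  # still TRUE although the value changed
```

The last two lines show the risk with mutable objects: once the sticky value changes behind the cache's back, a cached checksum still reports a match.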
In the following example, I try to export certain variables to each cluster node before any future is created. However, the exported symbols cannot be found when a future is resolved.
I don't want future() to detect globals or export variables because in my use case there are tens of futures and they will be called periodically (every several minutes). Each run is time critical, so I don't want the same global variables to be detected and exported to the workers again and again.