sigpipe error when using Rscript and future #380
That smells like the problems you see when parallelizing using forked processing (which multicore uses via the 'parallel' package). I'd bet you get the same if you use forked processing directly. Forked processing is known to be unstable in various settings, e.g. it should not be used in RStudio. Whether forked processing is stable or not also depends on what is parallelized. There are many reports out there like this one. There's no magic sauce to fix this. I recommend that you use multisession instead.
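For reference, switching to multisession is a one-line change in how the plan is set up. A minimal sketch (not from the thread; package names as used elsewhere in this discussion):

```r
library(future)
library(furrr)

# plan(multisession) launches background R sessions (a PSOCK cluster)
# instead of forked child processes, so it also works on Windows and
# inside RStudio.
plan(multisession, workers = 2)

res <- future_map(1:4, ~ .x^2)
unlist(res)  # 1 4 9 16

plan(sequential)  # shut down the workers when done
```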
@HenrikBengtsson thank you so much for answering. This is very helpful. The fact is that I am not using RStudio. Could there be a conflict when the cores try to access a shared object (a data frame)? Is this why multicore processing can be unstable? Thanks!
Good, so then we can rule out RStudio.
Hard to guess. Multi-threaded code, which some packages use, is a common suspect, so if you use forked parallelization over such multi-threaded code, that could be one reason. I suggest that you confirm that you get the same problem when using forked processing without future.
What is strange, though, is that I get the same error when I use mclapply() directly.
Not strange at all; I'm leaning more and more toward deprecating forked processing.
Thank you again @HenrikBengtsson, this is very helpful. I am trying with multisession now.
Yes... as expected, multisession works.
It tries to copy only what's needed. It can't do magic; that is, you can still write code that is inefficient. Hard to say without a reproducible example. Have you considered that your original problem might be that you're also running out of memory? Have you profiled your code in sequential mode? Do you know which parts are memory hungry?
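A quick way to get that memory high-water mark in sequential mode (a sketch, using only base R's `gc()` counters; the workload here is a hypothetical stand-in):

```r
# Reset the peak-memory counters, run the sequential code, then read
# the "max used" column of gc() to see the high-water mark.
gc(reset = TRUE)

x <- lapply(1:100, function(i) rnorm(1e4))  # stand-in for the real workload

usage <- gc()
usage[, "max used"]  # peak Ncells / Vcells since the reset
```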
Hi @HenrikBengtsson, here is a short example of what I am struggling with (of course, my real-life problem uses a much more complex function that is not vectorizable over all rows directly):

```r
library(dplyr)
library(stringr)
library(furrr)
library(tictoc)

mydata <- tibble(mytext = rep('Henrik is a great programmer', times = 1000))

myfunc <- function(mytext){
  tibble(test = str_detect(mytext, 'Henrik'),
         value = 2)
}

tic()
mydata %>% mutate(myoutput = future_map(mytext, ~myfunc(.x))) %>% tail()
toc()
```

As you can see, this is embarrassingly parallelizable, and when the number of rows becomes large (say 100k and more) one must use multiprocessing because doing it sequentially is too slow. My go-to solution was to use multicore. Thanks!

EDIT 2020-06-30: Explicitly attaching all packages needed for this example to work. /Henrik
I do not think your issue is related to future at all. The problem seems to be that you have been misled by the hocus-pocus of the tidyverse approach, leading to overly complicated and inefficient code.

Just compare your code with this (only base R):

```r
mydata <- data.frame(
  mytext = rep('Henrik is a great programmer', times = 1000),
  stringsAsFactors = FALSE
)

myfunc <- function(x){
  data.frame(test = grepl('Henrik', x), value = 2L)
}

system.time(
  mydata$myoutput <- lapply(mydata$mytext, myfunc)
)
```

And the data.table way:

```r
library(data.table)

mydata <- data.table(
  mytext = rep('Henrik is a great programmer', times = 1000)
)

myfunc <- function(x){
  data.table(test = grepl('Henrik', x), value = 2L)
}

system.time(
  mydata[, myoutput := lapply(mytext, myfunc)]
)
```

If you have an embarrassingly parallel problem and you run out of RAM (hard to believe with 300GB), you have to profile your code and check which part of it uses an excessive amount of memory. It can also happen that you keep one large object in your master session that grows inefficiently, and this causes your havoc. E.g., instead of storing the results in a list and later row-binding them, you can do it efficiently in data.table:

```r
library(data.table)

out <- data.table()
for (x in mytext) {
  out <- rbind(out, myfunc(x))
}
```

You can of course split your text into larger chunks and process each chunk in a future, etc. But really, try to avoid using tidyverse for every single piece of functionality that you do in R. And if you must work with largish data and data.frame-like structures, go for data.table.
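The chunking idea can be sketched with base `split()` plus `future_map()`, so that each future handles a whole block of rows with one vectorized call rather than a single element (a sketch, not from the thread; multisession is used here for portability):

```r
library(future)
library(furrr)
plan(multisession, workers = 2)

mytext <- rep("Henrik is a great programmer", times = 1000)

# Split the input into chunks of ~100 elements; each future then
# processes a whole chunk with one vectorized grepl() call.
chunks <- split(mytext, ceiling(seq_along(mytext) / 100))
res <- future_map(chunks, function(txt) grepl("Henrik", txt))

length(unlist(res))  # 1000

plan(sequential)
```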
Thank you @tdeenes for this very interesting take! You make some really good points here. However, I do think my issue is related to future. I really like your idea of using data.table, and I will try your suggestions. Thank you!!
@randomgambit, your toy example is not really appropriate to demonstrate the efficient use of multiprocessing. First, we have fast vectorized routines implemented at the C level (e.g., grepl). If you use any kind of multiprocessing on the same host (that is, all processes run on the same machine), in general you can face several kinds of bottlenecks.
There is no general recipe for efficient parallel processing. As Henrik stated previously, you first have to identify where your current bottleneck is. If you find something which you think is a bug in R or in any of the packages that your calculations depend on, you have to create a minimal reproducible example (which of course must demonstrate the bug).
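The point about vectorized C-level routines can be made concrete with the toy data from earlier in the thread: one `grepl()` call over the whole column replaces a thousand per-row calls, with no parallel workers needed at all (an illustrative sketch):

```r
mydata <- data.frame(
  mytext = rep("Henrik is a great programmer", times = 1000),
  stringsAsFactors = FALSE
)

# One vectorized grepl() call over all rows at once; on a problem
# this small it will beat any per-row parallelization.
mydata$test  <- grepl("Henrik", mydata$mytext)
mydata$value <- 2L

head(mydata$test)  # TRUE TRUE TRUE TRUE TRUE TRUE
```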
@tdeenes @HenrikBengtsson first of all, let me thank you for your very kind and informative comments. It is very hard to find package maintainers that are both at the bleeding edge and eager to help users like me. Thanks! I will keep working on reproducing the error, and I hope you will not mind if I post another question here. Thanks!
@randomgambit, I tried to distill a smaller example from your example:

```r
library(tibble)
library(stringr)
library(furrr)
plan(multicore)

myfunc <- function(mytext) {
  tibble(test = str_detect(mytext, "Henrik"), value = 2)
}

n <- 10e3
mytext <- rep("Henrik is a great programmer", times = n)
y <- future_map(mytext, ~myfunc(.x))
```

Does the above also produce those critical errors when you increase n? BTW, you never mentioned your session details.
FWIW, regarding the sigpipe errors.
Hi @HenrikBengtsson, thank you so much for your feedback. When I try your small program, even with a larger n, I still get the errors. What is puzzling here is that I am not using a cluster, where usually a worker (on another machine) can crash. Here I am just using all the processors of my local machine (this is what multicore does, if I understand correctly). Thanks!
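As an aside, the number of local processors that a multicore plan would pick up can be inspected and capped through the future API (an illustrative sketch, not from the thread):

```r
library(future)

# How many workers would plan(multicore) use by default?
availableCores()

# Cap the number of forked workers, e.g. to leave one core free for
# the master R session.
plan(multicore, workers = max(1, availableCores() - 1))

plan(sequential)
```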
Hello there,

I apologize in advance if the question is a bit broad, but I am running into various `SIGPIPE` warnings when I use `Rscript` and `future` in multicore mode on a Linux computer (I use the `furrr` package). I would be happy to create a reproducible example, but I have not been able to yet. Do you know what could cause the issue here? Is a `SIGPIPE` warning something that can happen with multiprocessing? I am not getting any error when I run my code sequentially.

Again, sorry for the general question, but I am not sure where to start. Any hints greatly appreciated!

Thanks!