Tuning/reducing worker overhead costs #437
Hi.
For these levels of turnaround times/latencies, I think you want to stick with as bare-bones a parallelization framework as possible, where I expect something like persistent PSOCK cluster workers to be the fastest. Looking at the PSOCK code, I think one can squeeze out a little more (e.g. different serialization protocols), but it's pretty low-level with minimal overhead. In contrast, the future framework does quite a bit more when it comes to orchestrating the parallelization. Here are some sources of overhead that I can think of:

Some of the above overhead can be controlled by the developer. Important: in case someone else reads this - the overhead is basically a constant. The run-time of the actual code that we run in parallel is effectively the same for all parallelization frameworks, so for longer-running tasks the relative impact of the overhead can often be ignored.

I haven't done proper profiling, so I don't know which of these are the dominant ones here (*). I also haven't attempted to optimize any of them beyond trying not to do silly things while writing the code. A wild guess is that one might bring the current overhead down to ~50%. It might be that one can push it even further if one allows for running in a "risky" mode, i.e. dropping validation and sanity checks that protect against common and corner-case developer mistakes. However, I doubt that the orchestration taking place in the future framework will ever be able to compete with a bare-bones setup with near-zero validation and almost no condition or output handling.

(*) I'd like to get to the point where one can enable internal journaling that logs the different steps performed on a future, on the main R process, and on the R worker. This will make it possible to generate flame graphs and similar displays of the lifespan of a future. This can be particularly handy when studying the CPU utilization across multiple futures, e.g. in map-reduce calls. The internal journaling framework should be optional and should have zero overhead when not in use. The latter can be achieved by code injection during package load (the most famous example is debugme). This in turn requires some type of "pragma" syntax to be invented.

So, in summary, I'm interested in good profiling to be able to narrow down obvious things that can be improved, but several things need to be in place first for that to happen. With a good profiling framework, we'll also be able to troubleshoot and optimize other future backends, e.g. future.batchtools.
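As a purely illustrative picture of the "zero overhead when not in use" idea mentioned above, here is a minimal R sketch of a debugme-style swap done at load time; the environment variable and function names are hypothetical, not future's actual implementation:

```r
# journal() is a no-op stub unless journaling was enabled when the code was
# loaded, so disabled journaling costs essentially nothing.
journal <- function(step, ...) NULL

enable_journaling <- function() {
  if (nzchar(Sys.getenv("R_FUTURE_JOURNAL", ""))) {
    # Swap in a real logger only when requested, e.g. at package load time
    journal <<- function(step, ...) {
      cat(sprintf("%s | %s\n", format(Sys.time(), "%H:%M:%OS3"), step),
          file = stderr())
    }
  }
}

enable_journaling()
journal("future created")    # no-op unless R_FUTURE_JOURNAL is set
journal("globals resolved")
```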
Thank you for the thoughtful response, I really appreciate it. I fear I may have stumbled on a separate issue that makes these round trips much slower on Linux (CPU = AMD Ryzen 7 3700X):

```r
microbenchmark(
  `parallel:::send/recv` = {parallel:::sendCall(cl_parallel[[1]], fun = function() iris, args = list()); parallel:::recvResult(cl_parallel[[1]])},
  `r_session$call` = {rs$call(function() iris); while (is.null(rs$read())) {}},
  `future` = value(future(iris, globals = FALSE)),
  times = 100
)
# Unit: microseconds
#                  expr       min         lq     mean   median       uq      max neval cld
#  parallel:::send/recv   239.379   303.2235 19901.29 18728.37 39707.16 40193.50   100 a
#        r_session$call 39939.493 40068.1780 44392.58 40204.60 40601.45 77935.59   100  b
#                future 79647.501 79978.8595 80927.29 80002.85 80231.84 88276.01   100   c
```

But this all seemed far too slow for such small data, so I ran the same benchmarks on Windows (CPU = AMD Ryzen 7 4800H) and got very different results:

```r
# Unit: microseconds
#                  expr       min        lq       mean    median         uq       max neval
#  parallel:::send/recv   217.501   361.501   532.3019   478.751   533.3005  7326.001   100
#        r_session$call 45902.501 46942.651 51566.8450 47727.751 51883.0505 92967.701   100
#                future  8376.401  9366.501 11825.5280 11036.752 13393.6515 24709.701   100
```

Have you experienced such discrepancies when working with PSOCK connections across platforms? On the Linux side, I tested with Ubuntu 18.04 and 20.04 derivatives on a couple of different processors (including Intel) and got the same poor results.

Regarding serialization protocols, the simplest thing I could think of was to tweak `useXDR`:

```r
cl_bigendian <- makeCluster(1)
cl_lilendian <- makeCluster(1, useXDR = FALSE)
microbenchmark(
  clusterEvalQ(cl_bigendian, iris),
  clusterEvalQ(cl_lilendian, iris),
  clusterEvalQ(cl_bigendian, rep(iris, 10)),
  clusterEvalQ(cl_lilendian, rep(iris, 10)),
  clusterEvalQ(cl_bigendian, rep(iris, 100)),
  clusterEvalQ(cl_lilendian, rep(iris, 100)),
  clusterEvalQ(cl_bigendian, rep(iris, 1000)),
  clusterEvalQ(cl_lilendian, rep(iris, 1000))
)
# Unit: microseconds
#                                        expr       min         lq       mean      median          uq        max neval
#            clusterEvalQ(cl_bigendian, iris)   298.201   512.7510    668.385    641.3010    832.2510   1102.502   100
#            clusterEvalQ(cl_lilendian, iris)   294.301   489.8005    605.130    581.6015    705.4515   1104.400   100
#   clusterEvalQ(cl_bigendian, rep(iris, 10))   563.001   941.1510   1397.917   1054.6015   1245.1010  21638.101   100
#   clusterEvalQ(cl_lilendian, rep(iris, 10))   454.700   760.9015    890.981    884.5510   1063.0510   1287.401   100
#  clusterEvalQ(cl_bigendian, rep(iris, 100))  3048.901  3980.4015  14271.212   5332.9005   5544.6510 318265.501   100
#  clusterEvalQ(cl_lilendian, rep(iris, 100))  2226.801  2986.3510  16206.174   3881.2010   4108.5010 317174.301   100
# clusterEvalQ(cl_bigendian, rep(iris, 1000)) 23705.901 46664.3000 220999.680 342916.4015 348548.1010 675678.602   100
# clusterEvalQ(cl_lilendian, rep(iris, 1000)) 15136.101 20645.3010  71831.841  32377.1005  34967.1015 655633.900   100
```
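One way to see how much of the bigendian/lilendian gap comes from the XDR encoding itself, rather than from the socket round trip, is to time serialization to a raw vector in the current process. A minimal sketch (an editorial addition, not code from the thread):

```r
library(microbenchmark)

# Serialize the same payload with and without XDR, no sockets involved
x <- rep(iris, 100)
microbenchmark(
  xdr    = serialize(x, connection = NULL, xdr = TRUE),
  native = serialize(x, connection = NULL, xdr = FALSE),
  times = 100
)
```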
That's interesting; no, I haven't noticed/explored PSOCK performance differences across platforms. That (= the parallel results) might be worth bringing up on the R-devel list - it could be that someone there has a good explanation for it.

Regarding XDR: in the next release of future, all functions related to PSOCK clusters have been moved to the parallelly package, and that package will be used for setting up PSOCK clusters. Changing the default to `useXDR = FALSE` is something that can be considered there.
For the record, @jeffkeller87 posted 'parallel PSOCK connection latency is greater on Linux?' to R-devel on 2020-11-01 (https://stat.ethz.ch/pipermail/r-devel/2020-November/080060.html). The above slowness on Linux has been explained there, and suggestions for improvements have also been posted (including from one R Core member).
Some good news. I've finally gotten around to speeding up the creation of the R expression that is compiled from the future expression and then sent to the worker. It's in the develop branch. You should see a significant improvement; probably something like twice as fast compared with future 1.21.0. It might be that there's room for further improvements in expression creation - this is just the first iteration of a new approach. The gist is that previously I relied on …
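For anyone wanting to verify the speedup on their own machine, a simple check is to time a trivial future, which approximates the fixed per-future overhead, before and after installing the develop branch. This is a sketch under my own assumptions, not an official benchmark:

```r
library(future)
library(microbenchmark)

plan(multisession, workers = 2)

# Round-trip time for a trivial future ~ fixed per-future orchestration overhead
microbenchmark(
  overhead = value(future(NULL, globals = FALSE)),
  times = 100
)
```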
FYI, it seems like there has been some progress on the TCP_NODELAY front in wch/r-source@82369f7 (thanks to Katie at RStudio for pointing this out). I tried it out but didn't see any difference in latency.

```r
library(parallel)
library(microbenchmark)

options(socketOptions = "no-delay")

cl <- makeCluster(1)
(x <- microbenchmark(clusterEvalQ(cl, iris), times = 100, unit = "us"))
# Unit: microseconds
#                    expr     min       lq     mean   median       uq      max neval
#  clusterEvalQ(cl, iris) 172.297 43987.13 44104.53 43998.25 44011.02 518496.6   100
```
Thanks for the followup.
I see that it was added back on 2021-03-30. Maybe it's worth following up with a note on your https://stat.ethz.ch/pipermail/r-devel/2020-November/080060.html post regarding this, to sort out what to expect from those changes. Hopefully, it can be resolved in time for the R 4.2.0 freeze around mid-March 2022.
I misunderstood the usage of the new `socketOptions` option; setting it in the worker processes (rather than only in the main session) is what makes the difference:

```r
library(parallel)
library(microbenchmark)

cl         <- makeCluster(1)
cl_nd      <- makeCluster(1, rscript_args = "-e 'options(socketOptions=\"no-delay\")'")
cl_nd_nxdr <- makeCluster(1, rscript_args = "-e 'options(socketOptions=\"no-delay\")'", useXDR = FALSE)

(x <- microbenchmark(
  clusterEvalQ(cl, iris),
  clusterEvalQ(cl_nd, iris),
  clusterEvalQ(cl_nd_nxdr, iris),
  times = 100, unit = "us"
))
# Unit: microseconds
#                            expr    min        lq       mean     median       uq       max neval
#          clusterEvalQ(cl, iris) 137.64 42928.099 42324.3028 43636.5300 43993.72 48007.906   100
#       clusterEvalQ(cl_nd, iris) 106.89   287.115   313.6334   325.3805   367.76   432.181   100
#  clusterEvalQ(cl_nd_nxdr, iris)  93.69   259.905   285.2995   302.2800   347.33   386.740   100
```

Looks like it is now possible to have highly responsive pre-heated workers on Linux!
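Pulling the thread's ingredients together, a "pre-heated" worker setup might look like the sketch below. This is my own combination of the pieces above, not code from the thread; `big_data` and the worker count are placeholders. Workers start with the no-delay socket option and native serialization, data is exported once, and futures then run with `globals = FALSE` so nothing large is shipped per call.

```r
library(parallel)
library(future)

# Low-latency workers: no-delay sockets and native (non-XDR) serialization
cl <- makeCluster(2,
                  rscript_args = "-e 'options(socketOptions=\"no-delay\")'",
                  useXDR = FALSE)

# Pre-load the data once so it never has to be shipped per call
big_data <- do.call(rbind, replicate(50, iris, simplify = FALSE))
clusterExport(cl, "big_data")

# Let future reuse these pre-warmed workers
plan(cluster, workers = cl)

# globals = FALSE: only the expression is sent; big_data already lives on the worker
f <- future(nrow(big_data), globals = FALSE)
value(f)  # 7500
```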
That's great. Thanks for this. FWIW, with parallelly this can be passed via the `rscript_startup` argument:

```r
library(parallelly)
cl_nd_nxdr <- makeClusterPSOCK(1, rscript_startup = quote(options(socketOptions = "no-delay")))
```

BTW, I think you've got a case for adding an argument for setting R options on the workers directly, so that one can do:

```r
cl <- makeClusterPSOCK(1, rscript_options = list(socketOptions = "no-delay"))
```

but also:

```r
options(socketOptions = "no-delay")
cl <- makeClusterPSOCK(1, rscript_options = "socketOptions")
```

This is not just neater but also less error-prone than doing it via a code string. Added to the to-do list: futureverse/parallelly#70
A small update here regarding using the `rscript_startup` argument:

```r
cl <- makeClusterPSOCK(n, rscript_startup = quote(options(socketOptions = "no-delay")))
```

It turned out that some work was needed in future to get:

```r
plan(cluster, workers = n, rscript_startup = quote(options(socketOptions = "no-delay")))
plan(multisession, workers = n, rscript_startup = quote(options(socketOptions = "no-delay")))
```

to work, but it now works in future (>= 1.23.0-9002).
FYI, parallelly 1.29.0 is now on CRAN, and it sets these socket and serialization options by default, so that:

```r
cl <- parallelly::makeClusterPSOCK(n)
```

is like:

```r
cl <- parallel::makeCluster(n, rscript_args = c("-e", shQuote('options(socketOptions="no-delay")')), useXDR = FALSE)
```
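To confirm the out-of-the-box latency on a given machine, the same micro-check used earlier in the thread can be repeated against the new defaults (a usage sketch, not code from the thread):

```r
library(parallel)        # for clusterEvalQ()
library(parallelly)
library(microbenchmark)

cl <- makeClusterPSOCK(1)  # no-delay sockets and useXDR = FALSE by default in >= 1.29.0
microbenchmark(clusterEvalQ(cl, iris), times = 100, unit = "us")
```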
I have an R application with extremely high performance requirements, with compute budgets in the tens of milliseconds in total. In the past, I've tended to initialize workers with `parallel` and keep them hot (pre-loaded with data) for when they are needed. I really like that a similar consideration was made with `future`, and I've been exploring a similar tactic using `future:::clusterExportSticky` (discussed here).

I really enjoy the ease of use of `future` and how it abstracts away a lot of the details of managing a background process/session, but I noticed that its compute overhead was significantly greater than other options. Of course, the abstraction comes at some cost, but I didn't expect the cost to be this high.

I ran some benchmarks in an attempt to isolate just the data transfer between processes (see below). I compared:

- `parallel`
- `callr`
- `future`
I ran into some garbage collection issues in the main process running the benchmark, and probably also in the worker processes. In an attempt to minimize the impact of garbage collection, I ran the benchmark code via `Rscript`, set the `R_GC_MEM_GROW` environment variable to the highest available setting, and am not considering the max benchmark time because it is almost certainly tainted by garbage collection.

```sh
export R_GC_MEM_GROW=3
Rscript io_testing.R
```
For simplicity, I'm assuming that the compute cost is symmetrical, so I am attempting to create a situation where the only meaningful data transfer is the return trip from the worker back to the main R process; the up-front data transfer should just be the input command.
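The io_testing.R script itself is not reproduced in this excerpt; as a rough sketch of the kind of three-way round-trip comparison described, under assumptions about the setup that are mine rather than the author's:

```r
library(parallel)
library(callr)
library(future)
library(microbenchmark)

cl <- makeCluster(1)             # persistent PSOCK worker
rs <- r_session$new()            # persistent callr session
plan(multisession, workers = 2)  # persistent future workers

# Tiny input (the expression), iris-sized output on the return trip
microbenchmark(
  parallel = clusterEvalQ(cl, iris),
  callr    = { rs$call(function() iris); while (is.null(rs$read())) {} },
  future   = value(future(iris, globals = FALSE)),
  times = 100
)
```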
Conclusion: `future` seems to be two orders of magnitude slower than `parallel` and takes about twice as long as `callr`.

Are there options available to users of `future` to tune this overhead? If not, are there obvious opportunities to optimize the `future` code to bring its overhead closer to `parallel`? I believe that `future` uses `parallel` internally.

The above results are obviously very hardware-dependent. Below are the details regarding my R session, CPU, and RAM timings.