-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Allow spilling from device memory to host memory by default #334
Comments
Personally, I don't share the same thinking. While I see your point that going OOM is terrible users will at least be notified of that. Now if we change the default so that spilling is always enabled this will most of the time allow workflows to complete, but will definitely not deliver good performance, what may discourage newcomers from using dask-cuda in general. To add to the problem, it is known that spilling is still very inefficient -- see #106 . |
It would be nice to see other opinions though. |
This is a good point, and something I thought about too. The performance / user experience tradeoff is important to consider. It might be valuable to learn about mainline Dask's experience setting defaults. I believe the spill-to-disk behavior is by default set to occur at |
I am not sure how Dask selected the 60% default threshold (happy to ask around). If I had to guess it may be because Dask defaults were configured to have Dask run without too many errors out of the box and then fine tuning could happen after a workflow was established. Perhaps this point, about whether to crash due to OOM vs complete and fine-tune should be considered. I honestly can see arguments for both. I would suggest leaving this issue open for some time to gather more feedback from other dask-cuda users. If users are always setting some device memory limit it may make sense to change to some reasonable default |
I think some of us have discussed this before, but will resurface here as well. We could consider handling OOM in Dask. Namely when a task raises a |
I think the spill and retry on |
Anecdotally, here is a report of a user setting a device memory limit of 16 GB on a 32GB V100 Azure setup. |
Though xgboost doesn’t use RMM today, which makes it a bit hard to tell what the issue is. Are we just running into memory contention between xgboost and RMM? |
Only included that example as a data point on the device memory limit threshold (though it appears they may be using 16 GB V100s in that example). That user doesn't appear to be using RMM, from what I can tell |
Sure, I guess what I'm saying is, some use cases involve a few different libraries that have different memory management techniques, when used together this can make the overall workflow more sensitive to memory usage. |
Very true. Super important to keep this in mind across the board |
At least to the XGBoost case, there's some work going on in PR ( dmlc/xgboost#5873 ) to support RMM. |
In some workflows, it's common to set a
device_memory_limit
to allow spilling from device to host memory once a memory threshold is reached. By default, thedevice_memory_limit
is configured to be the total memory of the GPU, which functionally means that spilling is off. For workflows that are not memory constrained, this is nice, as it lets us use the full memory availability without trying to spill at all.Many workflows are at least partially memory constrained, though, which then with default settings will pose significant risks of out-of-memory. It would be nice to default
device_memory_limit
to a lower percentage of GPU memory to trigger spilling rather than going out-of-memory.Anecdotally, it feels like 75 or 80% of total memory capacity might be a good default threshold. There may also be a better way to achieve spilling by default. I'm interested to hear others' thoughts.
The text was updated successfully, but these errors were encountered: