In the paper I wrote with @jvansanten [1], I estimated that throughput could be increased by up to 40% in scenarios where memory is the bottleneck. The idea is to measure peak memory usage and, after a while, replace user estimates with allocations derived from actual resource usage.
We looked at the first 5% of the jobs with the same (dataset, task_index) and then computed an optimized allocation size that balances the wastage from over- and under-sizing.
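To make the idea concrete, here is a minimal sketch of what such an optimization could look like. It assumes equal job runtimes and a single retry at a fallback size after an under-sized attempt; the function name `optimized_allocation` and the `fallback` parameter are illustrative, not taken from the paper or from [2]:

```python
def optimized_allocation(peaks, fallback):
    """Pick a first-attempt memory allocation from observed per-job peaks.

    Over-sizing wastes (allocation - peak) on every job; under-sizing
    wastes the whole first attempt and retries the job at `fallback`.
    Candidate allocations are the observed peaks themselves. Runtimes
    are assumed equal, so cost is measured in allocated memory per attempt.
    """
    peaks = sorted(peaks)
    n = len(peaks)
    # Baseline: give every job the fallback allocation up front.
    best_alloc, best_cost = fallback, n * fallback
    for i, a in enumerate(peaks):
        failures = n - (i + 1)              # jobs whose peak exceeds a
        cost = n * a + failures * fallback  # every attempt costs a; failures retry
        if cost < best_cost:
            best_alloc, best_cost = a, cost
    return best_alloc

# Example: one memory-hungry outlier among the first measured jobs (GB).
peaks = [1.9, 2.0, 2.1, 2.2, 6.5]
print(optimized_allocation(peaks, fallback=8.0))  # -> 2.2; only the outlier is retried
```

A runtime-weighted variant would multiply each attempt's cost by the job's expected runtime, but the structure stays the same: enumerate candidate sizes and pick the one with the lowest expected wastage.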
@dsschult, is this something you would consider for integration into the production system? There's a simple implementation of the optimization method in Python (it needs no special Python modules and is computationally very cheap) [2]. I've been playing with extensions of this method, but found that the basic implementation already works quite well.
If yes, I could allocate one or two days to implement and test this. The most complex part would probably be orchestrating the whole process of retrieving past measurements, swapping out estimates for optimized allocations, and handling edge cases (and finding my way around the code base).
@carlwitt, yes, this is something I'd be interested in integrating. I would need to know more about how it works to tell you the right spot, though my guess is that it belongs in iceprod/server/scheduled_tasks.
The only big problem I see is licensing: the code in [2] is GPL-licensed, so it cannot be incorporated into IceProd.
Just my 2 cents: I don’t know if there’s a huge point in actually using the Tovar code, given the infectious license terms. The algorithm is well described in the paper, and all the complexity is in gathering the input data.
@dsschult: Great to hear! I'll get back to you at the end of March/early April, as soon as I'm back from my next business trip. @jvansanten: I agree, writing a new implementation is not a big deal, and I already have some code in place from my experiments.
[1] @carlwitt, @jvansanten, Ulf Leser: "Learning Low-Wastage Memory Allocations for Scientific Workflows at IceCube", submitted to HPCS 2019
[2] https://github.com/cooperative-computing-lab/efficient-resource-allocations