Automatic Refinement of Memory Allocations #276

Open
carlwitt opened this issue Mar 12, 2019 · 3 comments

Comments

@carlwitt
Contributor

In the paper I wrote with @jvansanten [1], I estimated that throughput could be increased by up to 40% in scenarios where memory is the bottleneck. The idea is to measure peak memory usage and, once enough measurements are available, replace the user estimates with allocations derived from actual resource usage.
We looked at the first 5% of the jobs with the same (data set, task_index) and then computed an optimized allocation size that balances over-sizing and under-sizing wastage.
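
To make the trade-off concrete, here is a minimal sketch of that kind of optimization. It is not the code from [2] and not necessarily the exact cost model from the paper; it assumes a simplified model in which a job that exceeds its first allocation fails, wastes that whole allocation, and is retried once with the maximum available memory. Runtimes are ignored, and the names `wastage`, `optimized_allocation`, `training_sample`, and `max_memory` are made up for illustration.

```python
def training_sample(peaks_in_submission_order, fraction=0.05):
    """Return the first `fraction` of the observed peaks (at least one)."""
    n = max(1, int(len(peaks_in_submission_order) * fraction))
    return peaks_in_submission_order[:n]


def wastage(allocation, peaks, max_memory):
    """Total wasted memory if every job first runs with `allocation`.

    Assumed cost model (a simplification, see the text above):
    - peak <= allocation: the unused part of the allocation is wasted.
    - peak >  allocation: the failed attempt is wasted entirely, and the
      retry with `max_memory` wastes whatever the job does not use.
    """
    total = 0.0
    for peak in peaks:
        if peak <= allocation:
            total += allocation - peak
        else:
            total += allocation + (max_memory - peak)
    return total


def optimized_allocation(peaks, max_memory):
    """Pick the observed peak value that minimizes total wastage.

    Only observed peaks are tried as candidates: any value strictly
    between two observed peaks wastes more than the lower of the two
    without letting any additional job succeed.
    """
    candidates = sorted(set(peaks))
    return min(candidates, key=lambda a: wastage(a, peaks, max_memory))
```

For example, with observed peaks of `[1.0, 1.2, 1.1, 1.3, 4.0]` (in GB) and `max_memory=8.0`, the sketch picks 1.3 rather than 4.0: letting the single outlier fail and retry is cheaper under this model than over-allocating the other four jobs.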

@dsschult, is this something you would consider for integration into the production system? There's a simple implementation of the optimization method in Python [2]; it needs no special Python modules and is computationally very quick. I've been playing with extensions of this method, but found that the basic implementation already works quite well.

If yes, I could allocate one or two days to implement and test this. The most complex part would probably be orchestrating the whole process: retrieving past measurements, swapping out estimates for optimized allocations, and handling edge cases (and finding my way around the code base).
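
As a rough illustration of that orchestration step, the sketch below groups past measurements by (data set, task_index) and only swaps out a user estimate once enough measurements exist. Everything here is hypothetical: the function name `refine_requests`, the tuple layout of `measurements`, the `min_samples` threshold, and the `optimize` callback are assumptions for illustration, not IceProd APIs.

```python
from collections import defaultdict


def refine_requests(measurements, user_estimates, optimize, min_samples=20):
    """Return a memory request per (dataset, task_index) key.

    measurements:   iterable of (dataset, task_index, peak_memory) tuples
    user_estimates: dict mapping (dataset, task_index) to the user estimate
    optimize:       callable that maps a list of peaks to an allocation
    min_samples:    hypothetical threshold before an estimate is replaced
    """
    peaks_by_task = defaultdict(list)
    for dataset, task_index, peak in measurements:
        # Edge case: skip missing or obviously bogus measurements.
        if peak is not None and peak > 0:
            peaks_by_task[(dataset, task_index)].append(peak)

    requests = dict(user_estimates)  # default: keep the user estimate
    for key, peaks in peaks_by_task.items():
        # Edge case: too few measurements, or a task without an estimate.
        if key in requests and len(peaks) >= min_samples:
            requests[key] = optimize(peaks)
    return requests
```

Passing the optimizer in as a callback keeps the data gathering separate from the allocation method, so the method can be replaced without touching the bookkeeping.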

[1] @carlwitt, @jvansanten, Ulf Leser: "Learning Low-Wastage Memory Allocations for Scientific Workflows at IceCube", submitted to HPCS 2019
[2] https://github.com/cooperative-computing-lab/efficient-resource-allocations

@dsschult
Collaborator

@carlwitt, yes, this is something I'd be interested in integrating. I would need to know more about how it works to tell you the right spot, though my guess is as part of iceprod/server/scheduled_tasks.

The only big problem I see is licensing. The license in [2] is GPL, so it cannot be incorporated into IceProd.

@jvansanten
Contributor

Just my 2 cents: I don’t know if there’s a huge point in actually using the Tovar code, given the infectious license terms. The algorithm is well described in the paper, and all the complexity is in gathering the input data.

@carlwitt
Contributor Author

@dsschult: Great to hear! I'll get back to you at the end of March or in early April, as soon as I'm back from my next business trip.
@jvansanten: I agree, writing a new implementation is not a big deal, and I already have some code in place from my experiments.

Cheers,
Carl
