Confusing memory allocation system: gpu-memory-utilization
#8634
-
Hello, I'm working on a piece of software that currently only uses llama.cpp as a backend. Because of tensor parallelism and faster inference, I'm thinking about also supporting vLLM.

With llama.cpp, you pick the model, set the KV-cache quantization and the context size, and the model takes the VRAM it needs. What's confusing with vLLM is that you "let it" take a percentage of the GPU. The percentage itself isn't really the issue, since it can be turned into an absolute value; my question is whether it's possible to not specify any VRAM usage at all and instead get a "take what you need, that's my problem, not yours" behaviour, exactly like llama.cpp, where you just specify the context length.

With a local llama server, you allocate the context size yourself (ctx * n_users) and you know your VRAM usage right after booting up: you know it can take two users at a time and the VRAM usage will not increase. Sorry, but this "take x %" approach is super confusing.
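For reference, this is the kind of workaround I mean when I say the percentage can be turned into an absolute value: compute the fraction from a fixed VRAM budget and pass it to vLLM. A minimal sketch, not an official recipe; the model name, budget, and context length are placeholders.

```python
# Sketch: turn an absolute VRAM budget into vLLM's gpu_memory_utilization fraction.
import torch
from vllm import LLM

def fraction_for_budget(budget_gib: float, device_index: int = 0) -> float:
    """Convert an absolute VRAM budget (GiB) into a fraction of total GPU memory."""
    total_bytes = torch.cuda.get_device_properties(device_index).total_memory
    return min(1.0, budget_gib * 1024**3 / total_bytes)

# Example: cap vLLM at roughly 20 GiB on GPU 0 with a fixed context length,
# which is the closest analogue to llama.cpp's "set ctx, model takes what it needs".
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",      # placeholder model name
    gpu_memory_utilization=fraction_for_budget(20.0),
    max_model_len=8192,                            # analogous to llama.cpp's context size
)
```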
Replies: 1 comment
-
The PagedAttention paper will answer your question better than I can here.
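As a rough illustration of why the budget is fixed up front: with PagedAttention, the KV cache is carved into fixed-size blocks at startup, so once the memory budget is chosen, the token capacity is known immediately, which is why VRAM usage is predictable right after boot. The back-of-envelope below uses assumed model dimensions, not vLLM's actual accounting.

```python
# Rough sketch: how many tokens of KV cache fit in a given budget.
# Per token per layer, keys and values each take num_kv_heads * head_dim * dtype_bytes.

def kv_cache_capacity_tokens(
    budget_bytes: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,      # fp16/bf16 KV cache
) -> int:
    """Tokens of KV cache that fit in `budget_bytes` (keys + values across all layers)."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(budget_bytes // bytes_per_token)

# Hypothetical 8B-class model: 32 layers, 8 KV heads, head_dim 128,
# with 10 GiB of the budget left over for the KV cache after weights.
print(kv_cache_capacity_tokens(10 * 1024**3, num_layers=32, num_kv_heads=8, head_dim=128))
```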