[Performance]: guided generation is very slow in offline mode #8313
Comments
@stas00 In my experience, guided generation is always slower than normal. I recommend you try
Thank you for this suggestion, @Quang-elec44 - I understand that it'll be slower, but it should be marginally slower, not 20x slower. Possibly some problem in the integration?
@stas00 is this a new issue on v0.6.0?
@stas00 - looks like someone has just fixed the issue:
Hey @Quang-elec44 - this is actually not true as of vllm's current release: https://blog.vllm.ai/2024/09/05/perf-update.html We also have more performance optimizations that will come out in v0.6.0:
no, same with older versions - e.g. 0.5.5
Robert, I have already tried it to no avail. I'm working on a reproducible test case - will share as soon as I have it.
@stas00 Have you tried (or could you try) vllm v0.5.2 to see whether you observe the same performance issue? In my investigation, I found a performance regression from v0.5.2 -> v0.5.3 which this PR fixes. However, in my benchmarks, there seems to be another significant performance regression in guided generation with Outlines from v0.5.3.post1 to v0.5.4 that I have not investigated yet. Note that in my tests, I only tested in online mode.
@Lap1n, I can't try v0.5.2 since it didn't support guided generation in offline mode. But I think the problem here is something entirely different - I'm trying to dig to the root of it.

As I'm writing the offline repro scripts w/ TinyLlama-1.1B-Chat-v0.6, vllm is only 10% slower w/ outlines than w/o them once the FSM has been compiled - which is absolutely normal and expected. So something else is going on - I will update once I have a clearer picture.

It's possible that there is a multitude of issues here, and that the big overhead is somehow model-dependent (even though in theory the type or size of a model shouldn't matter at all overhead-wise).
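To make that "10% slower once the FSM is compiled" comparison concrete, here is a minimal sketch of the kind of offline benchmark being described - not the actual repro script from this thread. It uses the newer `SamplingParams(guided_decoding=GuidedDecodingParams(...))` API from recent vLLM releases; the offline guided-decoding API in v0.6.0 was different, so treat the exact calls as assumptions.

```python
# Minimal offline throughput comparison: guided (outlines JSON schema) vs.
# unguided decoding on the same prompt. Illustrative only; API names assume
# a recent vLLM release, not v0.6.0.
import json
import time

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
})

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v0.6")
prompt = "Describe a person as a JSON object."

def tok_per_sec(params: SamplingParams) -> float:
    start = time.perf_counter()
    out = llm.generate([prompt], params)[0]
    return len(out.outputs[0].token_ids) / (time.perf_counter() - start)

unguided = SamplingParams(max_tokens=256)
guided = SamplingParams(
    max_tokens=256,
    guided_decoding=GuidedDecodingParams(json=schema),
)

print(f"unguided:              {tok_per_sec(unguided):6.1f} tok/s")
print(f"guided (compiles FSM): {tok_per_sec(guided):6.1f} tok/s")
print(f"guided (FSM cached):   {tok_per_sec(guided):6.1f} tok/s")
```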
@robertgshaw2-neuralmagic, do you know why
w/o warmup we get about 6 tok/sec due to the FSM setup overhead, while with warm-up it's 166 tok/sec? Re-running it a 2nd time w/o warm-up repeats the same recompilation - shouldn't that be pulled from the cache and re-used, rather than recompiled?
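For reference, the warm-up being discussed is just issuing one throwaway guided request before timing, so the FSM compilation cost is paid up front. A sketch, reusing the (hypothetical) `llm` / `guided` / `prompt` names from the snippet above:

```python
# Warm-up pattern: the first guided request triggers schema -> regex -> FSM
# compilation; subsequent requests with the same schema reuse the compiled FSM
# for the rest of the process's lifetime (but not across separate runs).
_ = llm.generate(["warm-up"], guided)          # pays the compilation cost

start = time.perf_counter()
out = llm.generate([prompt], guided)[0]        # now measures decode speed only
elapsed = time.perf_counter() - start
print(len(out.outputs[0].token_ids) / elapsed, "tok/s after warm-up")
```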
so I'm waiting for my colleague to give me a repro case so that I can narrow it down for him. Meanwhile, why is this pure
I wrote it first to check the standalone behavior, but actually discovered that it's very slow! On A100:
that's a 40x difference! Perhaps this is what we are hitting, but in reverse - vllm+outlines is 40x faster than outlines on its own. edit: I found the issue with the direct
now it's 2x slower than the vllm integration:
not sure why this time.
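For anyone following along, a standalone outlines run of the kind being compared here looks roughly like this - a sketch assuming outlines ~0.0.46, where `outlines.models.transformers` and `outlines.generate.json` are the entry points. The prompt, schema, and model are stand-ins, not the ones from the missing snippets above.

```python
# Pure-outlines JSON generation, no vLLM involved: load a HF model through
# outlines and constrain decoding with a JSON schema.
import json
import time

import outlines

schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
})

model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v0.6",
                                     device="cuda")
generator = outlines.generate.json(model, schema)

start = time.perf_counter()
result = generator("Describe a person as a JSON object.", max_tokens=128)
print(result, f"({time.perf_counter() - start:.2f}s)")
```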
The other mismatch I noticed is that vllm strips the spaces between elements of the JSON:
I guess that saves a few tokens from needing to be generated w/o compromising the output structure - nice!
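If it helps anyone reproduce that observation: outlines' schema-to-regex converter takes a `whitespace_pattern` argument, and passing an empty pattern yields a regex that only admits compact JSON with no spaces between elements, which would explain the stripped output. A sketch, assuming outlines ~0.0.46 where `build_regex_from_schema` lives in `outlines.fsm.json_schema`:

```python
# Compare the regex that allows optional whitespace with one that forbids it.
import json

from outlines.fsm.json_schema import build_regex_from_schema

schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
})

loose = build_regex_from_schema(schema)                           # default: whitespace allowed
compact = build_regex_from_schema(schema, whitespace_pattern="")  # no spaces between elements

print("loose  :", loose)
print("compact:", compact)
```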
ok, I see that vllm isn't using
this should be cached for subsequent requests within the same run! To see that it recompiles, run my script here: #8313 (comment)
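To illustrate the kind of reuse being asked about (purely a sketch, not vLLM's actual code path): the schema -> regex -> FSM compilation only needs to happen once per schema within a run, so memoizing it on the schema string lets every later request with the same schema skip the expensive step.

```python
# Hypothetical per-process cache keyed by the schema string; only the first
# request per schema pays the compilation cost. Persisting across runs would
# additionally require an on-disk cache.
from functools import lru_cache

from outlines.fsm.json_schema import build_regex_from_schema

@lru_cache(maxsize=128)
def regex_for_schema(schema_str: str) -> str:
    # Expensive on the first call, free on every subsequent call with the same key.
    return build_regex_from_schema(schema_str)
```

The same idea applies one level down, to the token-level FSM built from the regex.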
The problem here is that if the generator runs out of max_tokens, the json structure isn't closed - i.e. it's missing, say, the very last closing token.

edit: the problem with the repro code is now defined here: #8350
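A small illustration of that failure mode (the helper name is made up): when generation stops on max_tokens rather than on the FSM reaching an accepting state, the guided output is valid so far but unclosed, so a downstream `json.loads` fails.

```python
# Guard for truncated guided output: vLLM reports finish_reason == "length"
# when max_tokens was exhausted, which with guided decoding usually means the
# JSON was cut off before its closing brace.
import json

def parse_guided_json(text: str, finish_reason: str) -> dict:
    if finish_reason == "length":
        raise ValueError(f"guided output truncated by max_tokens: {text!r}")
    return json.loads(text)
```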
@stas00 - Thank you for the detailed analysis here! QQ - are you planning to open up a PR to fix this? We would definitely appreciate a contribution if you have the bandwidth.
The summary of things so far:
For the bigger issue that started this thread (in the OP), I'm blocked waiting for my colleague to give me a larger repro test case, so that I can verify it and, if confirmed, reduce it to a small repro case - so please bear with me until they come through.
Thanks for your information. I know that
Any progress?
We are experiencing the same issues on our side; guided generation is a great feature, but it is very slow in offline mode.
Proposal to improve performance
With a single request / online mode I'm getting:

- `outlines`: 150 tok/sec (2x slower)
- `lm-format-enforcer`: 90 tok/sec (~3x slower)

With offline mode I get:

- `outlines` is about 10-20x slower than no guided generation
- `lm-format-enforcer` is about 4x faster than `outlines` (note that it is slower than `outlines` for online)

For online I was using this schema:

For offline I was using an even simpler schema:

The huge performance hit in offline mode is very strange for both backends.

The 2x slowdown in online mode is pretty bad too, as it's already a huge impact. The offline mode can actually tolerate 2x without a problem since there is no human in the loop, but 10-20x is way impractical.

vllm==0.6.0 and outlines==0.0.46
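For completeness, here is a rough sketch of how such an online-mode comparison can be driven against the OpenAI-compatible server, switching between the two backends per request via `extra_body`. The schema and model name below are placeholders, not the ones behind the numbers above.

```python
# Request the same JSON-constrained completion from a running vLLM server,
# once per guided-decoding backend, to compare their speed.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

for backend in ("outlines", "lm-format-enforcer"):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="TinyLlama/TinyLlama-1.1B-Chat-v0.6",  # whatever the server is serving
        messages=[{"role": "user", "content": "Describe a person as JSON."}],
        max_tokens=256,
        extra_body={"guided_json": schema, "guided_decoding_backend": backend},
    )
    n_tokens = resp.usage.completion_tokens
    print(f"{backend}: {n_tokens / (time.perf_counter() - start):.1f} tok/s")
```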