Store internal states of speculative decode #7522
vladislavkruglikov
started this conversation in
General
Replies: 2 comments 11 replies
-
@WoosukKwon @Yard1 @youkaichao @cadedaniel what are your thoughts on this?
0 replies
-
What's the motivation for adding this to the live-serving scenario? Unfortunately, we are very latency-constrained, so additional features must be justified; otherwise we risk giving up the latency reduction in live serving.
11 replies
-
I want to return the internal state of speculative decoding, such as the history of proposals and their scores, in the response alongside the completion.
To do this, I need to get the proposals and proposal scores from the model executor:
vllm/vllm/engine/async_llm_engine.py
Lines 282 to 283 in d3d9cb6
Then I would pass them to the function that updates each sequence's internal state with the outputs from the model executor:
vllm/vllm/engine/async_llm_engine.py
Lines 287 to 289 in d3d9cb6
The problem is that the list[SamplerOutput] returned by the model executor (and by the speculative worker under the hood) stores data per token, not per sequence. I could attach the proposals and proposal scores to every token's sampler output, but that would be awkward and error-prone, so I am looking for a cleaner way.
One option is to enrich the worker's return type so that it can carry some metadata in addition to the sampler output. For example, something like this would work.
So change from this
vllm/vllm/spec_decode/spec_decode_worker.py
Lines 333 to 337 in d3d9cb6
To this
This data structure could be returned by the speculative worker, then by the model executor, and finally passed to the post-processing function, which would append the proposals and proposal scores to the corresponding sequence data.
My question: what do you think about extending the return type of all workers so that it can carry additional data? Or do you have other ideas?
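To make the idea concrete, here is a minimal, self-contained sketch of what such an enriched return type and its post-processing step might look like. It is not vLLM code: `SamplerOutput` and `SequenceState` are simplified stand-ins, and `SpecDecodeMetadata`, `WorkerOutput`, and `process_worker_output` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SamplerOutput:
    """Simplified stand-in for a per-step sampler output (not the vLLM class)."""
    token_ids_by_seq: Dict[int, int]  # seq_id -> token accepted at this step


@dataclass
class SpecDecodeMetadata:
    """Hypothetical per-sequence record of one speculative decoding round."""
    proposal_token_ids: List[int]
    proposal_scores: List[float]


@dataclass
class WorkerOutput:
    """Hypothetical enriched return type: sampler outputs plus side metadata."""
    sampler_outputs: List[SamplerOutput]
    # Keyed by sequence id, so the data is attached per sequence, not per token.
    spec_decode_metadata: Dict[int, SpecDecodeMetadata] = field(default_factory=dict)


@dataclass
class SequenceState:
    """Simplified stand-in for the per-sequence data kept by the engine."""
    output_token_ids: List[int] = field(default_factory=list)
    proposal_history: List[SpecDecodeMetadata] = field(default_factory=list)


def process_worker_output(sequences: Dict[int, SequenceState],
                          output: WorkerOutput) -> None:
    """Append accepted tokens, then record each sequence's proposals once."""
    for step in output.sampler_outputs:
        for seq_id, token_id in step.token_ids_by_seq.items():
            sequences[seq_id].output_token_ids.append(token_id)
    for seq_id, meta in output.spec_decode_metadata.items():
        sequences[seq_id].proposal_history.append(meta)


# Usage: one sequence, three proposed tokens, two of them accepted.
sequences = {0: SequenceState()}
output = WorkerOutput(
    sampler_outputs=[SamplerOutput({0: 11}), SamplerOutput({0: 12})],
    spec_decode_metadata={0: SpecDecodeMetadata([11, 12, 13], [0.9, 0.7, 0.2])},
)
process_worker_output(sequences, output)
```

Keeping the metadata in a separate mapping keyed by sequence id means the per-token sampler outputs stay unchanged, and workers that do not speculate can simply leave the mapping empty.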