Add factories for logits_processors #38

Draft: wants to merge 5 commits into main
Conversation

maxdebayser (Contributor) commented:

This is a proposed solution to the guided decoding crash problem.

In vLLM there are two requests that can be used to submit more than one sequence at the same time: the completion request in the legacy OpenAI API, and the generate batch request in the tgis gRPC API. In both cases the sampling params are validated only once and a single SamplingParams object is shared among all sequences, even when they belong to different sequence groups. SamplingParams is mostly a data object, but it carries a list of logits processors that are executable. If the logits processors are stateless there is no problem. However, the CFGFSM used by the CFGLogitsProcessor has internal state that depends on the sequence generated so far and is updated at each iteration. When that state is shared across sequences it becomes inconsistent, causing a KeyError that crashes the asyncio event loop.
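
To make the failure mode concrete, here is a toy sketch (not vLLM's actual CFGLogitsProcessor, just an illustration of the assumption such a processor makes): a stateful processor tracks the tokens it has already seen, so interleaved calls from two different sequences desynchronize its state and trigger the error.

```python
class ToyStatefulProcessor:
    """Illustrative stand-in for a stateful logits processor (e.g. one wrapping a CFGFSM)."""

    def __init__(self):
        self._seen = []  # per-sequence state, advanced once per decoding step

    def __call__(self, token_ids, logits):
        # The processor assumes token_ids extends the sequence it has tracked so far;
        # when two sequences share this object, that assumption breaks.
        if token_ids[:len(self._seen)] != self._seen:
            raise KeyError(f"tracked state {self._seen} is not a prefix of {token_ids}")
        self._seen = list(token_ids)
        return logits


lp = ToyStatefulProcessor()
logits = [0.0, 0.0]
lp(["a"], logits)  # sequence 1, step 1: OK
try:
    lp(["x"], logits)  # sequence 2, step 1: state diverged
except KeyError as e:
    print("crash, as in the shared-SamplingParams case:", e)
```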

The first attempted solution added a seq_id parameter to the logits processor call so that it could manage state internally, but that changed an interface already relied on by some external libraries.

The solution proposed here is based on adding factories for stateful logits processors. The basic idea (sketched in the example below) is:

  1. We add processors and factories to the same list so that they stay in the correct order
  2. We add a logits_processors list to the SequenceGroupState object
  3. When the SequenceGroup is created, we iterate over sampling_params.logits_processors, copying the plain processors and calling the factories, to populate SequenceGroupState.logits_processors
  4. The LogitsProcessor(nn.Module) then iterates over SequenceGroupState.logits_processors instead of sampling_params.logits_processors
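
As a rough sketch of steps 1-4, using hypothetical names (LogitsProcessorFactory, build_seq_group_processors) that may not match the PR's actual classes:

```python
from typing import Callable, List, Union

# Hypothetical alias: a logits processor maps (token_ids, logits) -> logits.
LogitsProcessor = Callable[[List[int], list], list]


class LogitsProcessorFactory:
    """Marker base class: create() builds a fresh stateful processor."""

    def create(self) -> LogitsProcessor:
        raise NotImplementedError


def build_seq_group_processors(
    entries: List[Union[LogitsProcessor, LogitsProcessorFactory]],
) -> List[LogitsProcessor]:
    # Step 3: walk sampling_params.logits_processors in order, keeping the
    # stateless processors as-is and calling the factories, so each
    # SequenceGroup gets its own stateful instances while the ordering from
    # step 1 is preserved. The result would live in
    # SequenceGroupState.logits_processors (step 2) and be the list the
    # sampler iterates over (step 4).
    return [
        e.create() if isinstance(e, LogitsProcessorFactory) else e
        for e in entries
    ]
```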

Here are some diagrams to illustrate the current code structure and better visualize the proposed changes (attached images: vllm_sampling, vllm_seq_classes).

The idea is quite simple, but the execution is a bit tricky due to the nature of async code in Python, where a synchronous function on the call stack of async code cannot in turn call into an async function. In this PR I tried to support both using the LLMEngine directly and the AsyncLLMEngine used for serving.
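
As a minimal, standalone illustration of that constraint (not code from this PR): a synchronous helper on the call stack of a running event loop cannot simply block on a coroutine, because the loop is already running.

```python
import asyncio


async def fetch_guided_decoding_data():
    # Stand-in for async work (e.g. building a guided-decoding processor).
    await asyncio.sleep(0)
    return "processor"


def sync_helper():
    # Synchronous code called from within a running event loop cannot block
    # on async work: run_until_complete() refuses because the loop is running.
    try:
        return asyncio.get_event_loop().run_until_complete(fetch_guided_decoding_data())
    except RuntimeError as e:
        return f"cannot run async work from here: {e}"


async def handle_request():
    # An async entrypoint calling a sync function that itself needs async work.
    return sync_helper()


print(asyncio.run(handle_request()))
```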

@njhill, I was going to add support for returning the processors to the factory, but I realized that it's a little more complicated because only the scheduler knows when a sequence is done. Maybe we can add a callback somewhere in the scheduler where we can add the deallocation call. That might actually be required, because I realized there is another hidden bug: when sequences are preempted with the recompute policy, the state of the logits processor becomes invalid.

This allows vllm to support stateful LPs that must be
unique for each sequence.

Signed-off-by: Max de Bayser <[email protected]>
njhill (Member) left a comment:

Thanks @maxdebayser!

I feel like this could be simplified a bit: basically, just change get_lm_format_enforcer_guided_decoding_logits_processor and get_outlines_guided_decoding_logits_processor to return factories instead of LPs.
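
A hedged sketch of that suggestion (illustrative only; the real helpers in vllm/model_executor/guided_decoding/ have different signatures): the helper returns a zero-argument factory that builds a fresh stateful processor, instead of the processor itself.

```python
from typing import Callable, List


def make_stateful_processor(grammar: str) -> Callable[[List[int], list], list]:
    state = {"step": 0}  # per-sequence state lives in the closure

    def processor(token_ids, logits):
        state["step"] += 1
        return logits

    return processor


def get_guided_decoding_logits_processor_factory(grammar: str):
    # Returning a factory defers construction of the stateful processor to the
    # point where each SequenceGroup is created.
    return lambda: make_stateful_processor(grammar)
```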

Review comments (outdated, resolved) on: vllm/engine/llm_engine.py, vllm/entrypoints/openai/serving_chat.py, vllm/entrypoints/openai/serving_completion.py, vllm/sequence.py, vllm/model_executor/guided_decoding/outlines_decoding.py, vllm/model_executor/guided_decoding/__init__.py (two threads), vllm/model_executor/sampling_metadata.py
To reduce the lines of the diff

Signed-off-by: Max de Bayser <[email protected]>
maxdebayser (Contributor, Author) commented:

Thanks for the suggestions, @njhill. I didn't want to change too much of the original code, so I ended up going overboard with the async cleverness. It's much simpler now.

njhill (Member) left a comment:

Thanks @maxdebayser this looks much better!

The other thing we may need to look at is what to do when sequences are forked. I think this only applies to beam search. Is deep-copying the LPs the right thing to do? Could there be problems deep-copying arbitrary LPs?
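
As a toy illustration of where deep-copying could go wrong (purely hypothetical; real LPs may hold tokenizers, compiled FSMs, or other handles): an LP that holds a non-copyable resource makes copy.deepcopy fail outright.

```python
import copy
import threading


class StatefulLP:
    """Toy stateful processor holding a resource that deepcopy cannot handle."""

    def __init__(self):
        self.fsm_state = 0             # per-sequence state a fork would need to duplicate
        self._lock = threading.Lock()  # stands in for non-copyable handles

    def __call__(self, token_ids, logits):
        self.fsm_state += 1
        return logits


try:
    forked_lp = copy.deepcopy(StatefulLP())
except TypeError as e:
    print("deepcopy failed:", e)  # e.g. "cannot pickle '_thread.lock' object"
```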

Review comments (outdated, resolved) on: vllm/sampling_params.py (two threads), vllm/model_executor/guided_decoding/__init__.py
njhill (Member) commented on Jun 6, 2024:

@maxdebayser after addressing the simple comments above (not necessarily the pooling thing yet), maybe you could open an upstream PR? Then we can continue the discussions with others...

Signed-off-by: Max de Bayser <[email protected]>
maxdebayser (Contributor, Author) commented:

Here's the upstream PR: vllm-project/vllm#5329
