[Frontend][Core] Add Guidance backend for guided decoding #10217
base: main
Conversation
Signed-off-by: Loc Huynh <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Thanks @JC1DA for the great contribution!
A few other questions:
- Presumably the parallelization speedup is due to the fact that the pytorch ops involved release the gil?
- Were your outlines measurements also using the threadpool?
- It would be good to also try with the latest outlines 0.1.x if possible which is apparently much faster than < 0.1. We would want to upgrade to that too in any case.
vllm/model_executor/guided_decoding/outlines_logits_processors.py
@@ -8,7 +8,7 @@ async def get_guided_decoding_logits_processor(
     guided_params: GuidedDecodingParams,
     tokenizer) -> Optional[LogitsProcessor]:
     # CFG grammar not supported by LMFE, so we use outlines instead
-    if guided_params.backend == 'outlines' or guided_params.grammar:
+    if guided_params.backend == 'outlines':
LMFE doesn't support grammar, we should retain the existing behaviour to fall back to a different backend in this case (perhaps it could now be guidance rather than outlines).
Should we add a check at the beginning to use outlines by default?
if guided_params.grammar and guided_params.backend not in [
        'outlines', 'guidance'
]:
    guided_params.backend = 'outlines'
mask = torch.tensor(mask,
                    dtype=logits.dtype,
                    device=logits.device)
Can the allocated mask tensor be reused between calls?
@njhill I have updated the code to reuse the logits variable. Since we are no longer adding a thread pool in this PR, it should work well with in-place ops.
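To illustrate the allocation concern discussed above, here is a minimal, framework-free sketch of in-place token masking (the function and variable names are hypothetical, not the PR's actual code; the real implementation operates on a PyTorch tensor):

```python
import math

def apply_token_mask_inplace(logits, allowed_token_ids):
    """Disallow every token except those in allowed_token_ids by writing
    -inf directly into the existing logits buffer, so no new mask tensor
    is allocated on each decoding step."""
    allowed = set(allowed_token_ids)
    for token_id in range(len(logits)):
        if token_id not in allowed:
            logits[token_id] = -math.inf
    return logits

logits = [0.5, 1.2, -0.3, 2.0]
apply_token_mask_inplace(logits, {1, 3})
# logits is now [-inf, 1.2, -inf, 2.0]
```

The same idea applies with `logits.masked_fill_` on a GPU tensor: mutating the buffer in place avoids the per-call `torch.tensor(mask, ...)` allocation the reviewer asked about.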
requirements-common.txt
@@ -19,6 +19,7 @@ prometheus-fastapi-instrumentator >= 7.0.0
 tiktoken >= 0.6.0  # Required for DBRX tokenizer
 lm-format-enforcer == 0.10.6
 outlines >= 0.0.43, < 0.1
+guidance>=0.2rc
What are the requirements of guidance? Does it have compiled binaries for specific python versions or CPU architectures?
Maybe this could be an optional dependency to start with, like we do for many quantization backends
Hey @mgoin, guidance does have a fair number of dependencies, but here we mostly depend on its lower-level layer, llguidance. llguidance is compiled for Python 3.9+ on manylinux, macOS, and Windows. My understanding is that vLLM likewise supports only Linux on Python 3.9+, so I think we should be good there.
We can change this PR in the near future to just use llguidance (which has no other dependencies: https://github.com/microsoft/llguidance/blob/b5ca97b2562b720c1ff3f567bfa45956338a1864/pyproject.toml#L8). We just need to port one last function down from the Python guidance library into the Rust layer first :).
@mgoin we replaced guidance with llguidance which has no extra dependencies. Hope it is good enough to merge :)
Thanks @njhill for your quick review. Really appreciate it.
That's one reason; another is that the parser used in guidance (llguidance) is implemented in Rust and automatically releases the GIL when called, so running guidance in parallel is more efficient.
Yes, the experiments were done using the threadpool.
I haven't tested outlines 0.1.x yet, just the version currently pinned in vLLM. However, I am not focusing too much on benchmarks for this PR. The goal is to make guidance available as another guided decoding backend for the vLLM community, so people can choose what's best for them. :)
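The thread-pool pattern discussed above can be sketched as follows. This is a hedged illustration, not the PR's code: `compute_mask` and the batch structure are hypothetical stand-ins, and the parallelism only yields a real speedup when the per-sequence call releases the GIL, as llguidance's Rust parser does.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_mask(seq_state):
    # Stand-in for the per-sequence parser call. In the real backend this
    # would invoke the Rust-backed llguidance parser, which drops the GIL,
    # letting the pool's threads run truly in parallel.
    return [tok % 2 == 0 for tok in seq_state["candidate_tokens"]]

batch = [{"candidate_tokens": [0, 1, 2, 3]},
         {"candidate_tokens": [4, 5]}]

with ThreadPoolExecutor(max_workers=4) as pool:
    masks = list(pool.map(compute_mask, batch))
# masks == [[True, False, True, False], [True, False]]
```

Note that this only helps with GIL-releasing processors; as discussed later in the thread, lm-format-enforcer turned out not to be thread-safe, which is why the PR ultimately stayed single-threaded.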
I also found that lm-format-enforcer is not thread-safe: it failed some tests when the number of threads was larger than 1.
Decided to roll back to the single-threaded version so as not to break lm-format-enforcer. The PR now contains minimal changes to add llguidance as a new logits processor.
This pull request has merge conflicts that must be resolved before it can be merged.
Resolved conflict with newly merged xgrammar
This pull request extends guided decoding capabilities. The guidance backend supports regex, choice, json and grammar.

Relevant: #5245

Usage
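The usage section was truncated in the source. As a hedged illustration only, a request to vLLM's OpenAI-compatible server might select the new backend roughly like this; the field names `guided_choice` and `guided_decoding_backend` follow vLLM's existing guided-decoding request options, but treat the exact payload shape and model name as assumptions to verify against the version you run:

```python
import json

# Hypothetical request body for vLLM's OpenAI-compatible chat endpoint,
# constraining the answer to two choices via the guidance backend.
payload = {
    "model": "my-model",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Is the sky blue? Answer yes or no."}
    ],
    "guided_choice": ["yes", "no"],
    "guided_decoding_backend": "guidance",
}

print(json.dumps(payload, indent=2))
```

The same options can also be passed through an OpenAI client's `extra_body` parameter when talking to a running vLLM server.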