[PoC]: Support encode only models by Workflow Defined Engine #8452

Draft
wants to merge 43 commits into main

Conversation

@noooop (Contributor) commented Sep 13, 2024

PTAL #8453

Brief introduction

What new models need to be supported

These models all come from issues and are also very well known:

  • xlm_roberta
  • bge-m3
  • bge-reranker-v2-m3
  • bert
  • bge v1.5 family
  • Snowflake Arctic Embed (Family)
  • gte-Qwen2
  • This list is still growing

These models fall roughly into three categories:

  • Encode-only models (bidirectional transformers, causal=False). Often fine-tuned as retrievers, rerankers, etc.
  • Decode-only models (masked multi-head attention, causal=True). There are two interesting uses:
    • Outputting the last hidden states, as a feature extractor
    • A decode-only retriever (I don't know of a better name), e.g. e5-mistral-7b (the only embedding model currently supported by vLLM)
    • Whether the model has been fine-tuned or not, there is almost no difference in the code.
  • Bidirectional-enabled models. LLM2Vec proposes a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. (The causal flag separating these categories is sketched below.)
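
Concretely, the split comes down to a single flag on the attention call. Here is a minimal sketch (my own illustration, not vLLM's implementation):

```python
# Minimal sketch: the causal flag is the only attention-level difference
# between the three categories.
import torch
import torch.nn.functional as F

def attention(q, k, v, causal: bool):
    # causal=True  -> decode-only models (masked multi-head attention)
    # causal=False -> encode-only / bidirectional-enabled models
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

q = k = v = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
encoder_out = attention(q, k, v, causal=False)  # e.g. bert, xlm_roberta
decoder_out = attention(q, k, v, causal=True)   # e.g. e5-mistral-7b
```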

What new features these new models have

What the above three categories have in common is that there is only a prefill stage. To make the terminology more precise, "prefill only" is used below.

You can think of "prefill only" as a fancy way of writing "encode only".

New features:

  1. Attention
    • Prefill-only models require a simpler attention implementation: there is no KV cache and no decode phase to consider.
    • We need to support enabling bidirectional attention, either via a manual enable_bidirectional flag or automatically from the HF config.
  2. Scheduler
    • Prefill-only models require a simpler scheduler: there is no KV cache and no preemption to consider.
    • With prefill-only models there is no dependency between tasks, so async scheduling is easy to implement (see the sketch after this list).
  3. Executor
    • To support async scheduling, the model_input_builder needs to be separated from the runner.
    • The main thread performs scheduling and all CPU processing; the GPU thread only performs H2D transfer, model execution, and D2H transfer.
    • Once async scheduling and async execution are implemented, data parallelism also becomes easy to implement. Data parallelism is more efficient for small models.
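
As a rough illustration of points 2 and 3 (hypothetical names, not this PR's actual classes): since prefill-only requests are independent, the scheduler reduces to a FIFO queue, and CPU-side input building can overlap with GPU-side execution.

```python
# Sketch of async scheduling for prefill-only models (hypothetical names).
# Requests are independent: a FIFO queue suffices (no KV cache, no
# preemption), and CPU-side input building overlaps with GPU execution.
import queue
import threading

class PrefillOnlyScheduler:
    def __init__(self):
        self.waiting = queue.Queue()

    def add_request(self, request):
        self.waiting.put(request)

    def schedule(self, max_batch=8):
        batch = []
        while len(batch) < max_batch and not self.waiting.empty():
            batch.append(self.waiting.get())
        return batch

def build_model_input(batch):
    return batch  # CPU side: tokenize, pad, build attention metadata (stub)

def run_model(model_input):
    pass  # GPU side: H2D transfer, forward pass, D2H transfer (stub)

def gpu_worker(batch_q):
    # The GPU thread only executes H2D, model execution, D2H.
    while (model_input := batch_q.get()) is not None:
        run_model(model_input)

scheduler = PrefillOnlyScheduler()
for i in range(32):
    scheduler.add_request(f"request-{i}")

batch_q = queue.Queue(maxsize=2)  # small buffer keeps the CPU one step ahead
gpu_thread = threading.Thread(target=gpu_worker, args=(batch_q,))
gpu_thread.start()

# Main thread: scheduling and all CPU processing.
while batch := scheduler.schedule():
    batch_q.put(build_model_input(batch))
batch_q.put(None)  # signal the GPU thread to stop
gpu_thread.join()
```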

How the engine architecture can support these features flexibly and efficiently

If we directly add new functions to existing modules, those modules become increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

The most flexible and efficient way to support prefill-only models is to implement different modules for different model architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.

I divided the engine into the following modules:

  • InputProcessor: LLMs take strings as input, rerankers take pairs, and multimodal model inputs are more complex...
  • OutputProcessor: Retriever (embedding) models output embeddings; reranker and classification models output scores...
  • ModelInputBuilder: Builds model inputs and attention metadata
  • AttnBackend: Supports different attention backends and enabling bidirectional attention
  • Tokenizer: There may be different tokenizers
  • Executor: Sync / async / TP / PP / DP / maybe more
  • Worker & runner: Supports different devices / maybe more
  • EngineArgs: Different models and different configs may accept different parameters
  • maybe more

With WDE, there is no need for one module to be compatible with all functions. You can use Python's dynamic loading at the highest level to load different modules for different models and different needs.

  • Modules can be configured through a Workflow, plug and play (a minimal sketch follows this list).
  • Plug-ins are supported flexibly; developers can load their own modules.
  • A Workflow is really the best place to hide dirty code.
    Some models cannot use the common Workflow. When you don't know where to put the dirty code, you can always create a new workflow and link the model architecture to it, instead of leaving dirty code everywhere for the sake of compatibility.
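
For concreteness, here is a minimal sketch of what such a Workflow could look like (all module paths and class names are hypothetical, not this PR's actual code): a declarative mapping from module roles to import paths, resolved lazily via Python's dynamic loading, plus a registry linking model architectures to workflows.

```python
# Hypothetical sketch of a Workflow: module roles mapped to import paths,
# resolved lazily, with a registry from model architecture to workflow.
import importlib
from dataclasses import dataclass

@dataclass
class Workflow:
    input_processor: str
    output_processor: str
    attn_backend: str
    executor: str

    @staticmethod
    def load_class(path: str):
        # Lazily resolve "pkg.module.ClassName" into the class object.
        module_name, _, class_name = path.rpartition(".")
        return getattr(importlib.import_module(module_name), class_name)

# Hypothetical encode-only workflow for bert-like models.
ENCODE_ONLY = Workflow(
    input_processor="wde.encode_only.InputProcessor",
    output_processor="wde.encode_only.OutputProcessor",
    attn_backend="wde.encode_only.BidirectionalAttnBackend",
    executor="wde.core.AsyncExecutor",
)

# Architectures that cannot use the common workflow get their own workflow,
# instead of if/else branches scattered through shared modules.
WORKFLOWS = {
    "BertModel": ENCODE_ONLY,
    "XLMRobertaModel": ENCODE_ONLY,
}
```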

Let's start splitting this PR and try to merge it into main.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@noooop noooop closed this Sep 13, 2024
@noooop noooop reopened this Sep 13, 2024
@noooop noooop closed this Sep 13, 2024
@noooop noooop reopened this Sep 13, 2024
@noooop noooop changed the title [RFC]: Support encode only models (xlm-roberta、bge-m3...) by Workflow Defined Engine [Core]: Support encode only models (xlm-roberta、bge-m3...) by Workflow Defined Engine Sep 13, 2024
@DarkLight1337 (Member)

This is a significant change from our current architecture. We'll consider incorporating this when we refactor our core framework.

@noooop (Contributor, Author) commented Sep 18, 2024

As vLLM supports more and more models and functions, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

With WDE, there is no need for one module to be compatible with all functions.
You can always use a workflow to load new modules at the highest level to support new functions.

I hope you like this new architecture @DarkLight1337

@DarkLight1337 (Member)

It looks nice for sure, but there are many abstractions that are difficult to adopt immediately. If you want us to use this architecture, I suggest that you split this PR into smaller chunks and gradually refactor the code base rather than doing everything all at once.

@noooop (Contributor, Author) commented Sep 18, 2024

I want to experiment with the encode-only models and integrate with the existing code only at the entrypoints, so that the impact of the modifications is minimal.

@noooop (Contributor, Author) commented Sep 18, 2024

Do you have any better suggestions? @DarkLight1337

@noooop (Contributor, Author) commented Sep 18, 2024

This PR serves as a demonstration. I can modify it and resubmit a PR.

@noooop (Contributor, Author) commented Sep 18, 2024

Don't worry.
I will keep improving this PR and supporting more models and functions
until you find a suitable opportunity to merge it.

@DarkLight1337 (Member)

Do you have any better suggestions? @DarkLight1337

Not really, just do as you have said.

@noooop noooop changed the title [Core]: Support encode only models (xlm-roberta、bge-m3...) by Workflow Defined Engine [Core]: Support encode only models by Workflow Defined Engine Sep 18, 2024
@DarkLight1337 (Member) commented Sep 26, 2024

You should list out the features in your PR and how they correspond to #8779. Otherwise, people will have to read through your whole PR to understand what is going on.

@noooop (Contributor, Author) commented Sep 26, 2024

I feel that although this PR has some similarities with #8779, their focuses are different and there is no way to compare their features one-to-one.

@DarkLight1337 (Member)

I feel that although this PR has some similarities with #8779, their focuses are different and there is no way to compare their features one-to-one.

I'd say your PR focuses on tackling the second goal in #8779 (the new architecture will be extensible and modular). You should explain in detail how this is achieved (in particular, what types of abstractions are you using?). That way, we can consider those aspects when planning how to refactor the existing code.

@noooop (Contributor, Author) commented Sep 26, 2024

Many things are simple to write in code but very complicated to explain.

Can you just have them look at the code?

@DarkLight1337 (Member)

Many things are simple to write in code but very complicated to explain.

Can you just have them look at the code?

The thing is, people don't want to look at 10k lines of code to understand what is going on. If you want them to use this code, it is your responsibility to explain it.

@noooop (Contributor, Author) commented Sep 26, 2024

OK, I'll try

@DarkLight1337 (Member)

Yeah, that should be good enough to start with.

Some models cannot use the common Workflow. When you don't know where to put the dirty code, you can always create a new workflow and link the model architecture to it, instead of leaving dirty code everywhere for the sake of compatibility.

Probably need to address this, e.g., which types of models are not supported under this workflow? If it's a large category, then we'll have to find a solution for it.

@DarkLight1337 (Member) commented Sep 27, 2024

Looks better. You should also think about how to adopt this incrementally.

@DarkLight1337 (Member)

Sorry, I was AFK; you can post it whenever you like.

@noooop (Contributor, Author) commented Sep 30, 2024

@DarkLight1337

I think the first part, supporting bert, is pretty good, but it's still 6,000 lines of code.

PTAL #8964

@noooop (Contributor, Author) commented Sep 30, 2024

6,000 lines of code to support bert; I feel like a clown.

@noooop noooop changed the title [Core]: Support encode only models by Workflow Defined Engine [PoC]: Support encode only models by Workflow Defined Engine Oct 8, 2024
@noooop (Contributor, Author) commented Oct 9, 2024

@DarkLight1337

#9166 has many similarities with Workflow Defined Engine.

Can you invite @WoosukKwon to participate in the discussion of this PR?

@DarkLight1337 (Member) commented Oct 9, 2024

I suggest you comment directly on his PR. You can also join our Slack workspace (see README) and ping him.
