[Core]: (2/N) Support prefill only models by Workflow Defined Engine - Prefill only attention #9124
Previously
etc. PTAL [RFC]: Support encode only models by Workflow Defined Engine #8453
What is this PR going to do?
This PR focuses on the following new features
Proposed Change.
Prefill only attention backend implementations
Adding the atten parameter
Why do you need this?
How to support enabling bidirectional attention
The following is not the focus of this PR, but we will need to discuss it sooner or later, so it is best to record it here first.
In the proof-of-concept branch #8452
1. https://github.com/noooop/vllm/blob/653794e37db5af6a8951d927f2a231a67531bea0/vllm/wde/retriever/modelzoo/gte_qwen/arg_utils.py#L23-L25
2. https://github.com/noooop/vllm/blob/653794e37db5af6a8951d927f2a231a67531bea0/vllm/wde/decode_only/workflow.py#L15C1-L17C47
3. https://github.com/noooop/vllm/blob/653794e37db5af6a8951d927f2a231a67531bea0/vllm/wde/prefill_only/layers/attention/selector.py#L51-L59
4. https://github.com/noooop/vllm/blob/653794e37db5af6a8951d927f2a231a67531bea0/vllm/wde/core/llm_engine.py#L83C1-L84C57
5. https://github.com/noooop/vllm/blob/653794e37db5af6a8951d927f2a231a67531bea0/vllm/wde/prefill_only/runner/model_runner.py#L35C1-L50C1
6. https://github.com/noooop/vllm/blob/653794e37db5af6a8951d927f2a231a67531bea0/vllm/wde/decode_only/modelzoo/qwen2.py#L313C1-L319C15
7. https://github.com/noooop/vllm/blob/653794e37db5af6a8951d927f2a231a67531bea0/vllm/wde/decode_only/modelzoo/qwen2.py#L141C1-L147C57
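For readers who want the distinction spelled out, the difference between causal and bidirectional attention can be sketched as a mask. This is a plain-Python illustration only, not the code behind the links above; `attention_mask` and its `bidirectional` parameter are hypothetical names chosen for this sketch.

```python
def attention_mask(seq_len: int, bidirectional: bool) -> list:
    """Build a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j.
    Causal attention allows only j <= i; bidirectional attention allows all j.
    (Hypothetical helper, not part of vLLM.)
    """
    return [
        [bidirectional or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for name, flag in (("causal", False), ("bidirectional", True)):
        print(name)
        for row in attention_mask(4, bidirectional=flag):
            # Print 1 where attention is allowed, 0 where it is masked out.
            print(" ".join("1" if allowed else "0" for allowed in row))
```

With a causal mask the result is lower triangular; with the bidirectional mask every position sees the whole sequence, which is what encoder-style models expect.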
Brief introduction
What new models need to be supported
These models were all requested in issues and are also well known:
These models are roughly divided into three categories:
What new features these new models have
What the above three categories have in common is that they have only a prefill stage. To make the terminology more precise, "prefill only" is used below.
You can think of prefill only as a fancier way of writing encode only.
New features:
How the engine architecture needs to support these features flexibly and efficiently
If we add new functions directly to existing modules, those modules become increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.
The most flexible and efficient way to support the prefill only models is to implement different modules for models of different architectures and load the required modules on demand.
I call this architecture Workflow Defined Engine, or WDE for short.
I divided the Engine into the following modules.
With WDE, there is no need for one module to be compatible with all functions. You can use Python's dynamic loading to load different modules at the highest level, for different models and different needs.
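As a rough illustration of the dynamic loading idea, a workflow can be a plain record of import paths that are resolved only when the engine is built. The sketch below is not the code in this PR: `Workflow`, `lazy_import`, `build`, the slot names, and the `"module:attribute"` path format are all assumptions made for the example, and standard-library callables stand in for real engine modules so that it runs anywhere.

```python
import importlib
from dataclasses import asdict, dataclass


def lazy_import(path: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr = path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


@dataclass(frozen=True)
class Workflow:
    # Hypothetical slots. A real engine would have one per module it needs
    # (for example scheduler, model runner, attention backend).
    tokenizer: str
    output_format: str


# Standard-library targets stand in for engine modules in this sketch.
PREFILL_ONLY_WORKFLOW = Workflow(
    tokenizer="shlex:split",
    output_format="json:dumps",
)


def build(workflow: Workflow) -> dict:
    """Import only the modules this workflow names, at build time."""
    return {slot: lazy_import(path) for slot, path in asdict(workflow).items()}


if __name__ == "__main__":
    modules = build(PREFILL_ONLY_WORKFLOW)
    tokens = modules["tokenizer"]("hello prefill only")
    print(modules["output_format"](tokens))
```

Because each workflow names its own modules, a module written for one kind of model never has to carry branches for another.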
Some models cannot use the common workflow. When you don't know where to put the dirty code, you can always create a new workflow and link the model architecture to it, which avoids leaving dirty code everywhere for the sake of compatibility.
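One way to picture linking a model architecture to its own workflow is a small registry with a common default. This is a minimal sketch under assumed names: `register_workflow`, `resolve_workflow`, the architecture strings, and the workflow identifiers are all hypothetical and are not taken from the PR.

```python
# Hypothetical identifier for the workflow most models share.
DEFAULT_WORKFLOW = "common_prefill_only_workflow"

# Maps a model architecture name to the workflow it should use.
_WORKFLOW_BY_ARCH = {}


def register_workflow(arch: str, workflow: str) -> None:
    """Link a model architecture to a dedicated workflow."""
    _WORKFLOW_BY_ARCH[arch] = workflow


def resolve_workflow(arch: str) -> str:
    """Return the dedicated workflow if one is registered, else the default."""
    return _WORKFLOW_BY_ARCH.get(arch, DEFAULT_WORKFLOW)


# A model that needs special handling gets its own workflow, so its
# special cases stay out of the common code path.
register_workflow("SpecialModel", "special_model_workflow")

if __name__ == "__main__":
    print(resolve_workflow("SpecialModel"))
    print(resolve_workflow("OrdinaryModel"))
```

The special handling stays inside the dedicated workflow, and every other architecture keeps resolving to the common one.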