[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility #16292
Ticket
N/A
Problem description
What's changed
- Added `generator_vllm.py::TtLlamaForCausalLM` for vLLM model init and execution.
- Added `_easy_trace_text` (handles trace capture and decode forward trace execution automatically for the user) to `LlamaGenerator`, and modified `decode_forward_text` (the only valid decode entry point) to call either `_easy_trace_text` or `_decode_forward_no_trace_text` depending on the `enable_trace` arg (a rough sketch of this dispatch follows the list).
- Added a `read_from_device` arg to `decode_forward_text` so vLLM can perform async output processing during decode execution, and modified `_easy_trace_text` and `_decode_forward_no_trace_text` to not read back outputs.
- Modified `llama_common.py::get_padded_prefill_len` to pad to 128 if seq len < 128, since that is the minimum required for llama3 attention (same as currently done for the llama3 demo and vision model); see the padding sketch below.
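The decode dispatch described above can be pictured roughly as follows. This is a minimal sketch based only on the PR description: argument lists are simplified, tensors are stand-ins, and helper names such as `_capture_decode_trace`, `_replay_decode_trace`, and `read_decode_output` are placeholders, not the actual tt-metal API.

```python
from typing import Optional


class LlamaGenerator:
    """Sketch of the decode entry point described in this PR (names simplified)."""

    def __init__(self, model):
        self.model = model                 # callable: (tokens, pos) -> on-device logits
        self.trace_id: Optional[int] = None  # decode trace captured on first traced call

    def decode_forward_text(self, tokens, current_pos, *,
                            enable_trace: bool = True,
                            read_from_device: bool = True):
        # Only valid decode entry point: choose the traced or non-traced path.
        run = self._easy_trace_text if enable_trace else self._decode_forward_no_trace_text
        tt_logits = run(tokens, current_pos)  # output stays on device

        if read_from_device:
            # Synchronous path: read logits back to the host before returning.
            return self.read_decode_output(tt_logits)
        # Async path: return the on-device output so the caller (vLLM) can read it
        # back later, overlapping output processing with the next decode step.
        return tt_logits

    def _decode_forward_no_trace_text(self, tokens, current_pos):
        return self.model(tokens, current_pos)

    def _easy_trace_text(self, tokens, current_pos):
        # Capture the decode trace once, then replay it on every later call.
        if self.trace_id is None:
            self.trace_id = self._capture_decode_trace(tokens, current_pos)
        return self._replay_decode_trace(self.trace_id, tokens, current_pos)

    # --- placeholder internals, stand-ins for device trace and readback calls ---
    def _capture_decode_trace(self, tokens, current_pos) -> int:
        return 0  # pretend a trace was captured and return its id

    def _replay_decode_trace(self, trace_id, tokens, current_pos):
        return self.model(tokens, current_pos)

    def read_decode_output(self, tt_logits):
        return tt_logits  # in the real code this copies the device output to host
```

The prefill-length change amounts to clamping short sequences up to 128 tokens. A minimal sketch, assuming the helper keeps its existing rounding behavior for longer sequences (the power-of-two rounding shown here is illustrative, not taken from the PR):

```python
import math

MIN_PREFILL_SEQ_LEN = 128  # minimum sequence length accepted by llama3 attention


def get_padded_prefill_len(seq_len: int) -> int:
    """Pad short prefill lengths up to the 128-token minimum."""
    if seq_len <= MIN_PREFILL_SEQ_LEN:
        return MIN_PREFILL_SEQ_LEN
    # Longer sequences keep whatever rounding the existing helper applies;
    # power-of-two rounding is shown purely as an illustrative stand-in.
    return 2 ** math.ceil(math.log2(seq_len))
```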
Checklist