This repository has been archived by the owner on Oct 11, 2024. It is now read-only.
Upstream sync 2024 07 01 #350
Merged
robertgshaw2-neuralmagic merged 113 commits into main from upstream-sync-2024-07-01 on Jul 3, 2024
+12,471 -4,350
Commits
Commits on Jul 1, 2024
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (vllm-project#5414)
[Core] Refactor Worker and ModelRunner to consolidate control plane communication (vllm-project#5408)
[VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. (vllm-project#5905)
[Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high (vllm-project#5894)
[ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) (vllm-project#5928)
[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (vllm-project#5909)
[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (vllm-project#5940)
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (vllm-project#5348)
Commits on Jul 2, 2024