This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 07 01 #350

Merged

robertgshaw2-neuralmagic merged 113 commits into main from upstream-sync-2024-07-01

Jul 3, 2024

+12,471 -4,350

Commits on Jul 1, 2024

[Distributed] Add send and recv helpers (vllm-project#5719 )

andoorve
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (vllm-project#5772 )

Isotr0py
authored and
robertgshaw2-neuralmagic
committed
[doc][faq] add warning to download models for every nodes (vllm-project#5783 )

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[Doc] Add "Suggest edit" button to doc pages (vllm-project#5789 )

mgoin
authored and
robertgshaw2-neuralmagic
committed
[Doc] Add Phi-3-medium to list of supported models (vllm-project#5788 )

mgoin
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (vllm-project#5795 )

CatherineSue
authored and
robertgshaw2-neuralmagic
committed
[ci] Remove aws template (vllm-project#5757 )

khluu
authored and
robertgshaw2-neuralmagic
committed
[Doc] Add notice about breaking changes to VLMs (vllm-project#5818 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (vllm-project#5414 )

wooyeonlee0
authored and
robertgshaw2-neuralmagic
committed
[Misc] Remove useless code in cpu_worker (vllm-project#5824 )

DamonFool
authored and
robertgshaw2-neuralmagic
committed
[Core] Add fault tolerance for RayTokenizerGroupPool (vllm-project#5748 )

Yard1
authored and
robertgshaw2-neuralmagic
committed
[doc][distributed] add both gloo and nccl tests (vllm-project#5834 )

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] Add unit testing for FlexibleArgumentParser (vllm-project#5798 )

mgoin
authored and
robertgshaw2-neuralmagic
committed
[Misc] Update w4a16 compressed-tensors support to include w8a16 (vllm-project#5794 )

dsikka
authored and
robertgshaw2-neuralmagic
committed
[Hardware][TPU] Refactor TPU backend (vllm-project#5831 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
resolved

mawong-amd
authored and
robertgshaw2-neuralmagic
committed
[Hardware][TPU] Raise errors for unsupported sampling params (vllm-project#5850 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] Add E2E tests for MLPSpeculator (vllm-project#5791 )

tdoublep
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Fix assertion in NeuronExecutor (vllm-project#5841 )

aws-patlange
authored and
robertgshaw2-neuralmagic
committed
[Core] Refactor Worker and ModelRunner to consolidate control plane communication (vllm-project#5408 )

authored and
robertgshaw2-neuralmagic
committed
[Misc][Doc] Add Example of using OpenAI Server with VLM (vllm-project#5832 )

ywang96
authored and
robertgshaw2-neuralmagic
committed
[bugfix][distributed] fix shm broadcast when the queue size is full (vllm-project#5801 )

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Fix embedding to support 2D inputs (vllm-project#5829 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[Bugfix][TPU] Fix KV cache size calculation (vllm-project#5860 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] Refactor image test assets (vllm-project#5821 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[Kernel] Adding bias epilogue support for cutlass_scaled_mm (vllm-project#5560 )

authored and
robertgshaw2-neuralmagic
committed
[Frontend] Add tokenize/detokenize endpoints (vllm-project#5054 )

sasha0552
authored and
robertgshaw2-neuralmagic
committed
[Hardware][TPU] Support parallel sampling & Swapping (vllm-project#5855 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[Bugfix][TPU] Fix CPU cache allocation (vllm-project#5869 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
Support CPU inference with VSX PowerPC ISA (vllm-project#5652 )

ChipKerchner
authored and
robertgshaw2-neuralmagic
committed
[doc] update usage of env var to avoid conflict (vllm-project#5873 )

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[Misc] Add example for LLaVA-NeXT (vllm-project#5879 )

ywang96
authored and
robertgshaw2-neuralmagic
committed
[BugFix] Fix cuda graph for MLPSpeculator (vllm-project#5875 )

authored and
robertgshaw2-neuralmagic
committed
[Doc] Add note about context length in Phi-3-Vision example (vllm-project#5887 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly (vllm-project#5880 )

xwjiang2010
authored and
robertgshaw2-neuralmagic
committed
[Model] Add base class for LoRA-supported models (vllm-project#5018 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (vllm-project#5888 )

ywang96
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] [1/3] Reorganize entrypoints tests (vllm-project#5526 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (vllm-project#5896 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[doc][misc] add note for Kubernetes users (vllm-project#5916 )

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens (vllm-project#5876 )

njhill
authored and
robertgshaw2-neuralmagic
committed
[BugFix] Fix min_tokens behaviour for multiple eos tokens (vllm-project#5849 )

njhill
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] Fix Args for _get_logits_warper in Sampler Test (vllm-project#5922 )

ywang96
authored and
robertgshaw2-neuralmagic
committed
[Model] Add Gemma 2 (vllm-project#5908 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[core][misc] remove logical block (vllm-project#5882 )

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (vllm-project#5932 )

divakar-amd
authored and
robertgshaw2-neuralmagic
committed
[Hardware][TPU] Optimize KV cache swapping (vllm-project#5878 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. (vllm-project#5905 )

authored and
robertgshaw2-neuralmagic
committed
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (vllm-project#5956 )

Isotr0py
authored and
robertgshaw2-neuralmagic
committed
[Core] Registry for processing model inputs (vllm-project#5214 )

authored and
robertgshaw2-neuralmagic
committed
Unmark fused_moe config json file as executable (vllm-project#5960 )

tlrmchlsmth
authored and
robertgshaw2-neuralmagic
committed
[Hardware][Intel] OpenVINO vLLM backend (vllm-project#5379 )

ilya-lavrenov
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high (vllm-project#5894 )

tdoublep
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] [2/3] Reorganize entrypoints tests (vllm-project#5904 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[Distributed] Make it clear that % should not be in tensor dict keys. (vllm-project#5927 )

xwjiang2010
authored and
robertgshaw2-neuralmagic
committed
[Spec Decode] Introduce DraftModelRunner (vllm-project#5799 )

comaniac
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (vllm-project#5931 )

tlrmchlsmth
authored and
robertgshaw2-neuralmagic
committed
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) (vllm-project#5928 )

robertgshaw2-neuralmagic
and
Robert Shaw
committed
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (vllm-project#5921 )

robertgshaw2-neuralmagic
and
Robert Shaw
committed
Support Deepseek-V2 (vllm-project#4650 )

authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled (vllm-project#5936 )

mgoin
authored and
robertgshaw2-neuralmagic
committed
Unmark more files as executable (vllm-project#5962 )

tlrmchlsmth
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (vllm-project#5963 )

robertgshaw2-neuralmagic
and
Robert Shaw
committed
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (vllm-project#4628 )

authored and
robertgshaw2-neuralmagic
committed
[Bugfix][TPU] Fix TPU sampler output (vllm-project#5978 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[Bugfix][TPU] Fix pad slot id (vllm-project#5977 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] fix missing last itl in openai completions benchmark (vllm-project#5926 )

mcalman
authored and
robertgshaw2-neuralmagic
committed
[Misc] Extend vLLM Metrics logging API (vllm-project#5925 )

authored and
robertgshaw2-neuralmagic
committed
[Kernel] Add punica dimensions for Granite 3b and 8b (vllm-project#5930 )

joerunde
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Fix precisions in Gemma 1 (vllm-project#5913 )

WoosukKwon
authored and
robertgshaw2-neuralmagic
committed
[Misc] Update Phi-3-Vision Example (vllm-project#5981 )

authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Support eos_token_id from config.json (vllm-project#5954 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum (vllm-project#5974 )

Yard1
authored and
robertgshaw2-neuralmagic
committed
[Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k (vllm-project#5939 )

comaniac
authored and
robertgshaw2-neuralmagic
committed
[ CI/Build ] Added E2E Test For Compressed Tensors (vllm-project#5839 )

committed
[CI/Build] Add TP test for vision models (vllm-project#5892 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[ CI/Build ] LM Eval Harness Based CI Testing (vllm-project#5838 )

robertgshaw2-neuralmagic
and
Robert Shaw
committed
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (vllm-project#5949 )

mawong-amd
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] Temporarily Remove Phi3-Vision from TP Test (vllm-project#5989 )

ywang96
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] Reuse code for checking output consistency (vllm-project#5988 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[CI/Build] [3/3] Reorganize entrypoints tests (vllm-project#5966 )

DarkLight1337
authored and
robertgshaw2-neuralmagic
committed
[ci][distributed] fix device count call

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[Frontend]: Support base64 embedding (vllm-project#5935 )

authored and
robertgshaw2-neuralmagic
committed
[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (vllm-project#5909 )

authored and
robertgshaw2-neuralmagic
committed
[ CI ] Temporarily Disable Large LM-Eval Tests (vllm-project#6005 )

robertgshaw2-neuralmagic
and
[email protected]
committed
[Misc] Fix get_min_capability (vllm-project#5971 )

dsikka
authored and
robertgshaw2-neuralmagic
committed
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) (vllm-project#5940 )

robertgshaw2-neuralmagic
and
Robert Shaw
committed
format
robertgshaw2-neuralmagic
committed
isort
robertgshaw2-neuralmagic
committed
format
robertgshaw2-neuralmagic
committed
updated skipping
robertgshaw2-neuralmagic
committed
added skipping to compressed-tensors
robertgshaw2-neuralmagic
committed
updated
robertgshaw2-neuralmagic
committed
format
robertgshaw2-neuralmagic
committed
[misc][cuda] use nvml to avoid accidentally cuda initialization (vllm-project#6007 )

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (vllm-project#5348 )

sroy745
authored and
robertgshaw2-neuralmagic
committed
[ CI ] Re-enable Large Model LM Eval (vllm-project#6031 )
robertgshaw2-neuralmagic
committed
[doc][misc] remove deprecated api server in doc (vllm-project#6037 )

youkaichao
authored and
robertgshaw2-neuralmagic
committed
[Misc] update benchmark backend for scalellm (vllm-project#6018 )

zhyncs
authored and
robertgshaw2-neuralmagic
committed
[doc][misc] further lower visibility of simple api server (vllm-project#6041 )

authored and
robertgshaw2-neuralmagic
committed
[Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool (vllm-project#6039 )

Yard1
authored and
robertgshaw2-neuralmagic
committed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs (vllm-project#6029 )

avshalomman
authored and
robertgshaw2-neuralmagic
committed
add FAQ doc under 'serving' (vllm-project#5946 )

llmpros
authored and
robertgshaw2-neuralmagic
committed
[Bugfix][Doc] Fix Doc Formatting (vllm-project#6048 )

ywang96
authored and
robertgshaw2-neuralmagic
committed
