From c0d2ed9fb6eba71332d6f70b196978a39166059c Mon Sep 17 00:00:00 2001
From: Yuan Tang
Date: Sun, 3 Nov 2024 23:41:26 -0500
Subject: [PATCH] Redirect vLLM runtime guide to Hugging Face runtime overview
 (#408)

* Redirect vLLM runtime guide to Hugging Face runtime overview

Signed-off-by: Yuan Tang

* Update README.md

Signed-off-by: Yuan Tang

---------

Signed-off-by: Yuan Tang
---
 docs/modelserving/v1beta1/llm/vllm/README.md | 84 +-------------------------
 1 file changed, 2 insertions(+), 82 deletions(-)

diff --git a/docs/modelserving/v1beta1/llm/vllm/README.md b/docs/modelserving/v1beta1/llm/vllm/README.md
index 3bf459c0e..204638203 100644
--- a/docs/modelserving/v1beta1/llm/vllm/README.md
+++ b/docs/modelserving/v1beta1/llm/vllm/README.md
@@ -1,83 +1,3 @@
-## Deploy the LLaMA model with vLLM Runtime
-Serving LLM models can be surprisingly slow even on high end GPUs, [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers.
-It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization,
-[paged attention](https://vllm.ai) to address the memory bottleneck where in the autoregressive decoding process all the attention key value tensors(KV Cache) are kept in the GPU memory to generate next tokens.
+## vLLM Runtime
 
-You can deploy the LLaMA model with built vLLM inference server container image using the `InferenceService` yaml API spec.
-We have work in progress integrating `vLLM` with `Open Inference Protocol` and KServe observability stack.
-
-The LLaMA model can be downloaded from [huggingface](https://huggingface.co/meta-llama/Llama-2-7b) and upload to your cloud storage.
-
-=== "Yaml"
-
-
-    ```yaml
-    kubectl apply -n kserve-test -f - <
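
For readers who land on this patch without the removed guide, the deleted text described deploying a LLaMA model through a KServe `InferenceService` that runs a vLLM server container; the manifest itself is truncated in the diff above. The sketch below is a hypothetical reconstruction of that style of spec, not the exact YAML that was removed: the image reference, storage URI, model name, and resource sizes are placeholder assumptions.

```yaml
# Hypothetical sketch of an InferenceService running a custom vLLM server container.
# The image, STORAGE_URI value, and resource sizes are placeholders, not the exact
# values from the removed guide.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    containers:
      - name: kserve-container
        image: <your-registry>/vllm-server:latest   # placeholder image
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
          - --port=8080
          - --model=/mnt/models
        env:
          # Setting STORAGE_URI asks KServe's storage initializer to download the
          # model into /mnt/models before this container starts.
          - name: STORAGE_URI
            value: gs://<your-bucket>/llama-2-7b   # placeholder bucket path
        resources:
          limits:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 8Gi
            nvidia.com/gpu: "1"
```

Applying a spec like this with `kubectl apply -n kserve-test -f <file>` would have the predictor pod fetch the model referenced by `STORAGE_URI` into `/mnt/models` and serve it through vLLM's HTTP server. The current documentation routes this workflow through the Hugging Face serving runtime overview instead, which is why this guide now redirects there.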