From f020a6297e7539fdc5688fad366309fed5c0456a Mon Sep 17 00:00:00 2001
From: Simon Mo
Date: Sun, 11 Aug 2024 17:13:37 -0700
Subject: [PATCH] [Docs] Update readme (#7316)

---
 README.md             | 19 +++++++++++--------
 docs/source/index.rst | 13 +++++++------
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index a154eb20ddb68..6729a7aeb54e3 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ Easy, fast, and cheap LLM serving for everyone
 
-| Documentation | Blog | Paper | Discord |
+| Documentation | Blog | Paper | Discord | Twitter/X |
 
@@ -36,10 +36,12 @@ vLLM is fast with:
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629), FP8 KV Cache
-- Optimized CUDA kernels
+- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
+- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
+- Speculative decoding
+- Chunked prefill
 
-**Performance benchmark**: We include a [performance benchmark](https://buildkite.com/vllm/performance-benchmark/builds/4068) that compares the performance of vllm against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [text-generation-inference](https://github.com/huggingface/text-generation-inference) and [lmdeploy](https://github.com/InternLM/lmdeploy)).
+**Performance benchmark**: We include a [performance benchmark](https://buildkite.com/vllm/performance-benchmark/builds/4068) that compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [text-generation-inference](https://github.com/huggingface/text-generation-inference) and [lmdeploy](https://github.com/InternLM/lmdeploy)).
 
 vLLM is flexible and easy to use with:
 
@@ -48,20 +50,21 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism and pipeline parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs
-- (Experimental) Prefix caching support
-- (Experimental) Multi-lora support
+- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
+- Prefix caching support
+- Multi-lora support
 
 vLLM seamlessly supports most popular open-source models on HuggingFace, including:
 - Transformer-like LLMs (e.g., Llama)
 - Mixture-of-Expert LLMs (e.g., Mixtral)
+- Embedding Models (e.g. E5-Mistral)
 - Multi-modal LLMs (e.g., LLaVA)
 
 Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
 
 ## Getting Started
 
-Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
+Install vLLM with `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
 
 ```bash
 pip install vllm

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 8f06f2f2e5469..54e4806354575 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -31,8 +31,10 @@ vLLM is fast with:
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
 * Fast model execution with CUDA/HIP graph
-* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
-* Optimized CUDA kernels
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, INT4, INT8, and FP8
+* Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
+* Speculative decoding
+* Chunked prefill
 
 vLLM is flexible and easy to use with:
 
@@ -41,9 +43,9 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism and pipeline parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs and AMD GPUs
-* (Experimental) Prefix caching support
-* (Experimental) Multi-lora support
+* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
+* Prefix caching support
+* Multi-lora support
 
 For more information, check out the following:
 
@@ -53,7 +55,6 @@ For more information, check out the following:
 * :ref:`vLLM Meetups `.
 
 
-
 Documentation
 -------------
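
For reference, the `pip` install path and the OpenAI-compatible API server touched by this patch can be exercised end to end with a short sketch like the one below. The model name, port, and prompt are only placeholders, not anything prescribed by the patch, and the commands assume a recent vLLM release on a CUDA-capable machine:

```bash
# Install vLLM from PyPI (see the installation docs linked in the README
# for ROCm, CPU, TPU, and AWS Neuron builds).
pip install vllm

# Launch the OpenAI-compatible API server; facebook/opt-125m is just a
# small placeholder, any supported model name works here.
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m

# From another shell, query the server through the standard OpenAI
# completions endpoint (the server listens on port 8000 by default).
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 32
      }'
```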