Fix torchserve llm example link
Signed-off-by: Dan Sun <[email protected]>
yuzisun authored Nov 18, 2023
1 parent c8f6a1e commit daaa70a
Showing 1 changed file with 3 additions and 3 deletions.
docs/blog/articles/2023-10-08-KServe-0.11-release.md (6 changes: 3 additions & 3 deletions)
@@ -84,11 +84,11 @@ While `pip install` still works, we highly recommend using poetry to ensure pre

### LLM Runtimes

-### TorchServe LLM Runtime
+#### TorchServe LLM Runtime
KServe now integrates with TorchServe 0.8, offering support for [LLM models](https://pytorch.org/serve/large_model_inference.html) that may not fit onto a single GPU.
-Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the [detailed example](../../modelserving/v1beta1/llm/) for how to serve the LLM on KServe with the TorchServe runtime.
+Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the [detailed example](../../modelserving/v1beta1/llm/torchserve/accelerate/README.md) for how to serve the LLM on KServe with the TorchServe runtime.

-### vLLM Runtime
+#### vLLM Runtime
Serving LLM models can be surprisingly slow even on high-end GPUs. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Huggingface transformers.
It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization, and
[paged attention](https://vllm.ai) to address the memory bottleneck of autoregressive decoding, where all the attention key-value tensors (the KV cache) are kept in GPU memory to generate the next tokens.
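
For readers following the TorchServe LLM example linked above, the client side comes down to posting a prediction request to the InferenceService. Below is a minimal sketch using KServe's v1 prediction protocol; the service name `llm`, the hostname, the ingress address, and the payload shape are illustrative assumptions, and the linked README documents the exact request format for the Accelerate example.

```python
# Minimal sketch: call an LLM InferenceService over KServe's v1 prediction
# protocol. Service name, hostname, and payload are illustrative assumptions;
# see the linked TorchServe example for the exact schema.
import requests

INGRESS_URL = "http://localhost:8080"         # assumed: port-forwarded ingress gateway
SERVICE_HOSTNAME = "llm.default.example.com"  # assumed: external host of the InferenceService

payload = {"instances": [{"data": "What is model serving?"}]}

response = requests.post(
    f"{INGRESS_URL}/v1/models/llm:predict",   # v1 protocol: POST /v1/models/<name>:predict
    json=payload,
    headers={"Host": SERVICE_HOSTNAME},       # Host header routes the request to the service
    timeout=120,
)
response.raise_for_status()
print(response.json())
```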

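The vLLM paragraph above describes the standalone engine as much as the KServe runtime, and its core Python API is small. Here is a minimal sketch of offline generation with the vLLM package; the model name and sampling settings are arbitrary choices for illustration, not part of the KServe example.

```python
# Minimal sketch of vLLM's offline generation API. The model and sampling
# parameters are arbitrary illustrative choices; continuous batching and
# paged attention are handled internally by the engine.
from vllm import LLM, SamplingParams

prompts = [
    "What is KServe?",
    "Explain continuous batching in one sentence.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(model="facebook/opt-125m")  # small model so the sketch fits on a single GPU
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```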