How does this compare to triton inference server? #403
-
Forgive me, I am new to LLMs, so I don't know whether this project is actually comparable to Triton or not, but I would love some feedback on how they compare, if at all. Thank you in advance 🤗
Replies: 1 comment 1 reply
-
Hi @jebarpg Thanks for your interest and great question! NVIDIA Triton Inference Server is a serving system that provides high availability, observability, model versioning, etc. It needs to work together with an inference engine (a "backend") that actually runs the models on GPUs, such as vLLM, FasterTransformer, or PyTorch. Thus, while we haven't investigated it much, vLLM could potentially be used with Triton. For now, Ray Serve provides an example of using vLLM as their backend. Please check it out.
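For illustration, here is a minimal sketch of the kind of setup described above: a Ray Serve deployment that uses vLLM as its inference engine. This is not the official Ray Serve example; the model name, sampling parameters, and deployment class are placeholders chosen for this sketch.

```python
# Minimal sketch (assumptions: model name, route, and class are placeholders,
# not the official Ray Serve example) of serving vLLM behind Ray Serve.
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self):
        # vLLM plays the "inference engine" role here: it loads the model
        # and runs generation on the GPU.
        self.llm = LLM(model="facebook/opt-125m")
        self.sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

    async def __call__(self, request):
        # Ray Serve handles the HTTP layer (the "serving system" role).
        prompt = (await request.json())["prompt"]
        outputs = self.llm.generate([prompt], self.sampling_params)
        return {"text": outputs[0].outputs[0].text}

app = VLLMDeployment.bind()
# serve.run(app)  # then POST {"prompt": "..."} to http://localhost:8000/
```

The division of labor mirrors the Triton description: the serving layer (Ray Serve or Triton) provides routing, scaling, and availability, while the engine (vLLM) handles the GPU-side model execution.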