How does this compare to triton inference server? #403

Answered by WoosukKwon
jebarpg asked this question in Q&A

Hi @jebarpg, thanks for your interest and great question! NVIDIA Triton Inference Server is a serving system that provides high availability, observability, model versioning, etc. It needs to work with an inference engine ("backend") that actually runs the models on GPUs, such as vLLM, FasterTransformer, or PyTorch. Thus, while we haven't investigated it much, vLLM could potentially be used as a backend for Triton.

For now, Ray Serve provides an example of using vLLM as its backend. Please check it out.
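
For illustration, here is a minimal sketch (not the official Ray Serve example referenced above) of how vLLM's offline `LLM` API could be wrapped in a Ray Serve deployment. The model name, sampling settings, and route are placeholders chosen for the example.

```python
# Hypothetical sketch: serving vLLM behind Ray Serve.
from ray import serve
from starlette.requests import Request

from vllm import LLM, SamplingParams


@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self):
        # Each replica loads the model onto its GPU once.
        # "facebook/opt-125m" is only a placeholder model.
        self.llm = LLM(model="facebook/opt-125m")

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(temperature=0.8, max_tokens=64)
        # vLLM batches and schedules requests internally.
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}


app = VLLMDeployment.bind()
# Start with `serve.run(app)`, then POST {"prompt": "..."} to the Serve HTTP endpoint.
```

In this setup Ray Serve handles routing, scaling, and HTTP, while vLLM acts purely as the inference engine, which mirrors the serving-system/backend split described above for Triton.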

This discussion was converted from issue #398 on July 08, 2023 18:36.