Add setup instructions for TensorRT-LLM #789
Conversation
LGTM, will leave approval for Daya or Megha
1. Convert an MPT HuggingFace checkpoint into the FasterTransformer format.
2. Build a TensorRT engine with the FasterTransformer weights.

Using this engine, you can utilize TensorRT-LLM for fast inference. If you would like to use TensorRT-LLM as an end-to-end solution for an inference service, you can utilize the built engine with an NVIDIA Triton server backend: an example server can be found in [this repository](https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.6.1) accompanying the most recent release.
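For concreteness, the two steps above might look like the following sketch. The script paths and flag names are assumptions based on the `examples/mpt` directory of the TensorRT-LLM repository around v0.6.x; they change between releases, so check the example README for the version you have checked out.

```bash
# NOTE: script paths and flags below are assumptions (examples/mpt layout
# circa TensorRT-LLM v0.6.x); consult the example README for your release.

# 1. Convert an MPT HuggingFace checkpoint into the FasterTransformer format.
python examples/mpt/convert_hf_mpt_to_ft.py \
    -i mosaicml/mpt-7b \
    -o ./ft_ckpts/mpt-7b \
    --data_type fp16

# 2. Build a TensorRT engine from the converted FasterTransformer weights.
python examples/mpt/build.py \
    --model_dir ./ft_ckpts/mpt-7b/1-gpu \
    --output_dir ./trt_engines/mpt-7b \
    --dtype float16
```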
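Serving the built engine through the linked Triton backend could then look roughly like this; the model-repository layout (`all_models/inflight_batcher_llm`) is an assumption taken from the tensorrtllm_backend examples, and that repository's README is the authoritative reference.

```bash
# Sketch of serving the built engine with the Triton TensorRT-LLM backend.
# The model-repository layout below is an assumption based on the
# tensorrtllm_backend examples and may differ between releases.
git clone -b v0.6.1 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend

# Place the built engine where the example model repository expects it.
cp ../trt_engines/mpt-7b/* all_models/inflight_batcher_llm/tensorrt_llm/1/

# Launch Triton pointed at the example model repository.
tritonserver --model-repository=all_models/inflight_batcher_llm
```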
was "built engine" supposed to be "built-in engine"?
I'd rephrase it as "built TRT engine". Also, here again we should drop "most recent release" as suggested by Daniel above.
@linden-li can you pls make the suggested changes here? also, update TRT LLM link to v0.7.1?
@megha95 can we merge this?