This repository contains all the code and scripts needed to deploy a Hugging Face retrieval model, such as multilingual-e5-large, using NVIDIA's Triton Inference Server. The guide covers every step, from model export, configuration, and optimization to deploying the model on Triton for high-performance inference. The repository includes:
- Model conversion scripts (e.g., exporting to ONNX or TensorFlow formats that Triton can serve; see the export sketch below)
- Triton configuration files (`config.pbtxt`; see the example below)
- Docker setup for Triton Inference Server (see the launch command below)
- Load-testing code for sending inference requests (see the client sketch below)
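As a rough illustration of the export step, the sketch below converts multilingual-e5-large to ONNX with `torch.onnx.export`. The output path (`model_repository/e5/1/model.onnx`), tensor names, and opset version are illustrative assumptions, not necessarily the values used in this repository:

```python
# Sketch: export multilingual-e5-large to ONNX for Triton's ONNX Runtime backend.
# The output path and tensor names are assumptions; keep them in sync with config.pbtxt.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "intfloat/multilingual-e5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Dummy inputs only define tensor shapes and dtypes for tracing.
encoded = tokenizer("query: hello world", return_tensors="pt")

torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "model_repository/e5/1/model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={  # allow variable batch size and sequence length
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)
```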
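A minimal `config.pbtxt` for the exported model might then look like this; the model name, `max_batch_size`, and tensor names are assumptions carried over from the sketch above (1024 is multilingual-e5-large's hidden size). With `max_batch_size` set, Triton adds the batch dimension implicitly, so `dims` describe a single request:

```protobuf
name: "e5"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]          # variable sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, 1024 ]    # sequence length x hidden size
  }
]
dynamic_batching { }      # let Triton batch concurrent requests
```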
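With the model repository in place, the server can be launched from the official Triton image; the image tag here is only an example and may differ from the one used in this repository:

```bash
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models
```

Ports 8000, 8001, and 8002 are Triton's HTTP, gRPC, and metrics endpoints, respectively.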
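Finally, a minimal Python client for smoke-testing the deployment, assuming the `tritonclient` package is installed and the model/tensor names match the sketches above:

```python
# Sketch: send one inference request to Triton over HTTP and read back the embeddings.
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
client = httpclient.InferenceServerClient(url="localhost:8000")

encoded = tokenizer(["query: hello world"], return_tensors="np")
inputs = []
for name in ("input_ids", "attention_mask"):
    tensor = encoded[name].astype(np.int64)
    infer_input = httpclient.InferInput(name, list(tensor.shape), "INT64")
    infer_input.set_data_from_numpy(tensor)
    inputs.append(infer_input)

result = client.infer(model_name="e5", inputs=inputs)
hidden = result.as_numpy("last_hidden_state")
print(hidden.shape)  # (batch, sequence, 1024) for multilingual-e5-large
```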
A complete step-by-step guide is available in my blog post, *Deploying a Sentence Transformer with Triton Inference Server*, which explains the deployment process and how to use the files provided in this repository.
Clone this repository:

```bash
git clone git@github.com:rproskuryakov/triton-sentence-transformer-tutorial.git
cd triton-sentence-transformer-tutorial
```
Follow the instructions in the guide to set up Triton and deploy your model.
This project is licensed under the MIT License.