📖A curated list of Awesome LLM/VLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism, etc. 🎉🎉
Updated Nov 28, 2024
A nearly-live implementation of OpenAI's Whisper.
An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support for Optimum's hardware optimizations & quantization schemes.
OpenAI-compatible API for the TensorRT-LLM Triton backend
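[Deep learning model deployment framework] Supports tf/torch/trt/trtllm/vllm and other NN frameworks, dynamic batching and streaming modes, and both Python and C++. Limitable, extensible, and high-performance. Helps users quickly deploy models online and serve them through HTTP/RPC interfaces.
[grps with trtllm] A pure C++, high-performance OpenAI-compatible LLM service built with GRPS + TensorRT-LLM + Tokenizers.cpp. Supports chat and function call modes, AI agents, distributed multi-GPU inference, multimodal models, and a Gradio chat interface.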
This repository contains AI Bootcamp material that consists of a workflow for LLM
Chat With RTX Python API
Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.
Add-in for the new Outlook that adds LLM features (composition, summarization, Q&A). It uses a local LLM via NVIDIA TensorRT-LLM.
Accelerating large-model inference frameworks to make LLMs fly
Whisper in TensorRT-LLM
TensorRT-LLM server with Structured Outputs (JSON) built with Rust
Getting started with TensorRT-LLM using BLOOM as a case study
LLM tutorial materials, including but not limited to NVIDIA NeMo, TensorRT-LLM, Triton Inference Server, and NeMo Guardrails.