Add setup instructions for TensorRT-LLM #789

Changes to `scripts/inference/README.md`:

# LLM Inference

This folder contains helper scripts for exporting and generating outputs with your Composer-trained LLMs.

Table of Contents:

- [LLM Inference](#llm-inference)
- [Converting a Composer checkpoint to an HF checkpoint folder](#converting-a-composer-checkpoint-to-an-hf-checkpoint-folder)
- [Interactive Generation with HF models](#interactive-generation-with-hf-models)
- [Interactive Chat with HF models](#interactive-chat-with-hf-models)
- [Converting an HF model to ONNX](#converting-an-hf-model-to-onnx)
- [Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer)
- [Download and Convert](#download-and-convert)
- [Pre-Download the Model and Convert](#pre-download-the-model-and-convert)
- [Converting a Composer MPT to FasterTransformer](#converting-a-composer-mpt-to-fastertransformer)
- [Running MPT with FasterTransformer](#running-mpt-with-fastertransformer)
- [Running MPT with TensorRT-LLM](#running-mpt-with-tensorrt-llm)

## Converting a Composer checkpoint to an HF checkpoint folder

At the end of your training runs, you will see a collection of Composer `Trainer` checkpoints (e.g. `ep0-ba2000-rank0.pt`). These checkpoints bundle the model weights together with other training state.
To extract these pieces, we provide a script `convert_composer_to_hf.py` that converts a Composer checkpoint directly to a standard HF checkpoint folder. For example:

<!--pytest.mark.skip-->

```bash
python convert_composer_to_hf.py --composer_path ep0-ba2000-rank0.pt --hf_output_path my_hf_model/ --output_precision bf16
```
You can also pass object store URIs for both `--composer_path` and `--hf_output_path`.
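
For example, reading the checkpoint from S3 and writing the HF folder back to S3 might look like the following (the bucket and prefixes below are placeholders):

<!--pytest.mark.skip-->

```bash
python convert_composer_to_hf.py \
  --composer_path s3://my-bucket/my-run/checkpoints/ep0-ba2000-rank0.pt \
  --hf_output_path s3://my-bucket/my-run/hf/ \
  --output_precision bf16
```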
## Interactive Generation with HF models

To make it easy to inspect the generations produced by your HF model, we include a script `hf_generate.py` that allows you to run custom prompts through your HF model, like so:

<!--pytest.mark.skip-->

```bash
python hf_generate.py \
    --name_or_path gpt2 \
    --max_new_tokens 256
```
which will produce output:

<!--pytest.mark.skip-->

```bash
Loading HF model...
n_params=124439808
...
```

For MPT models specifically, you can pass args like `--attn_impl triton`.
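
For instance, a generation run with an MPT model and the Triton attention implementation might look like the sketch below. The model name is illustrative, and the `--prompts` flag is an assumption about the script's interface; check `python hf_generate.py --help` for the exact flags:

<!--pytest.mark.skip-->

```bash
# --prompts is assumed from the script's custom-prompt behavior described above
python hf_generate.py \
    --name_or_path mosaicml/mpt-7b \
    --attn_impl triton \
    --max_new_tokens 128 \
    --prompts "Here's a quick recipe for baking chocolate chip cookies: Start by"
```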
## Interactive Chat with HF models

Chat models need to pass conversation history back to the model for multi-turn conversations. To make that easier, we include `hf_chat.py`. Chat models usually require an introductory/system prompt, as well as a wrapper around user and model messages, to fit the training format. Default values work with our ChatML-trained models, but you can set other important values like generation kwargs:

<!--pytest.mark.skip-->

```bash
# using an MPT/ChatML style model
python hf_chat.py -n mosaicml/mpt-7b-chat-v2
```

<!--pytest.mark.skip-->

```bash
# using an MPT/ChatML style model on > 1 GPU
python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
Expand All @@ -137,6 +148,7 @@ python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
The script also works with other style models. Here is an example of using it with a Vicuna-style model:

<!--pytest.mark.skip-->

```bash
python hf_chat.py -n eachadea/vicuna-7b-1.1 --system_prompt="A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." --user_msg_fmt="USER: {}\n" --assistant_msg_fmt="ASSISTANT: {}\n" --max_new_tokens=512
```
Expand All @@ -158,6 +170,7 @@ of exporting and working with HuggingFace models with ONNX, see <https://hugging
Here is an example of using the script:

<!--pytest.mark.skip-->

```bash
# 1) Local export
python convert_hf_to_onnx.py --pretrained_model_name_or_path local/path/to/huggingface/folder --output_folder local/folder
```
Please open a GitHub issue if you discover any problems!
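
To sanity-check an exported model, you can try loading it with `onnxruntime` (a minimal sketch; the `model.onnx` filename is an assumption about what the export produces, so adjust the path to match your output folder):

<!--pytest.mark.skip-->

```bash
# Load the exported graph to verify it is well-formed (filename is an assumption)
python -c "import onnxruntime as ort; ort.InferenceSession('local/folder/model.onnx', providers=['CPUExecutionProvider']); print('ONNX model loaded')"
```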

## Converting an HF MPT to FasterTransformer

We include a script `convert_hf_mpt_to_ft.py` that converts HuggingFace MPT checkpoints to the
[FasterTransformer](https://github.com/NVIDIA/FasterTransformer) format. This makes the checkpoints compatible with the
FasterTransformer library, which can be used to run transformer models on GPUs.
You can either pre-download the model into a local directory or pass the HF model name directly; in the latter case, the script downloads the model and converts the checkpoint to the FasterTransformer format.

### Download and Convert

```bash
# The script handles the download
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o mpt-ft-7b --infer_gpu_num 1
```

### Pre-Download the Model and Convert

```bash
apt update
apt install git-lfs
git lfs install
git clone https://huggingface.co/mosaicml/mpt-7b
# This will convert the MPT checkpoint in mpt-7b dir and save the converted checkpoint to mpt-ft-7b dir
python convert_hf_mpt_to_ft.py -i mpt-7b -o mpt-ft-7b --infer_gpu_num 1
```

You can change `infer_gpu_num` to > 1 to prepare an FT checkpoint for multi-GPU inference. Please open a GitHub issue if you discover any problems!
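
For example, to prepare a checkpoint for inference across 4 GPUs (the GPU count below is illustrative):

```bash
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o mpt-ft-7b --infer_gpu_num 4
```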

## Converting a Composer MPT to FasterTransformer

We include a script `convert_composer_mpt_to_ft.py` that directly converts a Composer MPT checkpoint to the FasterTransformer format. You can provide either a path to a local Composer checkpoint or a URI to a file stored in a cloud object store supported by Composer (e.g. `s3://`). Simply run:

```bash
python convert_composer_mpt_to_ft.py -i <path_to_composer_checkpoint.pt> -o mpt-ft-7b --infer_gpu_num 1
```
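
For instance, converting directly from a checkpoint stored in S3 might look like this (the bucket and key below are placeholders):

```bash
python convert_composer_mpt_to_ft.py -i s3://my-bucket/checkpoints/ep0-ba2000-rank0.pt -o mpt-ft-7b --infer_gpu_num 1
```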

## Running MPT with FasterTransformer

This step assumes that you have already converted an MPT checkpoint to the FT format by following the instructions in
[Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer). It also assumes that you have:

1. Built FasterTransformer for PyTorch by following the instructions
   [here](https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md#build-the-project).
2. A PyTorch install that supports [MPI as a distributed communication
   backend](https://pytorch.org/docs/stable/distributed.html#backends-that-come-with-pytorch). You need to build and
   install PyTorch from source to include MPI as a backend.
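
To quickly verify that your PyTorch build includes the MPI backend, you can run this one-liner:

```bash
python -c "import torch.distributed as dist; print(dist.is_mpi_available())"
```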

Once the above steps are complete, you can run MPT using the following commands:

```bash
# For running on a single GPU
PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.py --ckpt_path mpt-ft-7b/1-gpu \
    --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so

# For running with a sample input file and writing generations to an output file
PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.py \
    --ckpt_path mpt-ft-7b/1-gpu --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so \
    --sample_input_file prompts.txt --sample_output_file output.txt
```
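
If your PyTorch build includes MPI support, a multi-GPU launch might look like the sketch below. It assumes an FT checkpoint converted with `--infer_gpu_num 2`, and that the converter writes it to a `2-gpu` subdirectory by analogy with the `1-gpu` directory above (that directory name is an assumption):

```bash
# Sketch: one MPI rank per GPU; the 2-gpu directory name is an assumption
PYTHONPATH=/mnt/work/FasterTransformer mpirun -n 2 python scripts/inference/run_mpt_with_ft.py \
    --ckpt_path mpt-ft-7b/2-gpu \
    --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so
```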

## Running MPT with TensorRT-LLM

MPT-like architectures can be used with NVIDIA's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) library for language model inference. To do so, follow the instructions in the [examples/mpt](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/examples/mpt) directory, which will show you how to:

1. Convert an MPT HuggingFace checkpoint into the FasterTransformer format.
2. Build a TensorRT engine with the FasterTransformer weights.

Using this engine, you can run fast inference with TensorRT-LLM. If you would like to use TensorRT-LLM as an end-to-end solution for an inference service, you can serve the built TRT engine with an NVIDIA Triton inference server backend: an example server can be found in [this repository](https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.6.1).
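
To follow those instructions locally, you can check out the pinned release so the docs match the code (the clone destination is arbitrary):

```bash
git clone --branch v0.6.1 --depth 1 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/mpt
```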