Add setup instructions for TensorRT-LLM #789

Changes to `scripts/inference/README.md`:

# LLM Inference

This folder contains helper scripts for exporting and generating outputs with your Composer-trained LLMs.

Table of Contents:

- [LLM Inference](#llm-inference)
- [Converting a Composer checkpoint to an HF checkpoint folder](#converting-a-composer-checkpoint-to-an-hf-checkpoint-folder)
- [Interactive Generation with HF models](#interactive-generation-with-hf-models)
- [Interactive Chat with HF models](#interactive-chat-with-hf-models)
- [Converting an HF model to ONNX](#converting-an-hf-model-to-onnx)
- [Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer)
- [Download and Convert](#download-and-convert)
- [Pre-Download the Model and Convert](#pre-download-the-model-and-convert)
- [Converting a Composer MPT to FasterTransformer](#converting-a-composer-mpt-to-fastertransformer)
- [Running MPT with FasterTransformer](#running-mpt-with-fastertransformer)
- [Running MPT with TensorRT-LLM](#running-mpt-with-tensorrt-llm)

## Converting a Composer checkpoint to an HF checkpoint folder

At the end of your training runs, you will see a collection of Composer `Trainer` checkpoints (e.g. `ep0-ba2000-rank0.pt`). These checkpoints bundle the model weights together with other training state.
To extract these pieces, we provide a script `convert_composer_to_hf.py` that converts a Composer checkpoint directly to a standard HF checkpoint folder. For example:

<!--pytest.mark.skip-->

```bash
python convert_composer_to_hf.py --composer_path ep0-ba2000-rank0.pt --hf_output_path my_hf_model/ --output_precision bf16
```
You can also pass object store URIs for both `--composer_path` and `--hf_output_path`.
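
For example, reading the checkpoint from S3 and writing the HF folder back to S3 might look like the following (the bucket and prefixes below are placeholders):

<!--pytest.mark.skip-->

```bash
python convert_composer_to_hf.py \
  --composer_path s3://my-bucket/my-run/checkpoints/ep0-ba2000-rank0.pt \
  --hf_output_path s3://my-bucket/my-run/hf/ \
  --output_precision bf16
```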
## Interactive Generation with HF models

To make it easy to inspect the generations produced by your HF model, we include a script `hf_generate.py` that allows you to run custom prompts through your HF model, like so:

<!--pytest.mark.skip-->

```bash
python hf_generate.py \
    --name_or_path gpt2 \
    --max_new_tokens 256
```
which will produce output:

<!--pytest.mark.skip-->

```bash
Loading HF model...
n_params=124439808
...
```

For MPT models specifically, you can pass args like `--attn_impl triton`.
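
For instance, a generation run with an MPT model and the Triton attention implementation might look like the sketch below. The model name is illustrative, and the `--prompts` flag is an assumption about the script's interface; check `python hf_generate.py --help` for the exact flags:

<!--pytest.mark.skip-->

```bash
# --prompts is assumed from the script's custom-prompt behavior described above
python hf_generate.py \
    --name_or_path mosaicml/mpt-7b \
    --attn_impl triton \
    --max_new_tokens 128 \
    --prompts "Here's a quick recipe for baking chocolate chip cookies: Start by"
```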
## Interactive Chat with HF models

Chat models need to pass conversation history back to the model for multi-turn conversations. To make that easier, we include `hf_chat.py`. Chat models usually require an introductory/system prompt, as well as a wrapper around user and model messages, to fit the training format. Default values work with our ChatML-trained models, but you can set other important values like generation kwargs:

<!--pytest.mark.skip-->

```bash
# using an MPT/ChatML style model
python hf_chat.py -n mosaicml/mpt-7b-chat-v2
```

<!--pytest.mark.skip-->

```bash
# using an MPT/ChatML style model on > 1 GPU
python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
Expand All @@ -137,6 +148,7 @@ python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
The script also works with other style models. Here is an example of using it with a Vicuna-style model:

<!--pytest.mark.skip-->

```bash
python hf_chat.py -n eachadea/vicuna-7b-1.1 --system_prompt="A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." --user_msg_fmt="USER: {}\n" --assistant_msg_fmt="ASSISTANT: {}\n" --max_new_tokens=512
```
Expand All @@ -158,6 +170,7 @@ of exporting and working with HuggingFace models with ONNX, see <https://hugging
Here is an example of using the script:

<!--pytest.mark.skip-->

```bash
# 1) Local export
python convert_hf_to_onnx.py --pretrained_model_name_or_path local/path/to/huggingface/folder --output_folder local/folder
```
Please open a GitHub issue if you discover any problems!
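
To sanity-check an exported model, you can try loading it with `onnxruntime` (a minimal sketch; the `model.onnx` filename is an assumption about what the export produces, so adjust the path to match your output folder):

<!--pytest.mark.skip-->

```bash
# Load the exported graph to verify it is well-formed (filename is an assumption)
python -c "import onnxruntime as ort; ort.InferenceSession('local/folder/model.onnx', providers=['CPUExecutionProvider']); print('ONNX model loaded')"
```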

## Converting an HF MPT to FasterTransformer

We include a script `convert_hf_mpt_to_ft.py` that converts HuggingFace MPT checkpoints to the
[FasterTransformer](https://github.com/NVIDIA/FasterTransformer) format. This makes the checkpoints compatible with the
FasterTransformer library, which can be used to run transformer models on GPUs.
You can either pre-download the model into a local directory or pass the HF model name directly; in the latter case, the script downloads the model and converts the checkpoint to the FasterTransformer format.

### Download and Convert

```bash
# The script handles the download
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o mpt-ft-7b --infer_gpu_num 1
```

### Pre-Download the Model and Convert

```bash
apt update
apt install git-lfs
git lfs install
git clone https://huggingface.co/mosaicml/mpt-7b
# This will convert the MPT checkpoint in mpt-7b dir and save the converted checkpoint to mpt-ft-7b dir
python convert_hf_mpt_to_ft.py -i mpt-7b -o mpt-ft-7b --infer_gpu_num 1
```

You can change `infer_gpu_num` to > 1 to prepare an FT checkpoint for multi-GPU inference. Please open a GitHub issue if you discover any problems!
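
For example, to prepare a checkpoint for inference across 4 GPUs (the GPU count below is illustrative):

```bash
python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o mpt-ft-7b --infer_gpu_num 4
```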

## Converting a Composer MPT to FasterTransformer

We include a script `convert_composer_mpt_to_ft.py` that directly converts a Composer MPT checkpoint to the FasterTransformer format. You can provide either a path to a local Composer checkpoint or a URI to a file stored in a cloud object store supported by Composer (e.g. `s3://`). Simply run:

```bash
python convert_composer_mpt_to_ft.py -i <path_to_composer_checkpoint.pt> -o mpt-ft-7b --infer_gpu_num 1
```
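
For instance, converting directly from a checkpoint stored in S3 might look like this (the bucket and key below are placeholders):

```bash
python convert_composer_mpt_to_ft.py -i s3://my-bucket/checkpoints/ep0-ba2000-rank0.pt -o mpt-ft-7b --infer_gpu_num 1
```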

## Running MPT with FasterTransformer

This step assumes that you have already converted an MPT checkpoint to the FT format by following the instructions in
[Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer). It also assumes that you have:

1. Built FasterTransformer for PyTorch by following the instructions
   [here](https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md#build-the-project).
2. A PyTorch install that supports [MPI as a distributed communication
   backend](https://pytorch.org/docs/stable/distributed.html#backends-that-come-with-pytorch). You need to build and
   install PyTorch from source to include MPI as a backend.
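
To quickly verify that your PyTorch build includes the MPI backend, you can run this one-liner:

```bash
python -c "import torch.distributed as dist; print(dist.is_mpi_available())"
```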

Once the above steps are complete, you can run MPT using the following commands:

```bash
# For running on a single GPU
PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.py --ckpt_path mpt-ft-7b/1-gpu \
    --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so

# For running with a sample input file and writing generations to an output file
PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.py \
    --ckpt_path mpt-ft-7b/1-gpu --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so \
    --sample_input_file prompts.txt --sample_output_file output.txt
```
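
If your PyTorch build includes MPI support, a multi-GPU launch might look like the sketch below. It assumes an FT checkpoint converted with `--infer_gpu_num 2`, and that the converter writes it to a `2-gpu` subdirectory by analogy with the `1-gpu` directory above (that directory name is an assumption):

```bash
# Sketch: one MPI rank per GPU; the 2-gpu directory name is an assumption
PYTHONPATH=/mnt/work/FasterTransformer mpirun -n 2 python scripts/inference/run_mpt_with_ft.py \
    --ckpt_path mpt-ft-7b/2-gpu \
    --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so
```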

## Running MPT with TensorRT-LLM

MPT-like architectures can be used with NVIDIA's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) library for language model inference. To do so, follow the instructions in the [examples/mpt](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/examples/mpt) directory, which will show you how to:

1. Convert an MPT HuggingFace checkpoint into the FasterTransformer format.
2. Build a TensorRT engine with the FasterTransformer weights.

Using this engine, you can run fast inference with TensorRT-LLM. If you would like to use TensorRT-LLM as an end-to-end solution for an inference service, you can serve the built TRT engine with an NVIDIA Triton inference server backend: an example server can be found in [this repository](https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.6.1).
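
To follow those instructions locally, you can check out the pinned release so the docs match the code (the clone destination is arbitrary):

```bash
git clone --branch v0.6.1 --depth 1 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/mpt
```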