diff --git a/scripts/inference/README.md b/scripts/inference/README.md
index e5faeb8df0..f930935144 100644
--- a/scripts/inference/README.md
+++ b/scripts/inference/README.md
@@ -3,12 +3,18 @@
 This folder contains helper scripts for exporting and generating outputs with your Composer-trained LLMs.

 Table of Contents:
-- [Converting a Composer checkpoint to an HF checkpoint folder](#converting-a-composer-checkpoint-to-an-hf-checkpoint-folder)
-- [Interactive Generation with HF models](#interactive-generation-with-hf-models)
-- [Interactive Chat with HF models](#interactive-chat-with-hf-models)
-- [Converting an HF model to ONNX](#converting-an-hf-model-to-onnx)
-- [Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer)
-- [Running MPT with FasterTransformer](#running-mpt-with-fastertransformer)
+
+- [LLM Inference](#llm-inference)
+  - [Converting a Composer checkpoint to an HF checkpoint folder](#converting-a-composer-checkpoint-to-an-hf-checkpoint-folder)
+  - [Interactive Generation with HF models](#interactive-generation-with-hf-models)
+  - [Interactive Chat with HF models](#interactive-chat-with-hf-models)
+  - [Converting an HF model to ONNX](#converting-an-hf-model-to-onnx)
+  - [Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer)
+    - [Download and Convert](#download-and-convert)
+    - [Pre-Download the Model and Convert](#pre-download-the-model-and-convert)
+  - [Converting a Composer MPT to FasterTransformer](#converting-a-composer-mpt-to-fastertransformer)
+  - [Running MPT with FasterTransformer](#running-mpt-with-fastertransformer)
+  - [Running MPT with TensorRT-LLM](#running-mpt-with-tensorrt-llm)

 ## Converting a Composer checkpoint to an HF checkpoint folder

@@ -19,6 +25,7 @@ At the end of your training runs, you will see a collection of Composer `Trainer
 To extract these pieces, we provide a script `convert_composer_to_hf.py` that converts a Composer checkpoint directly to a standard HF checkpoint folder. For example:
+
 ```bash
 python convert_composer_to_hf.py --composer_path ep0-ba2000-rank0.pt --hf_output_path my_hf_model/ --output_precision bf16
 ```
@@ -46,6 +53,7 @@ You can also pass object store URIs for both `--composer_path` and `--hf_output_path`
 To make it easy to inspect the generations produced by your HF model, we include a script `hf_generate.py` that allows you to run custom prompts through your HF model, like so:
+
 ```bash
 python hf_generate.py \
   --name_or_path gpt2 \
@@ -64,6 +72,7 @@ python hf_generate.py \
 which will produce output:
+
 ```bash
 Loading HF model...
 n_params=124439808
@@ -112,6 +121,7 @@ For MPT models specifically, you can pass args like `--attn_impl flash`, and `--
 Chat models need to pass conversation history back to the model for multi-turn conversations. To make that easier, we include `hf_chat.py`.
 Chat models usually require an introductory/system prompt, as well as a wrapper around user and model messages, to fit the training format. Default values work with our ChatML-trained models, but you can set other important values like generation kwargs:
+
 ```bash
 # using an MPT/ChatML style model
 python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
@@ -123,6 +133,7 @@ python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
 ```
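+
+If you would rather script a single chat turn in Python than use the interactive `hf_chat.py` loop shown above, the sketch below shows one way to do it with plain `transformers`. It is illustrative only and not part of llm-foundry: the model name comes from the example above, while the ChatML wrapping, system prompt, and generation kwargs are assumptions you should adapt to your own model.
+
+```python
+# Minimal sketch: one ChatML-formatted turn with an MPT chat model, outside hf_chat.py.
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+name = "mosaicml/mpt-7b-chat-v2"  # assumed: any ChatML-style chat model should work
+tokenizer = AutoTokenizer.from_pretrained(name)
+model = AutoModelForCausalLM.from_pretrained(
+    name,
+    torch_dtype="auto",
+    device_map="auto",       # spread across available devices (requires accelerate)
+    trust_remote_code=True,  # MPT ships custom modeling code
+)
+
+# ChatML wrapping: a system prompt, then alternating user/assistant messages.
+prompt = (
+    "<|im_start|>system\nA conversation between a user and a helpful assistant.<|im_end|>\n"
+    "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n"
+    "<|im_start|>assistant\n"
+)
+
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.3, top_p=0.95)
+print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+```
+
+The same idea extends to multi-turn conversations by appending each exchange to the prompt; for interactive use, `hf_chat.py` (continued below for the multi-GPU case) handles that bookkeeping for you.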
+
 ```bash
 # using an MPT/ChatML style model on > 1 GPU
 python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
@@ -137,6 +148,7 @@ python hf_chat.py -n mosaicml/mpt-7b-chat-v2 \
 The script also works with other style models.
 Here is an example of using it with a Vicuna-style model:
+
 ```bash
 python hf_chat.py -n eachadea/vicuna-7b-1.1 --system_prompt="A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." --user_msg_fmt="USER: {}\n" --assistant_msg_fmt="ASSISTANT: {}\n" --max_new_tokens=512
 ```
@@ -158,6 +170,7 @@ of exporting and working with HuggingFace models with ONNX, see
+
 ```bash
 # 1) Local export
 python convert_hf_to_onnx.py --pretrained_model_name_or_path local/path/to/huggingface/folder --output_folder local/folder
@@ -175,6 +188,7 @@ python convert_hf_to_onnx.py --pretrained_model_name_or_path local/path/to/huggi
 Please open a GitHub issue if you discover any problems!

 ## Converting an HF MPT to FasterTransformer
+
 We include a script `convert_hf_mpt_to_ft.py` that converts HuggingFace MPT checkpoints to the
 [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) format. This makes the checkpoints compatible with the
 FasterTransformer library, which can be used to run transformer models on GPUs.
@@ -183,12 +197,14 @@ You can either pre-download the model in a local dir or directly provide the HF
 checkpoint to FasterTransformer format.

 ### Download and Convert
+
 ```
 # The script handles the download
 python convert_hf_mpt_to_ft.py -i mosaicml/mpt-7b -o mpt-ft-7b --infer_gpu_num 1
 ```

 ### Pre-Download the Model and Convert
+
 ```
 apt update
 apt install git-lfs
@@ -197,25 +213,31 @@ git clone https://huggingface.co/mosaicml/mpt-7b
 # This will convert the MPT checkpoint in mpt-7b dir and save the converted checkpoint to mpt-ft-7b dir
 python convert_hf_mpt_to_ft.py -i mpt-7b -o mpt-ft-7b --infer_gpu_num 1
 ```
+
 You can change `infer_gpu_num` to > 1 to prepare an FT checkpoint for multi-GPU inference.

 Please open a GitHub issue if you discover any problems!

 ## Converting a Composer MPT to FasterTransformer
+
 We include a script `convert_composer_mpt_to_ft.py` that directly converts a Composer MPT checkpoint to the FasterTransformer format.

 You can either provide a path to a local Composer checkpoint or a URI to a file stored in a cloud supported by Composer (e.g. `s3://`). Simply run:
+
 ```
 python convert_composer_mpt_to_ft.py -i <composer_checkpoint_path> -o mpt-ft-7b --infer_gpu_num 1
 ```

 ## Running MPT with FasterTransformer
+
 This step assumes that you already have converted an MPT checkpoint to FT format by following the instructions in
 [Converting an HF MPT to FasterTransformer](#converting-an-hf-mpt-to-fastertransformer).
 It also assumes that you have
+
 1. Built FasterTransformer for PyTorch by following the instructions
-[here](https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md#build-the-project)
+   [here](https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md#build-the-project)
 2. A PyTorch install that supports [MPI as distributed communication
-backend](https://pytorch.org/docs/stable/distributed.html#backends-that-come-with-pytorch). You need to build and
-install PyTorch
-from source to include MPI as a backend.
+   backend](https://pytorch.org/docs/stable/distributed.html#backends-that-come-with-pytorch). You need to build and
+   install PyTorch
+   from source to include MPI as a backend (a quick way to check this is sketched below).
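+
+Before moving on, it can be worth confirming the second prerequisite. The snippet below is a small sanity-check sketch, run with the PyTorch build you intend to use; it only verifies that MPI is compiled in as a `torch.distributed` backend and does not exercise the FasterTransformer build itself.
+
+```python
+# Quick check for prerequisite 2: was this PyTorch build compiled with MPI support?
+import torch
+import torch.distributed as dist
+
+print("PyTorch version:", torch.__version__)
+print("MPI backend available:", dist.is_mpi_available())  # expect True if MPI was built in
+```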
 Once the above steps are complete, you can run MPT using the following commands:
+
 ```
 # For running on a single gpu and benchmarking
 PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.py --ckpt_path mpt-ft-7b/1-gpu \
@@ -236,3 +258,12 @@ PYTHONPATH=/mnt/work/FasterTransformer python scripts/inference/run_mpt_with_ft.
   --ckpt_path mpt-ft-7b/1-gpu --lib_path /mnt/work/FasterTransformer/build/lib/libth_transformer.so \
   --sample_input_file prompts.txt --sample_output_file output.txt
 ```
+
+## Running MPT with TensorRT-LLM
+
+MPT-like architectures can be used with NVIDIA's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) library for language model inference. To do so, follow the instructions in the [examples/mpt](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/examples/mpt) directory for the most recent release, which will show you how to:
+
+1. Convert an MPT HuggingFace checkpoint into the FasterTransformer format.
+2. Build a TensorRT engine with the FasterTransformer weights.
+
+Using this engine, you can run fast inference with TensorRT-LLM. If you would like to use TensorRT-LLM as an end-to-end solution for an inference service, you can pair the built engine with an NVIDIA Triton Inference Server backend; an example server accompanying the most recent release can be found in [this repository](https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.6.1).
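+
+As a final note, the pre-download step used in the FasterTransformer flow above (and typically needed before the TensorRT-LLM conversion as well) can also be done without `git lfs`. The sketch below is one alternative; it assumes the `huggingface_hub` package is installed and uses `mosaicml/mpt-7b` purely as an example.
+
+```python
+# Sketch: pre-download an MPT checkpoint without git-lfs, using huggingface_hub.
+from huggingface_hub import snapshot_download
+
+local_dir = snapshot_download(repo_id="mosaicml/mpt-7b", local_dir="mpt-7b")
+print("Checkpoint downloaded to:", local_dir)
+```
+
+The resulting folder can then be passed to the converters exactly as in the pre-download example above, e.g. `python convert_hf_mpt_to_ft.py -i mpt-7b -o mpt-ft-7b --infer_gpu_num 1`.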