- Download the Llama2-70B weights from Meta (https://llama.meta.com/).
- Repack the weights:

  ```bash
  # This concatenates the sharded checkpoints and makes it easier for us to load.
  python models/demos/t3000/llama2_70b/scripts/repack_weights.py <path_to_checkpoint_dir> <repacked_output_dir> <chunk_size>
  ```

  Note: Use `5` for `chunk_size`.

  Once the weights are repacked, move the `params.json` file from the `checkpoint_dir` to the `repacked_output_dir`.
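  For example, with illustrative paths (substitute your own directories) and the chunk size of 5 noted above, this step might look like:

  ```bash
  # Illustrative paths; replace with your actual checkpoint and output directories.
  python models/demos/t3000/llama2_70b/scripts/repack_weights.py /home/llama-2-70b /home/llama-data-repacked/llama-2-70b 5
  # Copy (or move) params.json into the repacked directory so the demo can find it.
  cp /home/llama-2-70b/params.json /home/llama-data-repacked/llama-2-70b/
  ```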
After setting up the repacked weights and tokenizer, you can run the demo using the commands below:
- Prepare the weight cache directory:

  ```bash
  # Make a directory for us to cache weights into. This speeds up subsequent runs.
  mkdir <weight_cache_dir>
  ```
- Set up environment variables:

  ```bash
  export LLAMA2_CKPT_DIR=<repacked_output_dir>
  export LLAMA2_TOKENIZER_PATH=<path_to_checkpoint_dir>  # Path needs to include the tokenizer.model file
  export LLAMA2_CACHE_PATH=<weight_cache_dir>
  export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
  export TIKTOKEN_CACHE_DIR=""
  pip install -r models/demos/t3000/llama2_70b/reference/llama/requirements.txt

  # Example:
  # export LLAMA2_CKPT_DIR="/home/llama-data-repacked/llama-2-70b/"
  # export LLAMA2_TOKENIZER_PATH="/home/llama-data-repacked/tokenizer.model"
  # export LLAMA2_CACHE_PATH="/home/llama-data-cache/weights-cache"
  ```
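  As an optional sanity check (assuming the layout from the example paths above), confirm that the files the demo needs are reachable through these variables before running it:

  ```bash
  # params.json must be present in the repacked checkpoint directory,
  # and the tokenizer path must resolve to (or contain) tokenizer.model.
  ls "$LLAMA2_CKPT_DIR/params.json"
  ls "$LLAMA2_TOKENIZER_PATH"
  ```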
- Run the demo:

  NOTE: Run the following command twice.
  - The first run will cache the weights. This will take some time.
  - The second run will use the cached weights, thereby running much faster.

  ```bash
  # Run the demo using sampling decode
  pytest -svv models/demos/t3000/llama2_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-sampling-tt-70b-T3000-80L-decode_only-text_completion-llama2]
  ```
- Run the performance test:

  The above demo does not achieve peak performance because we log outputs to the screen. The following perf test will print an accurate end-to-end throughput number. For best performance numbers, we recommend building `tt-metal` with the `CONFIG=Release` env var and ensuring the host's CPU frequency governors are set to `performance`.

  ```bash
  pytest -svv models/demos/t3000/llama2_70b/tests/test_llama_perf_decode.py::test_Llama_perf_host[wormhole_b0-True-gen128-llama2]
  ```
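  A minimal sketch of that host setup, assuming a Linux host with the `cpupower` utility installed (the exact tt-metal build invocation depends on your environment):

  ```bash
  # Set before running your usual tt-metal build so it is built in Release mode.
  export CONFIG=Release
  # One common way to pin the CPU frequency governors to performance (requires root).
  sudo cpupower frequency-set -g performance
  ```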
- Batch Size: Supports batch size 32.
- Input File: Uses `./demo/data/multi_prompt.json`.
- Model Configuration: Utilizes a pretrained model.
- Hardware Requirements: Runs on an 8-chip T3000 machine using tensor parallelism. The host machine must have at least 512 GB of memory.
- Model Functionality:
  - The maximum context length is currently limited to (2k + 128) tokens. Support for 8k context is in testing.
  - The demo can run in `decode_only` mode, in which we use decode mode to consume the context one token at a time, or `prefill_decode` mode, in which we prefill the context and then decode.
- Demo arguments (example invocations follow the list):
  - `ground_truth: [check_disabled, check_enabled]`: Enable or disable ground-truth checking, used for testing
  - `sampling: [greedy, sampling]`: Select between greedy decoding and top-k/top-p sampling
  - `implementation: [tt-70b-T3000]`: Run the 70B model on the Tenstorrent backend
  - `num_layers: [1L, 2L, 10L, 80L]`: Select `80L` to run the full model
  - `decode_only: [decode_only, prefill_decode]`: Use `decode_only`. Alternately, choose `prefill_decode` to enable prefill-decode mode
  - `chat: [text_completion, chat_completion]`: Run in `text_completion` mode for the pretrained model or `chat_completion` for the finetuned model
  - `llama_version: [llama2]`: Select the Llama2 model
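  For example, other invocations can be composed by substituting these values into the test ID used in the demo command above (the chosen combination must be one that `demo.py` actually parametrizes):

  ```bash
  # Greedy decoding instead of top-k/top-p sampling:
  pytest -svv models/demos/t3000/llama2_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama2]

  # Prefill the context and then decode, instead of decode-only:
  pytest -svv models/demos/t3000/llama2_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-sampling-tt-70b-T3000-80L-prefill_decode-text_completion-llama2]
  ```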
Ensure you follow these guidelines to successfully run the Llama2-70B demo.