If you have access to TT-VPN, you can copy the weights directly to your local machine using the following SCP commands:
- Copy the repacked Llama3-70B weights:

  ```bash
  scp -r 10.230.36.208:/home/llama3-data-repacked/llama-3-70b/ <repacked_output_dir>
  ```
- Copy the Llama3-70B tokenizer:

  ```bash
  scp -r 10.230.36.208:/home/llama3-data-repacked/tokenizer.model <path_to_checkpoint_dir>
  ```
If you do not have access to TT-VPN, follow these steps to download the weights directly from Meta and use the repacking script:
- Download the Llama3-70B weights from Meta (https://llama.meta.com/).
- Repack the weights (see the conceptual sketch below):

  ```bash
  # This concatenates the sharded checkpoints and makes them easier to load.
  python models/demos/t3000/llama2_70b/scripts/repack_weights.py <path_to_checkpoint_dir> <repacked_output_dir>
  ```
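Conceptually, repacking merges Meta's tensor-parallel shard files into a single checkpoint so the model can be loaded without re-assembling shards at startup. Below is a minimal sketch of that idea, not the actual `repack_weights.py` (which knows the correct concatenation dimension for each weight and handles tensors that are replicated rather than sharded); the file layout and the dim-0 concatenation are illustrative assumptions:

```python
import torch

# Meta ships Llama3-70B as 8 tensor-parallel shards (assumed layout).
shard_paths = [f"<path_to_checkpoint_dir>/consolidated.{i:02d}.pth" for i in range(8)]
shards = [torch.load(p, map_location="cpu") for p in shard_paths]

# Concatenate each weight across shards. The real script picks dim 0 or 1
# per weight (column- vs. row-parallel) and copies replicated tensors once;
# here we illustrate with dim 0 only.
repacked = {name: torch.cat([s[name] for s in shards], dim=0) for name in shards[0]}

torch.save(repacked, "<repacked_output_dir>/repacked.pth")
```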
After setting up the repacked weights and tokenizer, you can run the demo using the commands below:
- Install the requirements for the `reference/llama` folder:

  ```bash
  pip install blobfile
  ```
- Prepare the weight cache directory:

  ```bash
  # Make a directory to cache weights into. This speeds up subsequent runs.
  mkdir <weight_cache_dir>
  ```
- Set up environment variables:

  ```bash
  export LLAMA_CKPT_DIR=<repacked_output_dir>
  export LLAMA_TOKENIZER_PATH=<path_to_checkpoint_dir>  # Path needs to include the tokenizer.model file
  export LLAMA_CACHE_PATH=<weight_cache_dir>

  # Example:
  # export LLAMA_CKPT_DIR="/home/llama3-data-repacked/llama-3-70b/"
  # export LLAMA_TOKENIZER_PATH="/home/llama3-data-repacked/tokenizer.model"
  # export LLAMA_CACHE_PATH="/home/llama3-data-cache/weights-cache"
  ```
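  Optionally, sanity-check the setup before launching anything. This is a hypothetical snippet, not part of the repo; it only verifies that each variable is set and points at an existing path:

  ```python
  import os

  # Check the three demo environment variables before running pytest.
  for var in ("LLAMA_CKPT_DIR", "LLAMA_TOKENIZER_PATH", "LLAMA_CACHE_PATH"):
      path = os.environ.get(var)
      assert path and os.path.exists(path), f"{var} is unset or missing: {path!r}"
  ```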
- Cache the weights (first-time setup only):

  ```bash
  # Build the full 80-layer model to cache the weights. This will take some time.
  pytest -svv models/demos/t3000/llama2_70b/tests/test_llama_model.py::test_LlamaModel_inference[decode-8chip-T3000-80L]
  ```
- Run the demo:

  ```bash
  # Run the demo using sampling decode.
  pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[sampling-tt-70b-80L]
  ```
- Batch Size: Supports a batch size of 32.
- Input File: Uses `./demo/data/multi_prompt.json`.
- Model Configuration: Uses a pretrained model.
- Hardware Requirements: Runs on an 8-chip T3000 machine using tensor parallelism. The host machine must have at least 512 GB of memory.
- Model Functionality: Implements a decode-to-prefill strategy, where prompts are processed token-by-token to produce KV caches, followed by token generation in decode mode (see the sketch below).
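For illustration, here is a minimal sketch of the decode-to-prefill control flow. `model.new_kv_cache`, `model.decode_step`, and `sample` are hypothetical names standing in for the demo's actual implementation; the point is the flow, not the API:

```python
def generate(model, prompt_tokens, max_new_tokens):
    kv_cache = model.new_kv_cache()

    # "Prefill": run decode mode over the prompt one token at a time,
    # filling the KV cache position by position instead of using a
    # dedicated prefill pass.
    logits = None
    for pos, tok in enumerate(prompt_tokens):
        logits = model.decode_step(tok, pos, kv_cache)

    # Generation then continues in the same decode mode.
    generated = []
    pos = len(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = sample(logits)  # e.g. the demo's sampling decode
        generated.append(tok)
        logits = model.decode_step(tok, pos, kv_cache)
        pos += 1
    return generated
```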
Ensure you follow these guidelines to successfully run the Llama3-70B demo.
For best performance, do the following:

- Set the environment variable `TT_METAL_ASYNC_DEVICE_QUEUE=1`.
- Ensure the CPU frequency governor is set to `performance`.
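On Linux, you can verify the governor setting with a short check such as the following (a hypothetical helper, not part of the repo):

```python
import glob

# Report any CPU core whose frequency governor is not "performance".
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")):
    with open(path) as f:
        governor = f.read().strip()
    if governor != "performance":
        print(f"{path}: {governor} (expected 'performance')")
```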