This codebase includes the Llama3 family of models.
The current version supports the following Llama3 models:
- Llama3.2-1B
- Llama3.2-3B
- Llama3.1-8B
- Llama3.2-11B
- Llama3.1-70B (T3000-only)
All of the above Llama3 models (with the exception of 70B, due to its large size) are compatible with and tested on the following Tenstorrent hardware:
- N150 (1-chip)
- N300 (2-chips)
- T3000 (8-chips)
Below is an updated table with the maximum prefill context length supported by our demo. These were tested in both accuracy and performance modes.
The main reason a long context length does not fit on device is lack of memory. Any exceptions are marked in the table.
|              | N150          | N300          | T3K            | TG          |
|--------------|---------------|---------------|----------------|-------------|
| Llama3.2-1B  | 64k tokens    | 64k tokens    | 64k tokens [1] | TBD         |
| Llama3.2-3B  | 32k tokens    | 64k tokens    | 64k tokens [1] | TBD         |
| Llama3.1-8B  | 16k tokens    | 64k tokens    | 128k tokens    | TBD         |
| Llama3.2-11B | 16k tokens    | 64k tokens    | 128k tokens    | TBD         |
| Llama3.1-70B | Not supported | Not supported | 64k tokens [2] | 128k tokens |
[1] For these configurations, running context lengths greater than those specified in the table will generate repetitive, low-quality output.
[2] Although longer prefill context-lengths are not supported due to model size and available memory, you can still decode (generate) tokens up to a maximum of 128k tokens.
Download the weights directly from Meta; this requires accepting their license terms.
The downloaded directories include the weight files (e.g. `consolidated.00.pth`), the tokenizer `tokenizer.model`, and the configuration file `params.json`.
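As a quick sanity check, a downloaded directory should contain the files listed above. The path and model name below are hypothetical examples; larger models such as 70B ship multiple sharded `consolidated.XX.pth` files rather than a single one.

```bash
$ ls $HOME/llama-models/Llama3.2-1B-Instruct
consolidated.00.pth  params.json  tokenizer.model
```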
Llama3.1-70B requires repacked weights. We provide a script to facilitate this in `models/demos/llama3/scripts/repack_weights_70b.py`.
The repacked output directory can be the same as the checkpoint directory, since the new files will have different names.
If providing a different path, please make sure that you keep the string `3.1-70B` in the new path name, since the Llama3 codebase relies on the weights directory name to identify the correct model.
Note: use the default value of `10` for `chunk_size`.
```bash
# This concatenates the sharded checkpoints and makes it easier for us to load.
python models/demos/llama3/scripts/repack_weights_70b.py <path_to_checkpoint_dir> <repacked_output_dir>
```
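For instance, a minimal sketch of repacking in place (the checkpoint path below is hypothetical; reusing the checkpoint directory as the output directory is allowed because the repacked files have different names):

```bash
# Hypothetical checkpoint location; keep the string "3.1-70B" in the path name.
CKPT_DIR=$HOME/llama-models/Llama3.1-70B-Instruct

# Repack in place: the sharded consolidated.*.pth checkpoints are read from
# CKPT_DIR and the repacked files are written alongside them under new names.
python models/demos/llama3/scripts/repack_weights_70b.py $CKPT_DIR $CKPT_DIR
```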
If providing a different output directory, please copy the `params.json` and `tokenizer.model` files to the new directory.
Llama3.2-11B multimodal requires extra Python dependencies. Install them with:
```bash
pip install -r models/demos/llama3/requirements.txt
```
- Set up environment variables:
```bash
export LLAMA_DIR=<meta_llama_model_dir>
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
```
- `$LLAMA_DIR` sets the path for the Llama3 model weights and caches.
- `$WH_ARCH_YAML` sets dispatch over Ethernet cores. This is optional for N150 and required for N300 and T3000; it enables full core grid utilization (8x8), allowing for maximum performance of Llama3 models.
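For example, a T3000 setup with the Llama3.1-8B instruct weights might look like this (the weights path is a hypothetical placeholder):

```bash
# Hypothetical path to the Meta download; adjust to your own location.
export LLAMA_DIR=$HOME/llama-models/Llama3.1-8B-Instruct
# Optional on N150, required on N300 and T3000.
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
```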
On the first execution of each model, TTNN will create weight cache files for that model to speed up future runs. These cache files only need to be created once for each model and each set of weights (i.e. new fine-tuned weights will need to be cached) and are stored according to the machine you are running the models on:
```bash
$LLAMA_DIR/N150  # For N150
$LLAMA_DIR/N300  # For N300
$LLAMA_DIR/T3K   # For T3000
```
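As an illustration, assuming the 8B weights and first runs on both an N150 and a T3000 machine, the weights directory would then contain the original Meta files plus the per-device caches (hypothetical listing):

```bash
$ ls -F $LLAMA_DIR
N150/  T3K/  consolidated.00.pth  params.json  tokenizer.model
```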
The Llama3 demo includes 3 main modes of operation and is fully parametrized to support other configurations:
- `batch-1`: Runs a small prompt for a single user
- `batch-32`: Runs a small prompt for a batch of 32 users
- `long-context`: Runs a large prompt (64k tokens) for a single user
If you want to provide your own demo configuration, please take a look at the pytest parametrize calls in `models/demos/llama3/demo/demo.py`. For convenience we list all the supported params below:
- `input_prompts (string)`: Input JSON file with prompts to process. See `models/demos/llama3/demo/*.json` for a list of input files.
- `instruct (bool)`: Whether to use Llama instruct weights or general weights.
- `repeat_batches (int)`: Number of consecutive batches of users to run (default: 1).
- `max_seq_len (int)`: Maximum context length supported by the model (refer to the table above).
- `batch_size (int)`: Number of users in a batch (supports 1/2/4/8/16/32 batches).
- `max_generated_tokens (int)`: Maximum number of tokens to generate for each user (note that users will stop generating before this limit if they reach an EOS token).
- `paged_attention (bool)`: Whether to use paged attention or default attention (vLLM support (WIP) requires paged attention).
- `page_params (dict)`: Page parameters for paged attention - [`block_size`, `max_num_blocks`]. For smaller context lengths use `block_size=32` and `max_num_blocks=1024`; for larger contexts use `block_size=64` and `max_num_blocks=2048`.
- `sampling_params (dict)`: Sampling parameters for decoding - [`temperature`, `top_p`]. If temperature is set to 0, argmax (greedy decode) is used.
- `optimization (LlamaOptimizations)`: Optimization level to use for the model [`performance`, `accuracy`].
Please note that when using `argmax` with `batch_size > 1`, or `top-p` sampling with any batch size, these ops will run on the host, because they are not yet fully supported on device. A decrease in performance is expected when these configurations are enabled.
When running the demo, do not forget to set the `$LLAMA_DIR` environment variable to the directory of the corresponding Llama3 model weights.
Additionally, we also support the use of a fake device. This enables running a demo for a smaller device on a larger multi-chip device.
Supported devices: [`N150`, `N300`, `T3K`, `TG`].
Example: `export FAKE_DEVICE=N150` will enable running a single-chip demo on a multi-chip system.
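For instance, to run the single-user demo as if on an N150 while on a multi-chip machine, combine `FAKE_DEVICE` with the batch-1 test shown below:

```bash
# Pretend the machine is a single N150 chip, then run the batch-1 demo on it.
export FAKE_DEVICE=N150
pytest models/demos/llama3/demo/demo.py -k "performance and batch-1"
```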
```bash
# Examples of how to run the demo for any supported Llama3 models

# Batch-1
pytest models/demos/llama3/demo/demo.py -k "performance and batch-1"

# Batch-32
pytest models/demos/llama3/demo/demo.py -k "performance and batch-32"

# Long-context
pytest models/demos/llama3/demo/demo.py -k "performance and long"
```
The above examples run in `LlamaOptimizations.performance` mode. You can override this by setting the `optimizations` argument in the demo. To use accuracy mode instead, call the above tests with `-k "accuracy and ..."` instead of `performance`.
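For instance, to run the batch-32 demo in accuracy mode:

```bash
# Batch-32 demo in accuracy mode
pytest models/demos/llama3/demo/demo.py -k "accuracy and batch-32"
```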
See PERF.md for expected performance and accuracy across different configurations.