Llama3 Models

This codebase includes the Llama3 family of models.

The current version supports the following Llama3 models:

  • Llama3.2-1B
  • Llama3.2-3B
  • Llama3.1-8B
  • Llama3.2-11B
  • Llama3.1-70B (T3000-only)

All of the above Llama models (with the exception of 70B, due to its large size) are compatible with and tested on the following Tenstorrent hardware:

  • N150 (1-chip)
  • N300 (2-chips)
  • T3000 (8-chips)

The table below shows the maximum prefill context length supported by our demo for each model and device; these were tested in both accuracy and performance modes.

The main reason a long context length does not fit on device is lack of memory. Any exceptions are marked in the table.

Model          N150            N300            T3K              TG
Llama3.2-1B    64k tokens      64k tokens      64k tokens [1]   TBD
Llama3.2-3B    32k tokens      64k tokens      64k tokens [1]   TBD
Llama3.1-8B    16k tokens      64k tokens      128k tokens      TBD
Llama3.2-11B   16k tokens      64k tokens      128k tokens      TBD
Llama3.1-70B   Not supported   Not supported   64k tokens [2]   128k tokens

[1] For these configurations, running context lengths greater than those specified in the table will produce bad, repetitive output.

[2] Although longer prefill context lengths are not supported due to model size and available memory, you can still decode (generate) up to a maximum of 128k tokens.

How to Run

Download the weights

Download the weights directly from Meta; this requires accepting their license terms.

The downloaded directories include the weight files (e.g. consolidated.00.pth), the tokenizer (tokenizer.model), and the configuration file (params.json).
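
As a quick sanity check, a single-shard download might look like the following; the directory name is a placeholder, and larger models ship multiple consolidated.*.pth shards:

# Placeholder path; substitute the directory you downloaded from Meta
ls <meta_llama_model_dir>
# consolidated.00.pth  params.json  tokenizer.model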

Llama3.1-70B only

Llama3.1-70B requires repacked weights. We provide a script to facilitate this in models/demos/llama3/scripts/repack_weights_70b.py.

The repacked output directory can be the same as the checkpoint directory, since the new files will have different names. If you provide a different path, make sure to keep the string 3.1-70B in the new path name, since the Llama3 codebase relies on the weights directory name to identify the correct model.

Note: Use the default value of 10 for chunk_size.

# This concatenates the sharded checkpoints and makes them easier to load.
python models/demos/llama3/scripts/repack_weights_70b.py <path_to_checkpoint_dir> <repacked_output_dir>

If you provide a different output directory, copy the params.json and tokenizer.model files into the new directory.
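
For example, reusing the placeholder paths from the repack command above:

# Copy the tokenizer and model config alongside the repacked weights
cp <path_to_checkpoint_dir>/params.json <repacked_output_dir>/
cp <path_to_checkpoint_dir>/tokenizer.model <repacked_output_dir>/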

Llama3.2-11B multimodal only

Llama3.2-11B multimodal requires extra Python dependencies. Install them with:

pip install -r models/demos/llama3/requirements.txt

Setup TT environment

  1. Set up environment variables:
export LLAMA_DIR=<meta_llama_model_dir>
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
  • $LLAMA_DIR sets the path for the Llama3 model weights and caches.

  • $WH_ARCH_YAML sets dispatch over ethernet cores. This is optional for N150 and required for N300 and T3000; it enables full core-grid utilization (8x8), allowing for maximum performance of Llama3 models.

On the first execution of each model, TTNN will create weight cache files for that model to speed up future runs. These cache files only need to be created once per model and per set of weights (i.e. new fine-tuned weights will need their own cache) and are stored according to the machine you are running the models on:

$LLAMA_DIR/N150  # For N150
$LLAMA_DIR/N300  # For N300
$LLAMA_DIR/T3K   # For T3000
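
As a sketch, a typical first run on an N300 might look like the following; the LLAMA_DIR path is a placeholder for wherever you downloaded the Meta weights:

# Placeholder path to the downloaded Meta weights
export LLAMA_DIR=$HOME/llama-models/Llama3.1-8B-Instruct
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml

# The first run creates the weight cache under $LLAMA_DIR/N300
pytest models/demos/llama3/demo/demo.py -k "performance and batch-1"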

Run the demo

The Llama3 demo includes 3 main modes of operation and is fully parametrized to support other configurations.

  • batch-1: Runs a small prompt for a single user
  • batch-32: Runs a small prompt for a batch of 32 users
  • long-context: Runs a large prompt (64k tokens) for a single user

If you want to provide your own demo configuration, take a look at the pytest parametrize calls in models/demos/llama3/demo/demo.py; a quick way to locate them is shown after the list below. For convenience we list all the supported params here:

  • input_prompts (string): input JSON file with prompts to process. See models/demos/llama3/demo/*.json for a list of input files
  • instruct (bool): Whether to use Llama instruct weights or general weights
  • repeat_batches (int): Number of consecutive batches of users to run (default: 1)
  • max_seq_len (int): Maximum context length supported by the model (refer to the table above)
  • batch_size (int): Number of users in a batch (supports 1/2/4/8/16/32 users)
  • max_generated_tokens (int): Maximum number of tokens to generate for each user (note that a user will stop generating before this limit upon reaching an EOS token)
  • paged_attention (bool): Whether to use paged attention or default attention (vLLM support, currently WIP, requires paged attention)
  • page_params (dict): Page parameters for paged attention - [block_size, max_num_blocks]. For smaller context lengths use block_size=32 and max_num_blocks=1024; for larger contexts use block_size=64 and max_num_blocks=2048
  • sampling_params (dict): Sampling parameters for decoding - [temperature, top_p]. If temperature is set to 0, argmax (greedy decode) is used
  • optimization (LlamaOptimizations): Optimization level to use for the model [performance, accuracy]
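
For instance, you can list the parametrize decorators (standard pytest) that define the demo configurations straight from the source:

# Show the pytest parametrize calls, with line numbers, in the demo
grep -n "pytest.mark.parametrize" models/demos/llama3/demo/demo.py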

Please note that when using argmax with batch_size > 1, or top-p sampling with any batch size, these ops will run on host, because they are not yet fully supported on device. Expect a decrease in performance when these configurations are enabled.

When running the demo, do not forget to set the $LLAMA_DIR environment variable to the directory of the corresponding Llama3 model weights.

Additionally, we support the use of a fake device. This enables running a demo for a smaller chip configuration on a larger multi-chip device. Supported devices: [N150, N300, T3K, TG].

For example, setting export FAKE_DEVICE=N150 enables running a single-chip demo on a multi-chip system.
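
For instance, on a T3000 machine you could run the single-chip batch-1 demo as follows:

# Pretend the system is a single N150 chip, then run the batch-1 demo
export FAKE_DEVICE=N150
pytest models/demos/llama3/demo/demo.py -k "performance and batch-1"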

# Examples of how to run the demo for any supported Llama3 models

# Batch-1
pytest models/demos/llama3/demo/demo.py -k "performance and batch-1"

# Batch-32
pytest models/demos/llama3/demo/demo.py -k "performance and batch-32"

# Long-context
pytest models/demos/llama3/demo/demo.py -k "performance and long"

The above examples run in LlamaOptimizations.performance mode. You can override this by setting the optimizations argument in the demo. To use accuracy mode instead, call the above tests with -k "accuracy and ..." instead of "performance and ...".
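
For example:

# Batch-1 demo in accuracy mode
pytest models/demos/llama3/demo/demo.py -k "accuracy and batch-1"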

Expected performance and accuracy

See PERF.md for expected performance and accuracy across different configurations.