Llama3 model family now supports batch-32, long context (up to 128k) and paged attention (#15327)

Co-authored-by: avoraTT <[email protected]>
Co-authored-by: Stuti Raizada <[email protected]>
Co-authored-by: kpaigwar <[email protected]>
4 people authored Dec 4, 2024
1 parent 68bc110 commit 1f7eccf
Showing 37 changed files with 1,633 additions and 873 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/t3000-demo-tests-impl.yaml
@@ -16,7 +16,7 @@ jobs:
test-group: [
{ name: "t3k_falcon40b_tests", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 50, owner_id: U053W15B6JF}, #Djordje Ivanovic
{ name: "t3k_llama3_tests", arch: wormhole_b0, cmd: run_t3000_llama3_tests, timeout: 30, owner_id: U03PUAKE719}, # Miguel Tairum
{ name: "t3k_llama3_vision_tests", arch: wormhole_b0, cmd: run_t3000_llama3_vision_tests, timeout: 30, owner_id: U03FJB5TM5Y}, #Colman Glagovich
# { name: "t3k_llama3_vision_tests", arch: wormhole_b0, cmd: run_t3000_llama3_vision_tests, timeout: 30, owner_id: U03FJB5TM5Y}, #Colman Glagovich
{ name: "t3k_llama3_70b_tests", arch: wormhole_b0, cmd: run_t3000_llama3_70b_tests, timeout: 30, owner_id: U03FJB5TM5Y}, #Colman Glagovich
{ name: "t3k_falcon7b_tests", arch: wormhole_b0, cmd: run_t3000_falcon7b_tests, timeout: 90, owner_id: U05RWH3QUPM}, #Salar Hosseini
{ name: "t3k_mixtral_tests", arch: wormhole_b0, cmd: run_t3000_mixtral_tests, timeout: 50, owner_id: U03PUAKE719}, # Miguel Tairum
4 changes: 2 additions & 2 deletions .github/workflows/t3000-frequent-tests-impl.yaml
@@ -18,8 +18,8 @@ jobs:
{ name: "t3k ethernet tests", arch: wormhole_b0, cmd: run_t3000_ethernet_tests, timeout: 60, owner_id: ULMEPM2MA}, #Sean Nijjar
{ name: "t3k trace stress tests", arch: wormhole_b0, cmd: run_t3000_trace_stress_tests, timeout: 120, owner_id: U03NG0A5ND7}, #Aditya Saigal
{ name: "t3k falcon40b tests", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 120, owner_id: U04S2UV6L8N}, #Sofija Jovic
{ name: "t3k llama3.2-vision tests", arch: wormhole_b0, cmd: run_t3000_llama3.2-11b-vision_freq_tests, timeout: 60, owner_id: U03FJB5TM5Y}, #Colman Glagovich
{ name: "t3k n300 mesh llama3.2-vision tests", arch: wormhole_b0, cmd: run_t3000_spoof_n300_llama3.2-11b-vision_freq_tests, timeout: 60, owner_id: U03FJB5TM5Y}, #Colman Glagovich
# { name: "t3k llama3.2-vision tests", arch: wormhole_b0, cmd: run_t3000_llama3.2-11b-vision_freq_tests, timeout: 60, owner_id: U03FJB5TM5Y}, #Colman Glagovich
# { name: "t3k n300 mesh llama3.2-vision tests", arch: wormhole_b0, cmd: run_t3000_spoof_n300_llama3.2-11b-vision_freq_tests, timeout: 60, owner_id: U03FJB5TM5Y}, #Colman Glagovich
{ name: "t3k llama3 tests", arch: wormhole_b0, cmd: run_t3000_llama3_tests, timeout: 45, owner_id: U03PUAKE719}, #Miguel Tairum Cruz
{ name: "t3k llama2_70b tests", arch: wormhole_b0, cmd: run_t3000_llama2_70b_tests, timeout: 45, owner_id: U03FJB5TM5Y}, #Colman Glagovich
# { name: "t3k llama3_70b tests", arch: wormhole_b0, cmd: run_t3000_llama3_70b_tests, timeout: 45, owner_id: U03FJB5TM5Y}, #Colman Glagovich # FIXME issue #14934
2 changes: 0 additions & 2 deletions .github/workflows/t3000-model-perf-tests-impl.yaml
@@ -18,8 +18,6 @@ jobs:
{ name: "t3k LLM falcon7b model perf tests", model: "falcon7b", model-type: "LLM", arch: wormhole_b0, cmd: run_t3000_falcon7b_tests, timeout: 75, owner_id: U05RWH3QUPM}, # Salar Hosseini
{ name: "t3k LLM mixtral model perf tests", model: "mixtral", model-type: "LLM", arch: wormhole_b0, cmd: run_t3000_mixtral_tests, timeout: 75, owner_id: U03PUAKE719}, # Miguel Tairum
{ name: "t3k LLM llama2-70B model perf tests", model: "llama2-70b", model-type: "LLM", arch: wormhole_b0, cmd: run_t3000_llama2_70b_tests, timeout: 75, owner_id: U03FJB5TM5Y}, # Colman Glagovich
{ name: "t3k LLM llama3-70B model perf tests", model: "llama3-70b", model-type: "LLM", arch: wormhole_b0, cmd: run_t3000_llama3_70b_tests, timeout: 60, owner_id: U03FJB5TM5Y}, # Colman Glagovich
{ name: "t3k LLM llama3 model perf tests", model: "llama3", model-type: "LLM", arch: wormhole_b0, cmd: run_t3000_llama3_tests, timeout: 60, owner_id: U03PUAKE719}, # Miguel Tairum
{ name: "t3k LLM falcon40b model perf tests", model: "falcon40b", model-type: "LLM", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 75, owner_id: U053W15B6JF}, # Djordje Ivanovic
{ name: "t3k CNN resnet50 model perf tests", model: "resnet50", model-type: "CNN", arch: wormhole_b0, cmd: run_t3000_resnet50_tests, timeout: 75, owner_id: U013121KDH9}, # Austin Ho
{ name: "t3k CCL perf tests", arch: wormhole_b0, cmd: run_t3000_ccl_all_gather_perf_tests && run_t3000_ccl_reduce_scatter_perf_tests, timeout: 75, tracy: true, owner_id: ULMEPM2MA}, # Sean Nijjar
4 changes: 2 additions & 2 deletions .github/workflows/t3000-unit-tests-impl.yaml
@@ -20,8 +20,8 @@ jobs:
{ name: "t3k falcon40b tests", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 30, owner_id: U053W15B6JF}, #Djordje Ivanovic
{ name: "t3k llama3-small tests", arch: wormhole_b0, cmd: run_t3000_llama3-small_tests, timeout: 30, owner_id: U03PUAKE719}, #Miguel Tairum Cruz
{ name: "t3k llama3.2-11b tests", arch: wormhole_b0, cmd: run_t3000_llama3.2-11b_tests, timeout: 30, owner_id: U03PUAKE719}, #Miguel Tairum Cruz
{ name: "t3k llama3.2-11b-vision tests", arch: wormhole_b0, cmd: run_t3000_llama3.2-11b-vision_unit_tests, timeout: 30, owner_id: U03FJB5TM5Y}, #Colman Glagovich
{ name: "t3k n300 mesh llama3.2-11b-vision tests", arch: wormhole_b0, cmd: run_t3000_spoof_n300_llama3.2-11b-vision_unit_tests, timeout: 30, owner_id: U03FJB5TM5Y}, #Colman Glagovich
# { name: "t3k llama3.2-11b-vision tests", arch: wormhole_b0, cmd: run_t3000_llama3.2-11b-vision_unit_tests, timeout: 30, owner_id: U03FJB5TM5Y}, #Colman Glagovich
# { name: "t3k n300 mesh llama3.2-11b-vision tests", arch: wormhole_b0, cmd: run_t3000_spoof_n300_llama3.2-11b-vision_unit_tests, timeout: 30, owner_id: U03FJB5TM5Y}, #Colman Glagovich
{ name: "t3k llama3.1-70b tests", arch: wormhole_b0, cmd: run_t3000_llama3.1-70b_tests, timeout: 30, owner_id: U03PUAKE719}, #Miguel Tairum Cruz
{ name: "t3k mixtral tests", arch: wormhole_b0, cmd: run_t3000_mixtral_tests, timeout: 30, owner_id: U03PUAKE719}, #Miguel Tairum Cruz
{ name: "t3k grok tests", arch: wormhole_b0, cmd: run_t3000_grok_tests, timeout: 30, owner_id: U03HY7MK4BT}, #Mark O'Connor
61 changes: 49 additions & 12 deletions models/demos/llama3/README.md
@@ -14,6 +14,23 @@ All the above llama models (with the exception of 70B due to its large size) are
- N300 (2-chips)
- T3000 (8-chips)

Below is an updated table with the maximum prefill context length supported by our demo for each model and device. These were tested in both accuracy and performance modes.

The main reason a long context length may not fit on device is insufficient memory. Any exceptions are marked in the table.

|               | N150          | N300           | T3K            | TG          |
|---------------|---------------|----------------|----------------|-------------|
| Llama3.2-1B   | 64k tokens    | 64k tokens     | 64k tokens [1] | TBD         |
| Llama3.2-3B   | 32k tokens    | 32k tokens [1] | 64k tokens [1] | TBD         |
| Llama3.1-8B   | 16k tokens    | 64k tokens     | 128k tokens    | TBD         |
| Llama3.2-11B  | 16k tokens    | 64k tokens     | 128k tokens    | TBD         |
| Llama3.1-70B  | Not supported | Not supported  | 32k tokens [2] | 128k tokens |

[1] For these configurations, running context lengths greater than those specified in the table will produce degraded, repetitive output.

[2] Although longer prefill context lengths are not supported due to model size and available memory, you can still decode (generate) up to a maximum of 128k tokens.
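As a back-of-the-envelope illustration of why memory is the limiting factor, consider the KV cache alone. The sketch below is ours, not part of the demo; it assumes the published Llama3.1-8B architecture (32 layers, 8 KV heads, head dim 128) and bf16 caches, and ignores on-device layout and padding.

```
# Rough KV-cache size estimate (illustrative only; assumes bf16 caches and
# the published Llama3.1-8B config, not the exact on-device layout).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 accounts for the separate K and V caches.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

print(kv_cache_bytes(128 * 1024) / 2**30)  # ~16 GiB for a single 128k-token user
```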


## How to Run

### Download the weights
@@ -67,30 +84,50 @@ $LLAMA_DIR/T3K # For T3000

### Run the demo

-The current demo is setup for a single user (batch=1) that loads a prompt file (around 128 tokens), prefills the encoded prompt and then runs decode for 120 iterations.
The Llama3 demo includes 3 main modes of operation and is fully parametrized to support other configurations.

- `batch-1`: Runs a small prompt for a single user
- `batch-32`: Runs a small prompt for a batch of 32 users
- `long-context`: Runs a large prompt (64k tokens) for a single user

-The demo is also parametrized to run for 1 or 3 continuous batch of users, i.e. to simulate multiple users generating text one after another.
If you want to provide your own demo configuration, take a look at the pytest parametrize calls in `models/demos/llama3/demo/demo.py`. For convenience, we list all the supported params below:

-The input prompts are based on the general or instruct (fine-tuned) weights. The prompts are included in the demo folder `models/demos/llama3/demo`.
- `input_prompts (string)`: Input JSON file with prompts to process. See `models/demos/llama3/demo/*.json` for the list of input files
- `instruct (bool)`: Whether to use Llama instruct weights or general weights
- `repeat_batches (int)`: Number of consecutive batches of users to run (default: 1)
- `max_seq_len (int)`: Maximum context length supported by the model (refer to the table above)
- `batch_size (int)`: Number of users in a batch (supports batch sizes 1/2/4/8/16/32)
- `max_generated_tokens (int)`: Maximum number of tokens to generate for each user (note that users will stop generating before this limit if they reach an EOS token)
- `paged_attention (bool)`: Whether to use paged attention or default attention (vLLM support (WIP) requires paged attention)
- `page_params (dict)`: Page parameters for paged attention - [`block_size`, `max_num_blocks`]. For smaller context lengths use `block_size=32` and `max_num_blocks=1024`; for larger contexts use `block_size=64` and `max_num_blocks=2048` (see the sizing sketch after this list)
- `sampling_params (dict)`: Sampling parameters for decoding - [`temperature`, `top_p`]. If temperature is set to 0, argmax (greedy decode) is used.
- `optimization (LlamaOptimizations)`: Optimization level to use for the model [`performance`, `accuracy`]
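As a rough sizing guide for `page_params`, the page table provides `block_size * max_num_blocks` KV slots in total, which must cover the tokens you intend to keep in the cache. A minimal sanity check, using the values suggested above (the helper name is our own, not the demo's API):

```
# Illustrative page_params sizing check (our helper, not the demo's API).
def max_cached_tokens(block_size, max_num_blocks):
    # Total KV slots available across the whole page table.
    return block_size * max_num_blocks

assert max_cached_tokens(32, 1024) == 32 * 1024    # suggested smaller-context setting
assert max_cached_tokens(64, 2048) == 128 * 1024   # suggested larger-context setting
```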

Please note that when using `argmax` with `batch_size > 1`, or `top-p` sampling with any batch size, these ops will run on host. This is because they are not yet fully supported on device. A decrease in performance is expected when these configurations are enabled.
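For reference, host-side sampling amounts to the following. This is a generic sketch of greedy and top-p (nucleus) decoding, not the demo's implementation:

```
import numpy as np

def sample_host(logits, temperature, top_p):
    # Generic greedy / nucleus sampling sketch (not the demo's implementation).
    if temperature == 0:
        return int(np.argmax(logits))                # greedy (argmax) decode
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))          # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens by descending probability
    keep = np.cumsum(probs[order]) <= top_p          # smallest set with mass <= top_p
    keep[0] = True                                   # always keep the most likely token
    kept = order[keep]
    kept_probs = probs[kept] / probs[kept].sum()     # renormalize over the nucleus
    return int(np.random.choice(kept, p=kept_probs))
```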

When running the demo, do not forget to set the `$LLAMA_DIR` environment variable to point to the corresponding Llama3 model weights.

Additionally, we support the use of a fake device, which enables running a smaller-chip demo on a larger multi-chip device.
Supported devices: [`N150`, `N300`, `T3K`, `TG`].

Example: `export FAKE_DEVICE=N150` enables running the single-chip demo on a multi-chip system.

```
# Examples of how to run the demo for any supported Llama3 models
-# Run a single continuous batch with instruct weights
-pytest models/demos/llama3/demo/demo.py -k 'instruct and 1_batch'
# Batch-1
pytest models/demos/llama3/demo/demo.py -k "performance and batch-1"
-# Run 2 continuous batches with general weights
-pytest models/demos/llama3/demo/demo.py -k 'general and 2_batch'
# Batch-32
pytest models/demos/llama3/demo/demo.py -k "performance and batch-32"
# Long-context
pytest models/demos/llama3/demo/demo.py -k "performance and long"
```

-By default we run the models in `LlamaOptimizations.performance` mode. You can override this by setting the `optimizations` argument in the demo. To compare the two on a long prompt, you can run:
The above examples run in `LlamaOptimizations.performance` mode.
You can override this by setting the `optimizations` argument in the demo. To use accuracy mode instead, run the above tests with `-k "accuracy and ..."` in place of performance.

-```
-pytest models/demos/llama3/demo/demo.py -k 'long-performance'
-pytest models/demos/llama3/demo/demo.py -k 'long-accuracy'
-```
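For example, following the pattern above, the batch-1 demo in accuracy mode becomes:

```
pytest models/demos/llama3/demo/demo.py -k "accuracy and batch-1"
```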

### Expected performance and accuracy

