- Download the weights from Hugging Face.
- Repack the weights using the provided script, models/demos/t3000/mixtral8x7b/scripts/repack_weights.py. The script expects the consolidated weights from Hugging Face to be inside <path_to_checkpoint_dir>.
# This separates the 8 experts to facilitate sending them to multiple devices.
python models/demos/t3000/mixtral8x7b/scripts/repack_weights.py <path_to_checkpoint_dir> <repacked_output_dir>
- Set the environment variables that enable the async device queue and dispatch over Ethernet cores:
export TT_METAL_ASYNC_DEVICE_QUEUE=1
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
- Prepare the weight cache directory:
# Make a directory for ttnn to cache weights into. This speeds up subsequent runs.
mkdir <weights_cache_dir>
- Set up environment variables:
export MIXTRAL_CKPT_DIR=<repacked_output_dir>
export MIXTRAL_TOKENIZER_PATH=<path_to_tokenizer_dir>
export MIXTRAL_CACHE_PATH=<weights_cache_dir>
Note that the cached weights folder structure will contain the general and instruct cached weights in separate directories, like so:
<weights_cache_dir>
/mixtral_tensor_cache_bfp8
/mixtral_tensor_cache_instruct_bfp8
...
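Before moving on, it can help to confirm that the three variables exported above are set and point at real directories. The snippet below is a minimal sketch of such a check; `check_mixtral_env` is a hypothetical helper and is not part of the repository.

```shell
# Hypothetical helper (not part of the repository): verify that the three
# MIXTRAL_* variables are set and that each points at an existing directory.
check_mixtral_env() {
    rc=0
    for var in MIXTRAL_CKPT_DIR MIXTRAL_TOKENIZER_PATH MIXTRAL_CACHE_PATH; do
        # Indirect lookup of the variable whose name is stored in $var
        # (written with eval so it also works in plain POSIX sh).
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set" >&2
            rc=1
        elif [ ! -d "$value" ]; then
            echo "$var points to a missing directory: $value" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Calling `check_mixtral_env` after the exports prints any problems to stderr and returns a non-zero status if at least one variable is unset or its directory does not exist.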
- Cache the weights (first-time setup):
# Build a full 32 layer model to cache the weights. This will take some time.
pytest -svv models/demos/t3000/mixtral8x7b/tests/test_mixtral_model.py::test_mixtral_model_inference[wormhole_b0-True-1-32-output]
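Because the caching run takes a while, you may want to check whether a cache directory has already been populated before starting it. The following is a rough sketch: `has_cached_weights` is a hypothetical helper, the directory names are taken from the layout listed earlier, and the check only tests that the directory is non-empty. It does not verify that the cache is complete.

```shell
# Hypothetical helper (not part of the repository): succeed if the cache
# subdirectory for the given weight type exists and contains at least one entry.
# Usage: has_cached_weights <cache_root> <general|instruct>
has_cached_weights() {
    cache_root=$1
    case $2 in
        general)  subdir=mixtral_tensor_cache_bfp8 ;;
        instruct) subdir=mixtral_tensor_cache_instruct_bfp8 ;;
        *) echo "unknown weight type: $2" >&2; return 2 ;;
    esac
    # ls -A prints nothing for an empty directory.
    [ -d "$cache_root/$subdir" ] && [ -n "$(ls -A "$cache_root/$subdir")" ]
}

# Example call, assuming MIXTRAL_CACHE_PATH was exported as described earlier:
#   has_cached_weights "$MIXTRAL_CACHE_PATH" general && echo "general cache found"
```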
Mixtral prefill support is now available. We include two demos: a decode-only mode, where the prompts are decoded token by token and force-pushed until the user starts generating; and a prefill-and-decode mode, where the KV caches are first prefilled for the prompt length (e.g. 128 tokens) and decoding then proceeds as normal.
# Run the demo with a pre-written batch of 32 user prompts
# Prefill & Decode demo
pytest -svv models/demos/t3000/mixtral8x7b/demo/demo_with_prefill.py::test_mixtral8x7b_demo[wormhole_b0-True-general_weights]
# Decode-only demo
pytest -svv models/demos/t3000/mixtral8x7b/demo/demo.py::test_mixtral8x7b_demo[wormhole_b0-True-general_weights]
We also provide an input file with 32 user question prompts for the instruct weights (don't forget to update your flags to select the correct weights!):
# Prefill & Decode demo
pytest -svv models/demos/t3000/mixtral8x7b/demo/demo_with_prefill.py::test_mixtral8x7b_demo[wormhole_b0-True-instruct_weights]
# Decode-only demo
pytest -svv models/demos/t3000/mixtral8x7b/demo/demo.py::test_mixtral8x7b_demo[wormhole_b0-True-instruct_weights]
Both input files are provided inside models/demos/t3000/mixtral8x7b/demo/.
If you wish to run the model using a different set of input prompts, you can provide a different path to pytest inside the demo code. Keep in mind that in instruct mode the prompts are automatically prefixed and suffixed with [INST] and [/INST], respectively, so there's no need to add them to your file.
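To illustrate what that automatic wrapping amounts to, here is a small sketch. `wrap_instruct_prompt` is a hypothetical function written only for this example, and the single space on each side of the prompt is an assumption; the demo code may format the tags differently.

```shell
# Hypothetical illustration of the instruct-mode wrapping described above.
# The spacing between the tags and the prompt is an assumption.
wrap_instruct_prompt() {
    printf '[INST] %s [/INST]\n' "$1"
}

# Example: a raw prompt as it would appear in your input file.
wrap_instruct_prompt "What is the capital of France?"
# prints: [INST] What is the capital of France? [/INST]
```

In other words, the prompts in your input file should contain only the raw question text.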