Showing 46 changed files with 6,121 additions and 0 deletions.
# Intel® Extension for PyTorch\* Large Language Model (LLM) Feature Get Started For Llama 3.1 models

Intel® Extension for PyTorch\* provides dedicated optimizations for running Llama 3.1 models faster, including technical points like paged attention, ROPE fusion, etc. A set of data types is supported for various scenarios, including BF16, Weight-Only Quantization INT8/INT4 (prototype), etc.
# 1. Environment Setup

Two environment setup methods are provided. You can choose either of them according to your usage scenario; the Docker-based one is recommended.
## 1.1 [RECOMMENDED] Docker-based environment setup with pre-built wheels

```bash
# Get the Intel® Extension for PyTorch\* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout 2.4-llama-3
git submodule sync
git submodule update --init --recursive

# Build an image with the provided Dockerfile, installing Intel® Extension for PyTorch\* from the prebuilt wheel files
DOCKER_BUILDKIT=1 docker build -f examples/cpu/inference/python/llm/Dockerfile -t ipex-llm:2.4.0 .

# Run the container with the command below
docker run --rm -it --privileged ipex-llm:2.4.0 bash

# When the command prompt shows up inside the docker container, enter the llm examples directory
cd llm

# Activate environment variables
source ./tools/env_activate.sh
```
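Optionally, if Llama 3.1 weights are already downloaded on the host, you may want to bind-mount the Hugging Face cache into the container so they are not downloaded again. This is a minimal sketch, not part of the original steps; it assumes the container user is root, so the default cache path inside the container would be `/root/.cache/huggingface`. Adjust both paths for your setup.

```bash
# Optional: reuse the host's Hugging Face cache inside the container
# (assumes the container runs as root, so its cache path is /root/.cache/huggingface)
docker run --rm -it --privileged \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  ipex-llm:2.4.0 bash
```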
## 1.2 Conda-based environment setup with pre-built wheels

```bash
# Get the Intel® Extension for PyTorch\* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout 2.4-llama-3
git submodule sync
git submodule update --init --recursive

# Create a conda environment (pre-built wheels are only available with python=3.10)
conda create -n llm python=3.10 -y
conda activate llm

# Set up the environment with the provided script
# A sample "prompt.json" file for benchmarking is also downloaded
cd examples/cpu/inference/python/llm
bash ./tools/env_setup.sh 7

# Activate environment variables
source ./tools/env_activate.sh
```
<br>
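As a quick sanity check (not part of the original setup steps), you can verify that the wheels import correctly after running the setup script; the exact version strings printed will depend on your environment.

```bash
# Verify that PyTorch and Intel® Extension for PyTorch\* import correctly
python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"
```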
# 2. How To Run Llama 3.1 with ipex.llm

**ipex.llm provides a single script to facilitate running generation tasks as below:**

```bash
# if you are using a docker container built from commands above in Sec. 1.1, the placeholder LLM_DIR below is ~/llm
# if you are using a conda env created with commands above in Sec. 1.2, the placeholder LLM_DIR below is intel-extension-for-pytorch/examples/cpu/inference/python/llm
cd <LLM_DIR>
python run.py --help # for more detailed usages
```
| Key args of run.py | Notes |
|---|---|
| model id | "--model-name-or-path" or "-m" to specify the <LLAMA3_MODEL_ID_OR_LOCAL_PATH>, which is the model id from Hugging Face or a downloaded local path |
| generation | default: beam search (beam size = 4), "--greedy" for greedy search |
| input tokens | provide a fixed input prompt size with "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192, 32768, 130944]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs |
| output tokens | default: 32, use "--max-new-tokens" to choose any other size |
| batch size | default: 1, use "--batch-size" to choose any other size |
| token latency | enable "--token-latency" to print out the first and next token latencies |
| generation iterations | use "--num-iter" and "--num-warmup" to control the repeated iterations of generation, default: 100-iter/10-warmup |
| streaming mode output | greedy search only (works with "--greedy"), use "--streaming" to enable streaming generation output |

A combined example that exercises several of these arguments is shown below.
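For illustration only, here is one hypothetical invocation that combines several of the arguments documented above; the model id and token sizes are placeholders, not recommendations.

```bash
# Greedy search, fixed 1024-token prompt, 128 new tokens, batch size 1,
# with per-token latency printed and streaming output enabled
python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex \
  --greedy --input-tokens 1024 --max-new-tokens 128 --batch-size 1 \
  --token-latency --streaming
```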
*Note:* You may need to log in to your Hugging Face account to access the model files. Please refer to [HuggingFace login](https://huggingface.co/docs/huggingface_hub/quick-start#login).
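For example, one common way to log in, assuming the `huggingface_hub` CLI is available in your environment, is:

```bash
# Log in with a Hugging Face access token so gated Llama 3.1 weights can be downloaded
huggingface-cli login
```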
## 2.1 Usage of running Llama 3.1 models

The _\<LLAMA3_MODEL_ID_OR_LOCAL_PATH\>_ in the commands below specifies the Llama 3.1 model you will run, which can be found on [HuggingFace Models](https://huggingface.co/models).
### 2.1.1 Run generation with multiple instances on multiple CPU NUMA nodes

#### 2.1.1.1 Prepare:

```bash
unset KMP_AFFINITY
```
In the DeepSpeed cases below, we recommend "--shard-model" to shard the model weights more evenly for better memory usage when running with DeepSpeed.

If "--shard-model" is used, a copy of the sharded model weight files is saved under the "--output-dir" path (the default path is "./saved_results" if not provided).
If you have already used "--shard-model" and generated such a sharded model path (or your model weight files are already well sharded), then in further repeated benchmarks please remove "--shard-model" and replace "-m <LLAMA3_MODEL_ID_OR_LOCAL_PATH>" with "-m <shard model path>" to skip the repeated sharding step, as shown in the sketch below.

Besides, standalone shard-model scripts are also provided in section 2.1.1.4, in case you would like to generate the sharded model weight files in advance before running distributed inference.
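As an illustration of that rerun pattern (the flag combination is taken from the surrounding sections, and <SHARD_MODEL_PATH> is a placeholder for wherever the sharded copy was saved):

```bash
# First run: shard the weights; a copy is saved under the "--output-dir" path
deepspeed --bind_cores_to_rank run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> \
  --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model

# Later runs: point "-m" at the sharded copy and drop "--shard-model"
deepspeed --bind_cores_to_rank run.py --benchmark -m <SHARD_MODEL_PATH> \
  --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH> --autotp
```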
#### 2.1.1.2 BF16:

- Command:
```bash
deepspeed --bind_cores_to_rank run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model
```
#### 2.1.1.3 Weight-only quantization (INT8):

By default, for weight-only quantization, we use quantization with [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) inference ("--quant-with-amp") to get peak performance and fair accuracy.
For weight-only quantization with DeepSpeed, we quantize the model and then run the benchmark. The quantized model won't be saved.

- Command:
```bash
deepspeed --bind_cores_to_rank run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model --output-dir "saved_results"
# Note: you can add "--group-size" to tune for better accuracy; suggested values are one of [32, 64, 128, 256, 512].
```
#### 2.1.1.4 How to Shard Model weight files for Distributed Inference with DeepSpeed

To save memory usage, we can shard the model weight files under a local path before launching distributed tests with DeepSpeed.

```bash
cd ./utils
# general command:
python create_shard_model.py -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --save-path ./local_llama3_model_shard
# After sharding the model, use "-m ./local_llama3_model_shard" in later tests
```
### 2.1.2 Run generation with single-socket inference
#### 2.1.2.1 BF16:

- Command:

```bash
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH>
```
#### 2.1.2.2 Weight-only quantization (INT8):

By default, for weight-only quantization, we use quantization with [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) inference ("--quant-with-amp") to get peak performance and fair accuracy.

- Command:

```bash
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results" --greedy --input-tokens <INPUT_LENGTH>
# Note: you can add "--group-size" to tune for better accuracy; suggested values are one of [32, 64, 128, 256, 512].
```
#### 2.1.2.3 Weight-only quantization (INT4):
You can use auto-round (part of INC) to generate an INT4 WOQ model with the following steps.
- Environment installation:
```bash
pip install git+https://github.com/intel/auto-round.git@e24b9074af6cdb099e31c92eb81b7f5e9a4a244e
git clone https://github.com/intel/auto-round.git
cd auto-round
git checkout e24b9074af6cdb099e31c92eb81b7f5e9a4a244e
cd examples/language-modeling
```
- Command (quantize):
```bash
# $model_name is the Llama 3.1 model id or local path (i.e., <LLAMA3_MODEL_ID_OR_LOCAL_PATH>)
python3 main.py --model_name $model_name --device cpu --sym --nsamples 512 --iters 1000 --group_size 32 --deployment_device cpu --disable_eval --output_dir <INT4_MODEL_SAVE_PATH>
```
- Command (benchmark):
```bash
cd <LLM_DIR>
IPEX_WOQ_GEMM_LOOP_SCHEME=ACB OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT4 --quant-with-amp --output-dir "saved_results" --greedy --input-tokens <INPUT_LENGTH> --cache-weight-for-large-batch --low-precision-checkpoint <INT4_MODEL_SAVE_PATH>
```
#### 2.1.2.4 Notes:

(1) [_numactl_](https://linux.die.net/man/8/numactl) is used to specify the memory and cores of your hardware to get better performance. _\<node N\>_ specifies the [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access) node id (e.g., 0 to use the memory from the first numa node). _\<physical cores list\>_ specifies the physical cores which you are using from the _\<node N\>_ numa node. You can use the [_lscpu_](https://man7.org/linux/man-pages/man1/lscpu.1.html) command in Linux to check the numa node information, as in the sketch below.
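For example, you could inspect the topology and then pin a run to one node; the core range below is hypothetical and depends on your machine.

```bash
# Show the NUMA node count and the CPU list per node
lscpu | grep -i numa

# Hypothetical example: suppose node 0 holds physical cores 0-55 on this machine
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH>
```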
(2) For all quantization benchmarks, both the quantization and inference stages are triggered by default. The quantization stage auto-generates a quantized model named "best_model.pt" under the "--output-dir" path, and the inference stage then launches inference with that quantized model "best_model.pt". For inference-only benchmarks (to avoid repeating the quantization stage), you can reuse these quantized models by adding "--quantized-model-path <output_dir + "best_model.pt">", as in the example below.
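As a sketch of that reuse pattern (the path simply joins the earlier "--output-dir" with "best_model.pt", and the other flags mirror section 2.1.2.2):

```bash
# Inference-only rerun that reuses the previously quantized model
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens <INPUT_LENGTH> --quantized-model-path "./saved_results/best_model.pt"
```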
## Miscellaneous Tips
Intel® Extension for PyTorch\* also provides dedicated optimizations for many other Large Language Models (LLMs), covering a set of data types supported for various scenarios. For more details, please check this [Intel® Extension for PyTorch\* doc](https://github.com/intel/intel-extension-for-pytorch/blob/release/2.3/README.md).
llm/llama3_1/cpu/_static/_sphinx_javascript_frameworks_compat.js (123 additions, 0 deletions)
/* Compatability shim for jQuery and underscores.js.
 *
 * Copyright Sphinx contributors
 * Released under the two clause BSD licence
 */

/**
 * small helper function to urldecode strings
 *
 * See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/decodeURIComponent#Decoding_query_parameters_from_a_URL
 */
jQuery.urldecode = function(x) {
  if (!x) {
    return x
  }
  return decodeURIComponent(x.replace(/\+/g, ' '));
};

/**
 * small helper function to urlencode strings
 */
jQuery.urlencode = encodeURIComponent;

/**
 * This function returns the parsed url parameters of the
 * current request. Multiple values per key are supported,
 * it will always return arrays of strings for the value parts.
 */
jQuery.getQueryParameters = function(s) {
  if (typeof s === 'undefined')
    s = document.location.search;
  var parts = s.substr(s.indexOf('?') + 1).split('&');
  var result = {};
  for (var i = 0; i < parts.length; i++) {
    var tmp = parts[i].split('=', 2);
    var key = jQuery.urldecode(tmp[0]);
    var value = jQuery.urldecode(tmp[1]);
    if (key in result)
      result[key].push(value);
    else
      result[key] = [value];
  }
  return result;
};

/**
 * highlight a given string on a jquery object by wrapping it in
 * span elements with the given class name.
 */
jQuery.fn.highlightText = function(text, className) {
  function highlight(node, addItems) {
    if (node.nodeType === 3) {
      var val = node.nodeValue;
      var pos = val.toLowerCase().indexOf(text);
      if (pos >= 0 &&
          !jQuery(node.parentNode).hasClass(className) &&
          !jQuery(node.parentNode).hasClass("nohighlight")) {
        var span;
        var isInSVG = jQuery(node).closest("body, svg, foreignObject").is("svg");
        if (isInSVG) {
          span = document.createElementNS("http://www.w3.org/2000/svg", "tspan");
        } else {
          span = document.createElement("span");
          span.className = className;
        }
        span.appendChild(document.createTextNode(val.substr(pos, text.length)));
        node.parentNode.insertBefore(span, node.parentNode.insertBefore(
          document.createTextNode(val.substr(pos + text.length)),
          node.nextSibling));
        node.nodeValue = val.substr(0, pos);
        if (isInSVG) {
          var rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");
          var bbox = node.parentElement.getBBox();
          rect.x.baseVal.value = bbox.x;
          rect.y.baseVal.value = bbox.y;
          rect.width.baseVal.value = bbox.width;
          rect.height.baseVal.value = bbox.height;
          rect.setAttribute('class', className);
          addItems.push({
            "parent": node.parentNode,
            "target": rect});
        }
      }
    }
    else if (!jQuery(node).is("button, select, textarea")) {
      jQuery.each(node.childNodes, function() {
        highlight(this, addItems);
      });
    }
  }
  var addItems = [];
  var result = this.each(function() {
    highlight(this, addItems);
  });
  for (var i = 0; i < addItems.length; ++i) {
    jQuery(addItems[i].parent).before(addItems[i].target);
  }
  return result;
};

/*
 * backward compatibility for jQuery.browser
 * This will be supported until firefox bug is fixed.
 */
if (!jQuery.browser) {
  jQuery.uaMatch = function(ua) {
    ua = ua.toLowerCase();

    var match = /(chrome)[ \/]([\w.]+)/.exec(ua) ||
      /(webkit)[ \/]([\w.]+)/.exec(ua) ||
      /(opera)(?:.*version|)[ \/]([\w.]+)/.exec(ua) ||
      /(msie) ([\w.]+)/.exec(ua) ||
      ua.indexOf("compatible") < 0 && /(mozilla)(?:.*? rv:([\w.]+)|)/.exec(ua) ||
      [];

    return {
      browser: match[ 1 ] || "",
      version: match[ 2 ] || "0"
    };
  };
  jQuery.browser = {};
  jQuery.browser[jQuery.uaMatch(navigator.userAgent).browser] = true;
}