docs for Llama 3.1 release (#3122)

intel · Jul 23, 2024 · b807105
1 parent dba3241
commit b807105
Showing 46 changed files with 6,121 additions and 0 deletions.
diff --git a/llm/llama3_1/cpu/_sources/index.md.txt b/llm/llama3_1/cpu/_sources/index.md.txt
@@ -0,0 +1,177 @@
+# Intel® Extension for PyTorch\* Large Language Model (LLM) Feature Get Started For Llama 3.1 models
+
+Intel® Extension for PyTorch\* provides dedicated optimizations that speed up Llama 3.1 models, including paged attention and ROPE fusion. It supports a set of data types for various scenarios, including BF16 and weight-only quantization with INT8/INT4 (prototype).
+
+# 1. Environment Setup
+
+Several environment setup methods are provided. Choose one according to your usage scenario. The Docker-based setup is recommended.
+
+## 1.1 [RECOMMENDED] Docker-based environment setup with pre-built wheels
+
+```bash
+# Get the Intel® Extension for PyTorch\* source code
+git clone https://github.com/intel/intel-extension-for-pytorch.git
+cd intel-extension-for-pytorch
+git checkout 2.4-llama-3
+git submodule sync
+git submodule update --init --recursive
+
+# Build an image with the provided Dockerfile by installing from Intel® Extension for PyTorch\* prebuilt wheel files
+DOCKER_BUILDKIT=1 docker build -f examples/cpu/inference/python/llm/Dockerfile -t ipex-llm:2.4.0 .
+
+# Run the container with the command below
+docker run --rm -it --privileged ipex-llm:2.4.0 bash
+
+# When the command prompt appears inside the Docker container, enter the llm examples directory
+cd llm
+
+# Activate environment variables
+source ./tools/env_activate.sh
+```
+
+## 1.2 Conda-based environment setup with pre-built wheels
+
+```bash
+# Get the Intel® Extension for PyTorch\* source code
+git clone https://github.com/intel/intel-extension-for-pytorch.git
+cd intel-extension-for-pytorch
+git checkout 2.4-llama-3
+git submodule sync
+git submodule update --init --recursive
+
+# Create a conda environment (pre-built wheel only available with python=3.10)
+conda create -n llm python=3.10 -y
+conda activate llm
+
+# Setup the environment with the provided script
+# A sample "prompt.json" file for benchmarking is also downloaded
+cd examples/cpu/inference/python/llm
+bash ./tools/env_setup.sh 7
+
+# Activate environment variables
+source ./tools/env_activate.sh
+```
+<br>
+
+# 2. How To Run Llama 3.1 with ipex.llm
+
+**ipex.llm provides a single script to facilitate running generation tasks as below:**
+
+```bash
+# If you are using the Docker container built with the commands in Sec. 1.1, the placeholder LLM_DIR below is ~/llm
+# If you are using the conda env created with the commands in Sec. 1.2, the placeholder LLM_DIR below is intel-extension-for-pytorch/examples/cpu/inference/python/llm
+cd <LLM_DIR>
+python run.py --help # for more detailed usage
+```
+
+| Key args of run.py | Notes |
+|---|---|
+| model id | use "--model-name-or-path" or "-m" to specify the <LLAMA3_MODEL_ID_OR_LOCAL_PATH>, which is either a model id from HuggingFace or a local path to downloaded weights |
+| generation | default: beam search (beam size = 4), "--greedy" for greedy search |
+| input tokens | use "--input-tokens" to pick a fixed input prompt size <INPUT_LENGTH> from [1024, 2048, 4096, 8192, 32768, 130944]; if "--input-tokens" is not used, use "--prompt" to supply your own input string |
+| output tokens | default: 32, use "--max-new-tokens" to choose any other size |
+| batch size |  default: 1, use "--batch-size" to choose any other size |
+| token latency |  enable "--token-latency" to print out the first or next token latency |
+| generation iterations |  use "--num-iter" and "--num-warmup" to control the repeated iterations of generation, default: 100-iter/10-warmup |
+| streaming mode output | greedy search only (work with "--greedy"), use "--streaming" to enable the streaming generation output |
+
+*Note:* You may need to log in to your HuggingFace account to access the model files. Please refer to [HuggingFace login](https://huggingface.co/docs/huggingface_hub/quick-start#login).
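A minimal sketch, not part of the committed file, of how a wrapper script might check an input length against the fixed sizes listed for "--input-tokens" in the table above before calling `run.py`. The helper name `is_supported_input_length` is hypothetical; only the list of sizes comes from the table.

```bash
# Hypothetical helper (not part of run.py): succeeds only for the fixed
# prompt sizes that the table lists for "--input-tokens".
is_supported_input_length() {
    case "$1" in
        1024|2048|4096|8192|32768|130944) return 0 ;;
        *) return 1 ;;
    esac
}

input_length=2048
if is_supported_input_length "$input_length"; then
    echo "use --input-tokens $input_length"
else
    echo "use --prompt with your own text instead"
fi
```

A `case` statement keeps the check in plain POSIX shell, so the same snippet should work inside the Docker image and in the conda environment.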
+
+## 2.1 Usage of running Llama 3.1 models
+
+The _\<LLAMA3_MODEL_ID_OR_LOCAL_PATH\>_ in the commands below specifies the Llama 3.1 model to run, which can be found on [HuggingFace Models](https://huggingface.co/models).
+
+### 2.1.1 Run generation with multiple instances on multiple CPU numa nodes
+
+#### 2.1.1.1 Prepare:
+
+```bash
+unset KMP_AFFINITY
+```
+
+In the DeepSpeed cases below, we recommend "--shard-model" to shard the model weights into more evenly sized pieces, which improves memory usage when running with DeepSpeed.
+
+With "--shard-model", a copy of the sharded model weight files is saved under the "--output-dir" path (default: "./saved_results" if not provided).
+If you have already used "--shard-model" and generated such a sharded model path (or your model weight files are already well sharded), remove "--shard-model" in later benchmark runs and replace "-m <LLAMA3_MODEL_ID_OR_LOCAL_PATH>" with "-m <shard model path>" to skip the repeated sharding step.
+
+A standalone sharding script is also provided in section 2.1.1.4, in case you would like to generate the sharded model weight files before running distributed inference.
+
+#### 2.1.1.2 BF16:
+
+- Command:
+```bash
+deepspeed --bind_cores_to_rank  run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex  --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model
+```
+
+#### 2.1.1.3 Weight-only quantization (INT8):
+
+By default, for weight-only quantization, we use quantization with [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) inference ("--quant-with-amp") to get peak performance and fair accuracy.
+For weight-only quantization with DeepSpeed, we quantize the model and then run the benchmark. The quantized model is not saved.
+
+- Command:
+```bash
+deepspeed --bind_cores_to_rank run.py  --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens <INPUT_LENGTH>  --autotp --shard-model --output-dir "saved_results"
+# Note: you can add "--group-size" to tune accuracy; suggested values are 32, 64, 128, 256, or 512.
+```
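To compare the suggested "--group-size" values, the command above can be generated once per value. The sketch below is a dry run that only prints the commands and is not part of the committed file. The function name `build_woq_command` is hypothetical, `MODEL` is left as the document's placeholder, and 1024 is just one of the documented "--input-tokens" sizes.

```bash
# Dry run: print (do not execute) one INT8 WOQ benchmark command per
# suggested group size. MODEL is a placeholder to be replaced by the reader.
MODEL="<LLAMA3_MODEL_ID_OR_LOCAL_PATH>"

build_woq_command() {
    # $1 = value for --group-size
    printf '%s\n' "deepspeed --bind_cores_to_rank run.py --benchmark -m $MODEL --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens 1024 --autotp --shard-model --output-dir saved_results --group-size $1"
}

for group_size in 32 64 128 256 512; do
    build_woq_command "$group_size"
done
```

Following the sharding note in 2.1.1.1, repeated runs may be better served by pointing "-m" at the saved shard path and dropping "--shard-model".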
+
+#### 2.1.1.4 How to Shard Model weight files for Distributed Inference with DeepSpeed
+
+To reduce memory usage, you can shard the model weight files under a local path before launching distributed tests with DeepSpeed.
+
+```bash
+cd ./utils
+# general command:
+python create_shard_model.py -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH>  --save-path ./local_llama3_model_shard
+# After sharding the model, use "-m ./local_llama3_model_shard" in later tests
+```
+
+### 2.1.2 Run generation with one socket inference
+#### 2.1.2.1 BF16:
+
+- Command:
+
+```bash
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH>
+```
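The placeholders `<physical cores num>` and `<physical cores list>` in the command above can be derived from `lscpu`. The following is a rough sketch under stated assumptions, not part of the committed file: the helper `physical_cpus_on_node` is hypothetical, and it expects input in the format produced by `lscpu -p=CPU,CORE,NODE`. A canned sample topology is used here so the snippet runs on any machine.

```bash
# Hypothetical helper (not shipped with the examples): reads
# `lscpu -p=CPU,CORE,NODE` output on stdin and prints a comma-separated list
# with the first logical CPU of each physical core on the given NUMA node.
physical_cpus_on_node() {
    awk -F, -v node="$1" '
        /^#/ { next }                        # skip lscpu header comments
        $3 == node && !seen[$2]++ {          # keep one CPU per core id
            list = (list == "" ? $1 : list "," $1)
        }
        END { print list }
    '
}

# Canned sample: 2 NUMA nodes, 2 cores each, 2 hyper-threads per core.
sample_topology='# CPU,Core,Node
0,0,0
1,1,0
2,2,1
3,3,1
4,0,0
5,1,0
6,2,1
7,3,1'

cores="$(printf '%s\n' "$sample_topology" | physical_cpus_on_node 0)"
num_cores="$(printf '%s\n' "$cores" | awk -F, '{ print NF }')"
# Prefix for the run.py command; prints: OMP_NUM_THREADS=2 numactl -m 0 -C 0,1
echo "OMP_NUM_THREADS=$num_cores numactl -m 0 -C $cores"
```

On a real machine, the sample would be replaced by `lscpu -p=CPU,CORE,NODE | physical_cpus_on_node 0`. Keeping one logical CPU per core avoids placing two threads on hyper-thread siblings.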
+
+#### 2.1.2.2 Weight-only quantization (INT8):
+
+By default, for weight-only quantization, we use quantization with [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) inference ("--quant-with-amp") to get peak performance and fair accuracy.
+
+- Command:
+
+```bash
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list>  python run.py  --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results"  --greedy --input-tokens <INPUT_LENGTH>
+# Note: you can add "--group-size" to tune accuracy; suggested values are 32, 64, 128, 256, or 512.
+```
+
+#### 2.1.2.3 Weight-only quantization (INT4):
+You can use auto-round (part of INC) to generate an INT4 WOQ model with the following steps.
+- Environment installation:
+```bash
+pip install git+https://github.com/intel/auto-round.git@e24b9074af6cdb099e31c92eb81b7f5e9a4a244e
+git clone https://github.com/intel/auto-round.git
+# Enter the cloned repository before checking out the pinned commit
+cd auto-round
+git checkout e24b9074af6cdb099e31c92eb81b7f5e9a4a244e
+cd examples/language-modeling
+```
+
+- Command (quantize):
+```bash
+python3 main.py --model_name  $model_name --device cpu --sym --nsamples 512 --iters 1000 --group_size 32 --deployment_device cpu --disable_eval --output_dir <INT4_MODEL_SAVE_PATH>
+```
+
+- Command (benchmark):
+```bash
+cd <LLM_DIR>
+IPEX_WOQ_GEMM_LOOP_SCHEME=ACB OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list>  python run.py  --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT4 --quant-with-amp --output-dir "saved_results"  --greedy --input-tokens <INPUT_LENGTH> --cache-weight-for-large-batch --low-precision-checkpoint <INT4_MODEL_SAVE_PATH>
+```
+
+#### 2.1.2.4 Notes:
+
+(1) [_numactl_](https://linux.die.net/man/8/numactl) is used to bind the memory and cores of your hardware for better performance. _\<node N\>_ specifies the [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access) node id (e.g., 0 to use the memory from the first numa node). _\<physical cores list\>_ specifies the physical cores you are using from the _\<node N\>_ numa node. You can use the [_lscpu_](https://man7.org/linux/man-pages/man1/lscpu.1.html) command in Linux to check the numa node information.
+
+(2) For all quantization benchmarks, both the quantization and inference stages are triggered by default. The quantization stage auto-generates the quantized model named "best_model.pt" in the "--output-dir" path, and the inference stage launches inference with that quantized model. For inference-only benchmarks (to avoid repeating the quantization stage), you can reuse the quantized model by adding "--quantized-model-path <output_dir + "best_model.pt">".
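The reuse described in note (2) can be sketched as a small wrapper that appends "--quantized-model-path" only when an earlier run has already left "best_model.pt" behind. This is an illustration and not part of the committed file: the function name `woq_flags` is hypothetical, and joining the output directory and file name with "/" is an assumption about what `<output_dir + "best_model.pt">` means.

```bash
# Hypothetical helper: print the INT8 WOQ flags for run.py, adding
# --quantized-model-path when <out_dir>/best_model.pt already exists
# (assumed path layout; the note only says output_dir + "best_model.pt").
woq_flags() {
    out_dir="$1"
    flags="--ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir $out_dir"
    if [ -f "$out_dir/best_model.pt" ]; then
        flags="$flags --quantized-model-path $out_dir/best_model.pt"
    fi
    printf '%s\n' "$flags"
}

# Show the flags that would be passed for the default output directory.
woq_flags "saved_results"
```

The printed flags are meant to be spliced into the single-socket commands of 2.1.2.2, so the first run quantizes and later runs skip that stage.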
+
+
+## Miscellaneous Tips
+Intel® Extension for PyTorch\* also provides dedicated optimizations for many other Large Language Models (LLMs), covering a set of data types for various scenarios. For more details, please check this [Intel® Extension for PyTorch\* doc](https://github.com/intel/intel-extension-for-pytorch/blob/release/2.3/README.md).
diff --git a/llm/llama3_1/cpu/_static/_sphinx_javascript_frameworks_compat.js b/llm/llama3_1/cpu/_static/_sphinx_javascript_frameworks_compat.js
@@ -0,0 +1,123 @@
+/* Compatibility shim for jQuery and underscore.js.
+ *
+ * Copyright Sphinx contributors
+ * Released under the two clause BSD licence
+ */
+
+/**
+ * small helper function to urldecode strings
+ *
+ * See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/decodeURIComponent#Decoding_query_parameters_from_a_URL
+ */
+jQuery.urldecode = function(x) {
+    if (!x) {
+        return x
+    }
+    return decodeURIComponent(x.replace(/\+/g, ' '));
+};
+
+/**
+ * small helper function to urlencode strings
+ */
+jQuery.urlencode = encodeURIComponent;
+
+/**
+ * This function returns the parsed url parameters of the
+ * current request. Multiple values per key are supported,
+ * it will always return arrays of strings for the value parts.
+ */
+jQuery.getQueryParameters = function(s) {
+    if (typeof s === 'undefined')
+        s = document.location.search;
+    var parts = s.substr(s.indexOf('?') + 1).split('&');
+    var result = {};
+    for (var i = 0; i < parts.length; i++) {
+        var tmp = parts[i].split('=', 2);
+        var key = jQuery.urldecode(tmp[0]);
+        var value = jQuery.urldecode(tmp[1]);
+        if (key in result)
+            result[key].push(value);
+        else
+            result[key] = [value];
+    }
+    return result;
+};
+
+/**
+ * highlight a given string on a jquery object by wrapping it in
+ * span elements with the given class name.
+ */
+jQuery.fn.highlightText = function(text, className) {
+    function highlight(node, addItems) {
+        if (node.nodeType === 3) {
+            var val = node.nodeValue;
+            var pos = val.toLowerCase().indexOf(text);
+            if (pos >= 0 &&
+                !jQuery(node.parentNode).hasClass(className) &&
+                !jQuery(node.parentNode).hasClass("nohighlight")) {
+                var span;
+                var isInSVG = jQuery(node).closest("body, svg, foreignObject").is("svg");
+                if (isInSVG) {
+                    span = document.createElementNS("http://www.w3.org/2000/svg", "tspan");
+                } else {
+                    span = document.createElement("span");
+                    span.className = className;
+                }
+                span.appendChild(document.createTextNode(val.substr(pos, text.length)));
+                node.parentNode.insertBefore(span, node.parentNode.insertBefore(
+                    document.createTextNode(val.substr(pos + text.length)),
+                    node.nextSibling));
+                node.nodeValue = val.substr(0, pos);
+                if (isInSVG) {
+                    var rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");
+                    var bbox = node.parentElement.getBBox();
+                    rect.x.baseVal.value = bbox.x;
+                    rect.y.baseVal.value = bbox.y;
+                    rect.width.baseVal.value = bbox.width;
+                    rect.height.baseVal.value = bbox.height;
+                    rect.setAttribute('class', className);
+                    addItems.push({
+                        "parent": node.parentNode,
+                        "target": rect});
+                }
+            }
+        }
+        else if (!jQuery(node).is("button, select, textarea")) {
+            jQuery.each(node.childNodes, function() {
+                highlight(this, addItems);
+            });
+        }
+    }
+    var addItems = [];
+    var result = this.each(function() {
+        highlight(this, addItems);
+    });
+    for (var i = 0; i < addItems.length; ++i) {
+        jQuery(addItems[i].parent).before(addItems[i].target);
+    }
+    return result;
+};
+
+/*
+ * backward compatibility for jQuery.browser
+ * This will be supported until firefox bug is fixed.
+ */
+if (!jQuery.browser) {
+    jQuery.uaMatch = function(ua) {
+        ua = ua.toLowerCase();
+
+        var match = /(chrome)[ \/]([\w.]+)/.exec(ua) ||
+            /(webkit)[ \/]([\w.]+)/.exec(ua) ||
+            /(opera)(?:.*version|)[ \/]([\w.]+)/.exec(ua) ||
+            /(msie) ([\w.]+)/.exec(ua) ||
+            ua.indexOf("compatible") < 0 && /(mozilla)(?:.*? rv:([\w.]+)|)/.exec(ua) ||
+            [];
+
+        return {
+            browser: match[ 1 ] || "",
+            version: match[ 2 ] || "0"
+        };
+    };
+    jQuery.browser = {};
+    jQuery.browser[jQuery.uaMatch(navigator.userAgent).browser] = true;
+}