From 2e4d306075293502a68d169de92eae5f16957a32 Mon Sep 17 00:00:00 2001 From: Jing Xu Date: Wed, 6 Jul 2022 14:58:30 +0900 Subject: [PATCH] update tutorial for 1.12 release (#942) update the runtime document update known issue of runtime extension [doc] update supported fusion patterns of fp32/bf16/int8 (#854) * update supported fusion patterns of fp32/bf16/int8 * fix typo doc: editor review of all tutorial docs (#863) - Lots of edits across the tutorial documents for grammar, clarity, simplification, and spelling - Fixed malformed md and rst causing layout issues (including indenting) - Removed trailing whitespace - Fixed UTF-8 characters in code examples (e.g, curley quotes vs. straight quotes) - Changed pygments language (code highlight) to bash for unsupported cmd - Changed absolute links to relative where appropriate. - Added toctree items to make documents visible in navigation menu. Signed-off-by: David B. Kinder update docs update int8.md update performance page with tunable parameters description update int8 example update torch-ccl package name update version in README update int8.md: change customer qconfig for dynamic quantization Add performance tuning guide for OneDNN primitive cache (#905) * Add performance tuning guide for OneDNN primitive cache * Update docs/tutorials/performance_tuning/tuning_guide.md Co-authored-by: Jiong Gong * Update tuning_guide.md Co-authored-by: Jiong Gong update doc for autocast (#899) add 2 known issues of MultiStreamModule update known issues update known issues update int8 doc add 1.12 release notes correct intel_extension_for_pytorch_structure.png update release notes, correct model zoo url in examples update docs update docs update graph_optimization.md --- docs/design_doc/isa_dyndisp.md | 70 ++--- docs/index.rst | 17 +- docs/tutorials/api_doc.rst | 3 +- docs/tutorials/blogs_publications.md | 1 + docs/tutorials/contribution.md | 133 +++++---- docs/tutorials/examples.md | 131 ++++++--- docs/tutorials/features.rst | 82 +++--- docs/tutorials/features/amp.md | 35 +-- docs/tutorials/features/graph_optimization.md | 247 +++++++++------- docs/tutorials/features/int8.md | 256 +++++++--------- .../features/isa_dynamic_dispatch.md | 122 ++++++++ docs/tutorials/features/nhwc.md | 32 +- docs/tutorials/features/optimizer_fusion.md | 8 +- docs/tutorials/features/runtime_extension.md | 48 +-- docs/tutorials/features/split_sgd.rst | 26 +- docs/tutorials/installation.md | 69 ++++- docs/tutorials/performance.md | 275 +++++++++++++++++- docs/tutorials/performance_tuning.rst | 2 +- .../performance_tuning/known_issues.md | 44 ++- .../performance_tuning/launch_script.md | 30 +- .../performance_tuning/torchserve.md | 96 +++--- .../performance_tuning/tuning_guide.md | 162 ++++++----- docs/tutorials/releases.md | 244 +++++++++++++++- .../intel_extension_for_pytorch_structure.png | Bin 89606 -> 83085 bytes .../quantization/README.md | 7 +- .../quantization/_quantize.py | 8 +- 26 files changed, 1458 insertions(+), 690 deletions(-) create mode 100644 docs/tutorials/features/isa_dynamic_dispatch.md diff --git a/docs/design_doc/isa_dyndisp.md b/docs/design_doc/isa_dyndisp.md index c608120c7..9be0a40df 100644 --- a/docs/design_doc/isa_dyndisp.md +++ b/docs/design_doc/isa_dyndisp.md @@ -1,10 +1,10 @@ -# IPEX CPU ISA Dynamic Dispatch Design Doc +# Intel® Extension for PyTorch\* CPU ISA Dynamic Dispatch Design Doc -This document explains the dynamic kernel dispatch mechanism based on CPU ISA. It is an extension to the similar mechanism in PyTorch. 
+This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch\* (IPEX) based on CPU ISA. It is an extension to the similar mechanism in PyTorch.
## Overview
----
-IPEX dyndisp is forked from **PyTorch:** `ATen/native/DispatchStub.h` and `ATen/native/DispatchStub.cpp`. Besides that, IPEX add more CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16` and `AMX`.
+
+IPEX dyndisp is forked from **PyTorch:** `ATen/native/DispatchStub.h` and `ATen/native/DispatchStub.cpp`. IPEX adds additional CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16` and `AMX`.
PyTorch & IPEX CPU ISA support statement:
| | DEFAULT | AVX2 | AVX512 | AVX512_VNNI | AVX512_BF16 | AMX |
@@ -23,19 +23,19 @@ PyTorch & IPEX CPU ISA support statement:
| AVX512_BF16 | GCC 10.3+ |
| AMX | GCC 11.2+ |
-\* Detailed compiler check, please check with `cmake/Modules/FindAVX.cmake`
+\* Check with `cmake/Modules/FindAVX.cmake` for detailed compiler checks.
## Dynamic Dispatch Design
----
-Dynamic dispatch major mechanism is to copy the kernel implementation source file to multiple folders for each ISA level. And then build each file using its ISA specific parameters. Each generated object file will contains its function body(**Kernel Implementation**).
-Kernel Implementation use anonymous namespace so that different cpu versions won't conflict.
+Dynamic dispatch copies the kernel implementation source files to multiple folders for each ISA level. It then builds each file using its ISA specific parameters. Each generated object file will contain its function body (**Kernel Implementation**).
-**Kernel Stub** is a "virtual function" with polymorphic kernel implementations w.r.t. ISA levels.
+Kernel Implementation uses an anonymous namespace so that different CPU versions won't conflict.
-At the runtime, **Dispatch Stub implementation** will check CPUIDs and OS status to determins which ISA level pointer to best matching function body.
+**Kernel Stub** is a "virtual function" with polymorphic kernel implementations pertaining to ISA levels.
-### Code Folder Struct
+At runtime, **Dispatch Stub implementation** will check CPUIDs and OS status to determine which ISA level pointer best matches the function body.
+
+### Code Folder Struct
>#### **Kernel implementation:** `intel_extension_for_pytorch/csrc/aten/cpu/kernels/xyzKrnl.cpp`
>#### **Kernel Stub:** `intel_extension_for_pytorch/csrc/aten/cpu/xyz.cpp` and `intel_extension_for_pytorch/csrc/aten/cpu/xyz.h`
>#### **Dispatch Stub implementation:** `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.cpp` and `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h`
@@ -46,8 +46,10 @@ IPEX build system will generate code for each ISA level with specifiy complier p
The CodeGen will copy each cpp file from **Kernel implementation**, and then add the ISA level as a new file suffix.
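To make the copy-and-suffix step easier to picture, here is a small illustrative sketch of what the CodeGen conceptually does for each ISA level. It is not the project's actual build script; the helper name and the flag table are simplified placeholders, and the real generated file names and compiler parameters appear in the sample below.

```python
import shutil
from pathlib import Path

# Simplified, hypothetical flag table -- the real values come from the build system.
ISA_FLAGS = {
    "AVX2": "-mavx2 -mfma -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2",
    "AVX512": "-mavx512f -mavx512bw -mavx512vl -mavx512dq -mfma -DCPU_CAPABILITY=AVX512 -DCPU_CAPABILITY_AVX512",
}

def generate_isa_copies(kernel_cpp, build_dir):
    """Copy one kernel source file per ISA level and pair each copy with its compile flags."""
    outputs = []
    for isa, flags in ISA_FLAGS.items():
        src = Path(kernel_cpp)
        # e.g. xyzKrnl.cpp -> xyzKrnl.cpp.AVX512.cpp
        dst = Path(build_dir) / f"{src.name}.{isa}.cpp"
        shutil.copyfile(src, dst)
        outputs.append((dst, flags))
    return outputs
```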
> **Sample:**
+>
> ----
-> **Origin file:**
+>
+> **Origin file:**
>
> `intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp`
>
@@ -64,7 +66,9 @@ The CodeGen will copy each cpp files from **Kernel implementation**, and then ad
> AVX512_BF16: `build/Release/intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_BF16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -DCPU_CAPABILITY=AVX512_BF16 -DCPU_CAPABILITY_AVX512_BF16`
>
> AMX: `build/Release/intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp.AMX.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -DCPU_CAPABILITY=AMX -DCPU_CAPABILITY_AMX`
+ ---
+
>**Note:**
>1. DEFAULT level kernels are not fully implemented in IPEX. In order to align with PyTorch, the default build uses AVX2 parameters instead. So, the minimal requirement for a machine executing IPEX is AVX2 support.
>2. `-D__AVX__` and `-D__AVX512F__` are defined for the dependent library [sleef](https://sleef.org/).
@@ -73,12 +77,12 @@ The CodeGen will copy each cpp files from **Kernel implementation**, and then ad
>5. A higher ISA level is compatible with lower ISA levels, so it needs to contain the lower levels' ISA feature definitions. For example, AVX512_BF16 needs to contain `-DCPU_CAPABILITY_AVX512` and `-DCPU_CAPABILITY_AVX512_VNNI`. But AVX512 does not contain AVX2 definitions, because the vec registers have different widths.
## Add Custom Kernel
----
-If you want to add new custom kernel, and the kernel using CPU ISA instruction. Please reference to below steps.
-1. Please add CPU ISA related kernel implementation to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/kernels/NewKernelKrnl.cpp`
-2. Please add kernel stub to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/NewKernel.cpp`
-3. Please include header file: `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h`, and reference to the comment in the header file.
+If you want to add a new custom kernel, and the kernel uses CPU ISA instructions, refer to these tips:
+
+1. Add CPU ISA related kernel implementation to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/kernels/NewKernelKrnl.cpp`
+2. Add kernel stub to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/NewKernel.cpp`
+3. Include header file: `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h`, and reference to the comment in the header file.
```c++
// Implements instruction set specific function dispatch.
//
@@ -111,9 +115,9 @@ If you want to add new custom kernel, and the kernel using CPU ISA instruction.
>**Note:**
>
->1. Some kernel only call **oneDNN** or **iDeep** implementation, or other backend implementation. Which is not need to add kernel implementation. (Refer: `BatchNorm.cpp`)
->2. Vec related header file must be included in kernel implementation file, but can not be included in kernel stub. Kernel stub is common code for all ISA level, and can't pass ISA related compiler parameters.
->3. More intrinsics please check at [Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
+>1. Some kernels only call the **oneDNN** or **iDeep** implementation, or another backend implementation; in that case, there is no need to add a kernel implementation. (Refer: `BatchNorm.cpp`)
+>2. Vec related header files must be included in kernel implementation files, but can not be included in the kernel stub. The kernel stub is common code for all ISA levels, and can't pass ISA related compiler parameters.
+>3. For more intrinsics, check the [Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
### ISA intrinsics specific kernel example:
@@ -163,7 +167,7 @@ void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len) {
```
Macro `CPU_CAPABILITY_AVX512` and `CPU_CAPABILITY_AVX512_BF16` are defined by compiler check; it means that the current compiler is capable of generating code for the defined ISA level.
-Because of `AVX512_BF16` is higher level than `AVX512`, and it compatible to `AVX512`. `CPU_CAPABILITY_AVX512_BF16` can be contained in `CPU_CAPABILITY_AVX512` region. 
+Because `AVX512_BF16` is a higher level than `AVX512` and is compatible with it, `CPU_CAPABILITY_AVX512_BF16` can be contained in the `CPU_CAPABILITY_AVX512` region.
```c++
//csrc/aten/cpu/kernels/CvtFp32ToBf16Krnl.cpp
@@ -247,7 +251,7 @@ REGISTER_DISPATCH(cvt_fp32_to_bf16_kernel_stub, &cvt_fp32_to_bf16_kernel_impl);
```
### Vec specific kernel example:
-This example show get data type size and Its Vec size. In different ISA, Vec has different register width, and it has different Vec size also.
+This example shows how to get the data type size and its Vec size. In different ISAs, Vec has a different register width and thus a different Vec size.
```c++
//csrc/aten/cpu/GetVecLength.h
@@ -354,8 +358,8 @@ REGISTER_DISPATCH(
```
## Private Debug APIs
----
-Here three ISA related private APIs could do same debug work. Which contains:
+
+Here are three ISA-related private APIs that can help with debugging:
1. Query current ISA level.
2. Query max CPU supported ISA level.
3. Query max binary supported ISA level.
@@ -363,10 +367,10 @@ Here three ISA related private APIs could do same debug work. Which contains:
>
>1. Max CPU supported ISA level only depends on CPU features.
>2. Max binary supported ISA level only depends on the compiler version used to build the binary.
->3. Current ISA level, it is equal minimal of `max CPU ISA level` and `max binary ISA level`.
+>3. The current ISA level is the smaller of `max CPU ISA level` and `max binary ISA level`.
### Example:
-```cmd
+```bash
python
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
@@ -382,10 +386,10 @@ Type "help", "copyright", "credits" or "license" for more information.
```
## Select ISA level manually.
----
-By default, IPEX dispatches to the kernels with maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable `ATEN_CPU_CAPABILITY` (same environment variable from PyTorch). The available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`}. The effective ISA level would be the minimal level between `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware.
+
+By default, IPEX dispatches to the kernels with the maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable `ATEN_CPU_CAPABILITY` (same environment variable as PyTorch). The available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`}. The effective ISA level would be the minimal level between `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware.
### Example:
-```cmd
+```bash
$ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
AMX
$ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
@@ -393,13 +397,13 @@ AVX2
```
>**Note:**
>
->`core._get_current_isa_level()` is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purpose only and subjects to change.
+>`core._get_current_isa_level()` is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purposes only and subject to change.
## CPU feature check
----
+
An additional CPU feature check tool is provided in the subfolder: `tests/cpu/isa`
-```cmd
+```bash
$ cmake .
-- The C compiler identification is GNU 11.2.1
-- The CXX compiler identification is GNU 11.2.1
@@ -466,4 +470,4 @@ amx_tile: true
amx_int8: true
prefetchw: true
prefetchwt1: false
-```
\ No newline at end of file
+```
diff --git a/docs/index.rst b/docs/index.rst
index be67a823c..257fed167 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -3,19 +3,24 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
-Welcome to Intel® Extension for PyTorch* documentation!
-#######################################################
+Welcome to Intel® Extension for PyTorch* Documentation
+######################################################
-Intel® Extension for PyTorch* extends PyTorch with optimizations for extra performance boost on Intel hardware. Most of the optimizations will be included in stock PyTorch releases eventually, and the intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware, examples include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).
+Intel® Extension for PyTorch* extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware. Example optimizations use AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX). Over time, most of these optimizations will be included directly into stock PyTorch releases.
-Intel® Extension for PyTorch* is structured as the following figure. It is loaded as a Python module for Python programs or linked as a C++ library for C++ programs. Users can enable it dynamically in script by importing `intel_extension_for_pytorch`. It covers optimizations for both imperative mode and graph mode. Optimized operators and kernels are registered through PyTorch dispatching mechanism. These operators and kernels are accelerated from native vectorization feature and matrix calculation feature of Intel hardware. During execution, Intel® Extension for PyTorch* intercepts invocation of ATen operators, and replace the original ones with these optimized ones. In graph mode, further operator fusions are applied manually by Intel engineers or through a tool named *oneDNN Graph* to reduce operator/kernel invocation overheads, and thus increase performance.
+Intel® Extension for PyTorch* provides optimizations for both eager mode and graph mode. However, compared to eager mode, graph mode in PyTorch normally yields better performance from optimization techniques such as operation fusion, and Intel® Extension for PyTorch* amplifies them with more comprehensive graph optimizations. Therefore we recommend you take advantage of Intel® Extension for PyTorch* with `TorchScript `_ whenever your workload supports it. You could choose to run with the `torch.jit.trace()` function or the `torch.jit.script()` function, but based on our evaluation, `torch.jit.trace()` supports more workloads, so we recommend using `torch.jit.trace()` as your first choice. More detailed information can be found at the `pytorch.org website `_.
-.. image:: ../images/intel_extension_for_pytorch_structure.png
+The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing `intel_extension_for_pytorch`.
+
+Intel® Extension for PyTorch* is structured as shown in the following figure:
+
+.. figure:: ../images/intel_extension_for_pytorch_structure.png
:width: 800
:align: center
:alt: Structure of Intel® Extension for PyTorch*
-|
+
+PyTorch components are depicted with white boxes while Intel® Extension for PyTorch* components are depicted with blue boxes. Extra performance of the extension is delivered via both custom addons and overriding existing PyTorch components. In eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers and INT8 quantization API. Further performance boosting is available by converting the eager-mode model into graph mode via the extended graph fusion passes. Intel® Extension for PyTorch* dispatches the operators into their underlying kernels automatically based on the ISA that it detects and leverages vectorization and matrix acceleration units available in Intel hardware, as much as possible. The oneDNN library is used for computation intensive operations. The Intel® Extension for PyTorch* runtime extension brings better efficiency with finer-grained thread runtime control and weight sharing.
Intel® Extension for PyTorch* has been released as an open-source project at `Github `_.
diff --git a/docs/tutorials/api_doc.rst b/docs/tutorials/api_doc.rst
index 02a430226..d710296d8 100644
--- a/docs/tutorials/api_doc.rst
+++ b/docs/tutorials/api_doc.rst
@@ -13,8 +13,7 @@ Quantization
************
.. automodule:: intel_extension_for_pytorch.quantization
-.. autofunction:: QuantConf
-.. autoclass:: calibrate
+.. autofunction:: prepare
.. autofunction:: convert
CPU Runtime
diff --git a/docs/tutorials/blogs_publications.md b/docs/tutorials/blogs_publications.md
index fcc05ff5d..ec01860d9 100644
--- a/docs/tutorials/blogs_publications.md
+++ b/docs/tutorials/blogs_publications.md
@@ -1,6 +1,7 @@
Blogs & Publications
====================
+* [Accelerating PyTorch with Intel® Extension for PyTorch\*](https://medium.com/pytorch/accelerating-pytorch-with-intel-extension-for-pytorch-3aef51ea3722)
* [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intel-facebook-boost-bfloat16.html)
* [Accelerate PyTorch with the extension and oneDNN using Intel BF16 Technology](https://medium.com/pytorch/accelerate-pytorch-with-ipex-and-onednn-using-intel-bf16-technology-dca5b8e6b58f)
* *Note*: APIs mentioned in it are deprecated.
diff --git a/docs/tutorials/contribution.md b/docs/tutorials/contribution.md index 188984e71..4f618d23a 100644 --- a/docs/tutorials/contribution.md +++ b/docs/tutorials/contribution.md @@ -3,95 +3,94 @@ Contribution ## Contributing to Intel® Extension for PyTorch\* -Thank you for your interest in contributing to Intel® Extension for PyTorch\*! Before you begin writing code, it is important that you share your intention to contribute with the team, based on the type of contribution: +Thank you for your interest in contributing to Intel® Extension for PyTorch\*. Before you begin writing code, it is important that you share your intention to contribute with the team, based on the type of contribution: 1. You want to propose a new feature and implement it. - - Post about your intended feature in an [issue](https://github.com/intel/intel-extension-for-pytorch/issues), and we shall discuss the design and implementation. Once we agree that the plan looks good, go ahead and implement it. + - Post about your intended feature in a [GitHub issue](https://github.com/intel/intel-extension-for-pytorch/issues), and we shall discuss the design and implementation. Once we agree that the plan looks good, go ahead and implement it. 2. You want to implement a feature or bug-fix for an outstanding issue. - - Search for your issue in the [issue list](https://github.com/intel/intel-extension-for-pytorch/issues). + - Search for your issue in the [GitHub issue list](https://github.com/intel/intel-extension-for-pytorch/issues). - Pick an issue and comment that you'd like to work on the feature or bug-fix. - - If you need more context on a particular issue, please ask and we shall provide. + - If you need more context on a particular issue, ask and we shall provide. -Once you implement and test your feature or bug-fix, please submit a Pull Request to https://github.com/intel/intel-extension-for-pytorch. +Once you implement and test your feature or bug-fix, submit a Pull Request to https://github.com/intel/intel-extension-for-pytorch. ## Developing Intel® Extension for PyTorch\* -A full set of instructions on installing Intel® Extension for PyTorch\* from source is here: -https://github.com/intel/intel-extension-for-pytorch#install-extension-by-compiling-from-source +A full set of instructions on installing Intel® Extension for PyTorch\* from source is in the [Installation document](instalation.md#install-via-source-compilation). To develop on your machine, here are some tips: 1. Uninstall all existing Intel® Extension for PyTorch\* installs. You may need to run `pip uninstall intel_extension_for_pytorch` multiple times. You'll know `intel_extension_for_pytorch` is fully uninstalled when you see `WARNING: Skipping intel_extension_for_pytorch as it is not installed`. (You should only have to `pip uninstall` a few times, but you can always `uninstall` with `timeout` or in a loop if you're feeling lazy.) -```bash -yes | pip uninstall intel_extension_for_pytorch -``` + ```bash + yes | pip uninstall intel_extension_for_pytorch + ``` 2. Clone a copy of Intel® Extension for PyTorch\* from source: -```bash -git clone https://github.com/intel/intel-extension-for-pytorch.git -cd intel-extension-for-pytorch -``` + ```bash + git clone https://github.com/intel/intel-extension-for-pytorch.git + cd intel-extension-for-pytorch + ``` -2.1. 
If you already have Intel® Extension for PyTorch\* from source, update it: + If you already have Intel® Extension for PyTorch\* from source, update it: -```bash -git pull --rebase -git submodule sync --recursive -git submodule update --init --recursive --jobs 0 -``` + ```bash + git pull --rebase + git submodule sync --recursive + git submodule update --init --recursive --jobs 0 + ``` -If you want to have no-op incremental rebuilds (which are fast), see the section below titled "Make no-op build fast." + If you want to have no-op incremental rebuilds (which are fast), see the section below titled "Make no-op build fast." 3. Install Intel® Extension for PyTorch\* in `develop` mode: -The change you have to make is to replace + Replace: -```bash -python setup.py install -``` + ```bash + python setup.py install + ``` -with + with: -```bash -python setup.py develop -``` + ```bash + python setup.py develop + ``` -This mode will symlink the Python files from the current local source tree into the Python install. Hence, if you modify a Python file, you do not need to reinstall PyTorch again and again. This is especially useful if you are only changing Python files. + This mode will symlink the Python files from the current local source tree into the Python install. After than, if you modify a Python file, you do not need to reinstall PyTorch again. This is especially useful if you are only changing Python files. -For example: -- Install local Intel® Extension for PyTorch\* in `develop` mode -- modify your Python file `intel_extension_for_pytorch/__init__.py` (for example) -- test functionality -- modify your Python file `intel_extension_for_pytorch/__init__.py` -- test functionality -- modify your Python file `intel_extension_for_pytorch/__init__.py` -- test functionality + For example: + - Install local Intel® Extension for PyTorch\* in `develop` mode + - modify your Python file `intel_extension_for_pytorch/__init__.py` (for example) + - test functionality + - modify your Python file `intel_extension_for_pytorch/__init__.py` + - test functionality + - modify your Python file `intel_extension_for_pytorch/__init__.py` + - test functionality -You do not need to repeatedly install after modifying Python files (`.py`). However, you would need to reinstall if you modify Python interface (`.pyi`, `.pyi.in`) or non-Python files (`.cpp`, `.cc`, `.cu`, `.h`, ...). +You do not need to repeatedly install after modifying Python files (`.py`). However, you would need to reinstall if you modify a Python interface (`.pyi`, `.pyi.in`) or non-Python files (`.cpp`, `.cc`, `.cu`, `.h`, etc.). -In case you want to reinstall, make sure that you uninstall Intel® Extension for PyTorch\* first by running `pip uninstall intel_extension_for_pytorch` until you see `WARNING: Skipping intel_extension_for_pytorch as it is not installed`; next run `python setup.py clean`. After that, you can install in `develop` mode again. +If you want to reinstall, make sure that you uninstall Intel® Extension for PyTorch\* first by running `pip uninstall intel_extension_for_pytorch` until you see `WARNING: Skipping intel_extension_for_pytorch as it is not installed`; next run `python setup.py clean`. After that, you can install in `develop` mode again. ### Tips and Debugging -* A prerequisite to installing Intel® Extension for PyTorch\* is CMake. We recommend installing it with [Homebrew](https://brew.sh/) with `brew install cmake` if you are developing on MacOS or Linux system. 
+* Cmake must be installed before installing Intel® Extension for PyTorch\*. If youre developing on MacOS or Linux, We recommend installing Cmake with [Homebrew](https://brew.sh/) with `brew install cmake`. * Our `setup.py` requires Python >= 3.6 * If you run into errors when running `python setup.py develop`, here are some debugging steps: 1. Run `printf '#include \nint main() { printf("Hello World");}'|clang -x c -; ./a.out` to make sure your CMake works and can compile this simple Hello World program without errors. - 2. Nuke your `build` directory. The `setup.py` script compiles binaries into the `build` folder and caches many details along the way, which saves time the next time you build. If you're running into issues, you can always `rm -rf build` from the toplevel `pytorch` directory and start over. + 2. Remove your `build` directory. The `setup.py` script compiles binaries into the `build` folder and caches many details along the way. This saves time the next time you build. If you're running into issues, you can always `rm -rf build` from the toplevel `pytorch` directory and start over. 3. If you have made edits to the Intel® Extension for PyTorch\* repo, commit any change you'd like to keep and clean the repo with the following commands (note that clean _really_ removes all untracked files and changes.): - ```bash - git submodule deinit -f . - git clean -xdf - python setup.py clean - git submodule update --init --recursive --jobs 0 # very important to sync the submodules - python setup.py develop # then try running the command again - ``` + ```bash + git submodule deinit -f . + git clean -xdf + python setup.py clean + git submodule update --init --recursive --jobs 0 # very important to sync the submodules + python setup.py develop # then try running the command again + ``` 4. The main step within `python setup.py develop` is running `make` from the `build` directory. If you want to experiment with some environment variables, you can pass them into the command: - ```bash - ENV_KEY1=ENV_VAL1[, ENV_KEY2=ENV_VAL2]* python setup.py develop - ``` + ```bash + ENV_KEY1=ENV_VAL1[, ENV_KEY2=ENV_VAL2]* python setup.py develop + ``` ## Unit testing @@ -107,7 +106,7 @@ python test/cpu/test_jit.py You can narrow down what you're testing even further by specifying the name of an individual test with `TESTCLASSNAME.TESTNAME`. Here, `TESTNAME` is the name of the test you want to run, and `TESTCLASSNAME` is the name of the class in which it is defined. -Going off the above example, let's say you want to run `test_Sequential`, which is defined as part of the `TestJit` class in `test/cpu/test_jit.py`. Your command would be: +Let's say you want to run `test_Sequential`, which is defined as part of the `TestJit` class in `test/cpu/test_jit.py`. Your command would be: ```bash python test/test_jit.py TestJit.test_Sequential @@ -119,7 +118,7 @@ The `expecttest` and `hypothesis` libraries must be installed to run the tests. We don't officially support `pytest`, but it works well with our `unittest` tests and offers a number of useful features for local developing. Install it via `pip install pytest`. 
-If you want to just run tests that contain a specific substring, you can use the `-k` flag: +If you want to run only tests that contain a specific substring, you can use the `-k` flag: ```bash pytest test/cpu/test_nn.py -k Loss -v @@ -145,7 +144,7 @@ These jobs may require extra dependencies that aren't dependencies of Intel® Ex make setup_lint ``` -To run a specific linting step, use one of these targets or see the [`Makefile`](Makefile) for a complete list of options. +To run a specific linting step, use one of these targets or see the Makefile for a complete list of options. ```bash # Check for tabs, trailing newlines, etc. @@ -178,7 +177,7 @@ For example, if you wanted to run the test `MayContainAlias`, which is part of t ## Writing documentation -So you want to write some documentation and don't know where to start? +So you want to write some documentation for your code contribution and don't know where to start? Intel® Extension for PyTorch\* uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) for formatting docstrings. Length of line inside docstrings block must be limited to 80 characters to fit into Jupyter documentation popups. @@ -186,22 +185,22 @@ Intel® Extension for PyTorch\* uses [Google style](http://sphinxcontrib-napoleo To build the documentation: -1. Build and install Intel® Extension for PyTorch\* +1. Build and install Intel® Extension for PyTorch\* (as discussed above) -2. Install the prerequisites +2. Install the prerequisites: -```bash -cd docs -pip install -r requirements.txt -``` + ```bash + cd docs + pip install -r requirements.txt + ``` 3. Generate the documentation HTML files. The generated files will be in `docs/_build/html`. -```bash -make clean -make html -``` + ```bash + make clean + make html + ``` #### Tips -The `.rst` source files live in [docs/tutorials](https://github.com/intel/intel-extension-for-pytorch/tree/master/docs/tutorials). Some of the `.rst` files pull in docstrings from Intel® Extension for PyTorch\* Python code (for example, via the `autofunction` or `autoclass` directives). To vastly shorten doc build times, it is helpful to remove the files you are not working on, only keeping the base `index.rst` file and the files you are editing. The Sphinx build will produce missing file warnings but will still complete. +The `.rst` source files live in [docs/tutorials](https://github.com/intel/intel-extension-for-pytorch/tree/master/docs/tutorials). Some of the `.rst` files pull in docstrings from Intel® Extension for PyTorch\* Python code (for example, via the `autofunction` or `autoclass` directives). To shorten doc build times, it is helpful to remove the files you are not working on, only keeping the base `index.rst` file and the files you are editing. The Sphinx build will produce missing file warnings but will still complete. diff --git a/docs/tutorials/examples.md b/docs/tutorials/examples.md index aab68d350..a3aa26733 100644 --- a/docs/tutorials/examples.md +++ b/docs/tutorials/examples.md @@ -7,12 +7,17 @@ Examples #### Code Changes Highlight +There are only a few lines of code change required to use Intel® Extension for PyTorch\* on training, as shown: +1. `torch.channels_last` should be applied to both of the model object and data to raise CPU resource usage efficiency. +2. `ipex.optimize` function applies optimizations against the model object, as well as an optimizer object. + ``` ... import torch import intel_extension_for_pytorch as ipex ... 
model = Model() +model = model.to(memory_format=torch.channels_last) criterion = ... optimizer = ... model.train() @@ -56,6 +61,7 @@ train_loader = torch.utils.data.DataLoader( ) model = torchvision.models.resnet50() +model = model.to(memory_format=torch.channels_last) criterion = torch.nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9) model.train() @@ -104,6 +110,7 @@ train_loader = torch.utils.data.DataLoader( ) model = torchvision.models.resnet50() +model = model.to(memory_format=torch.channels_last) criterion = torch.nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9) model.train() @@ -116,7 +123,7 @@ for batch_idx, (data, target) in enumerate(train_loader): data = data.to(memory_format=torch.channels_last) output = model(data) loss = criterion(output, target) - loss.backward() + loss.backward() optimizer.step() print(batch_idx) torch.save({ @@ -129,14 +136,14 @@ torch.save({ Distributed training with PyTorch DDP is accelerated by oneAPI Collective Communications Library Bindings for Pytorch\* (oneCCL Bindings for Pytorch\*). The extension supports FP32 and BF16 data types. More detailed information and examples are available at its [Github repo](https://github.com/intel/torch-ccl). -**Note:** When performing distributed training with BF16 data type, please use oneCCL Bindings for Pytorch\*. Due to a PyTorch limitation, distributed training with BF16 data type with Intel® Extension for PyTorch\* is not supported. +**Note:** When performing distributed training with BF16 data type, use oneCCL Bindings for Pytorch\*. Due to a PyTorch limitation, distributed training with BF16 data type with Intel® Extension for PyTorch\* is not supported. ``` import os import torch import torch.distributed as dist import torchvision -import torch_ccl +import oneccl_bindings_for_pytorch as torch_ccl import intel_extension_for_pytorch as ipex LR = 0.001 @@ -193,6 +200,10 @@ torch.save({ ## Inference +Channels last is a memory layout format that is more friendly to Intel Architecture. We recommend using this memory layout format for computer vision workloads by invoking `to(memory_format=torch.channels_last)` function against the model object and input data. + +The `optimize` function of Intel® Extension for PyTorch\* applies optimizations to the model, bringing additional performance boosts. For both computer vision workloads and NLP workloads, we recommend applying the `optimize` function against the model object. + ### Float32 #### Imperative Mode @@ -244,6 +255,8 @@ with torch.no_grad(): #### TorchScript Mode +We recommend you take advantage of Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations. + ##### Resnet50 ``` @@ -299,6 +312,9 @@ with torch.no_grad(): ### BFloat16 +Similar to running with FP32, the `optimize` function also works for BFloat16 data type. The only difference is setting `dtype` parameter to `torch.bfloat16`. +We recommend using Auto Mixed Precision (AMP) with BFloat16 data type. + #### Imperative Mode ##### Resnet50 @@ -350,6 +366,8 @@ with torch.no_grad(): #### TorchScript Mode +We recommend you take advantage of Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations. + ##### Resnet50 ``` @@ -406,80 +424,123 @@ with torch.no_grad(): ### INT8 +Starting from Intel® Extension for PyTorch\* 1.12.0, quantization feature supports both static and dynamic modes. 
+
#### Calibration
+##### Static Quantization
+
+Please follow the steps below to perform static calibration:
+
+1. Import `intel_extension_for_pytorch` as `ipex`.
+2. Import `prepare` and `convert` from `intel_extension_for_pytorch.quantization`.
+3. Instantiate a config object from `torch.ao.quantization.QConfig` to save configuration data during calibration.
+4. Prepare the model for calibration.
+5. Perform calibration against the dataset.
+6. Invoke the `ipex.quantization.convert` function to apply the calibration configuration to the FP32 model object and get an INT8 model.
+7. Save the INT8 model into a `pt` file.
+
+
```
import os
import torch
+#################### code changes ####################
+import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
+######################################################
model = Model()
model.eval()
data = torch.rand()
-# Applying torch.fx.experimental.optimization.fuse against model performs
-# conv-batchnorm folding for better performance.
-import torch.fx.experimental.optimization as optimization
-model = optimization.fuse(model, inplace=True)
-
-#################### code changes ####################
-import intel_extension_for_pytorch as ipex
-conf = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine)
-######################################################
+qconfig = ipex.quantization.default_static_qconfig
+# Alternatively, define your own qconfig:
+#from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig
+#qconfig = QConfig(activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8),
+#        weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))
+prepared_model = prepare(model, qconfig, example_inputs=data, inplace=False)
-for d in calibration_data_loader():
-    # conf will be updated with observed statistics during calibrating with the dataset
-    with ipex.quantization.calibrate(conf):
-        model(d)
+for d in calibration_data_loader():
+    prepared_model(d)
-conf.save('int8_conf.json', default_recipe=True)
+converted_model = convert(prepared_model)
with torch.no_grad():
-    model = ipex.quantization.convert(model, conf, torch.rand())
-    model.save('quantization_model.pt')
+    traced_model = torch.jit.trace(converted_model, data)
+    traced_model = torch.jit.freeze(traced_model)
+
+traced_model.save("quantized_model.pt")
```
-#### Deployment
+##### Dynamic Quantization
-##### Imperative Mode
+Please follow the steps below to perform dynamic quantization:
+
+1. Import `intel_extension_for_pytorch` as `ipex`.
+2. Import `prepare` and `convert` from `intel_extension_for_pytorch.quantization`.
+3. Instantiate a config object from `torch.ao.quantization.QConfig` to save configuration data.
+4. Prepare the model for quantization.
+5. Convert the model.
+6. Run inference to perform dynamic quantization.
+7. Save the INT8 model into a `pt` file.
```
+import os
import torch
+#################### code changes ####################
+import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
+######################################################
model = Model()
model.eval()
data = torch.rand()
-# Applying torch.fx.experimental.optimization.fuse against model performs
-# conv-batchnorm folding for better performance.
-import torch.fx.experimental.optimization as optimization
-model = optimization.fuse(model, inplace=True)
-
-#################### code changes ####################
-import intel_extension_for_pytorch as ipex
-conf = ipex.quantization.QuantConf('int8_conf.json')
-######################################################
+dynamic_qconfig = ipex.quantization.default_dynamic_qconfig
+# Alternatively, define your own qconfig:
+#from torch.ao.quantization import PerChannelMinMaxObserver, PlaceholderObserver, QConfig
+#dynamic_qconfig = QConfig(
+#    activation = PlaceholderObserver.with_args(dtype=torch.float, compute_dtype=torch.quint8),
+#    weight = PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))
+prepared_model = prepare(model, dynamic_qconfig, example_inputs=data)
+converted_model = convert(prepared_model)
with torch.no_grad():
-    model = ipex.quantization.convert(model, conf, torch.rand())
-    model(data)
+    traced_model = torch.jit.trace(converted_model, data)
+    traced_model = torch.jit.freeze(traced_model)
+
+traced_model.save("quantized_model.pt")
```
-##### Graph Mode
+#### Deployment
+
+For deployment, the INT8 model is loaded from the local file and can be used directly for inference.
+
+Follow the steps below:
+
+1. Import `intel_extension_for_pytorch` as `ipex`.
+2. Load the INT8 model from the saved file.
+3. Run inference.
```
import torch
+#################### code changes ####################
import intel_extension_for_pytorch as ipex
+######################################################
model = torch.jit.load('quantization_model.pt')
model.eval()
+model = torch.jit.freeze(model)
data = torch.rand()
with torch.no_grad():
    model(data)
```
+oneDNN provides [oneDNN Graph Compiler](https://github.com/oneapi-src/oneDNN/tree/dev-graph-preview4/doc#onednn-graph-compiler) as a prototype feature that could boost performance for selected topologies. No code change is required. Install a binary with this feature enabled. We verified this feature with `Bert-large`, `bert-base-cased`, `roberta-base`, `xlm-roberta-base`, `google-electra-base-generator` and `google-electra-base-discriminator`.
+
## C++
-To work with libtorch, C++ library of PyTorch, Intel® Extension for PyTorch\* provides its C++ dynamic library as well. The C++ library is supposed to handle inference workload only, such as service deployment. For regular development, please use Python interface. Comparing to usage of libtorch, no specific code changes are required, except for converting input data into channels last data format. Compilation follows the recommended methodology with CMake. Detailed instructions can be found in [PyTorch tutorial](https://pytorch.org/tutorials/advanced/cpp_export.html#depending-on-libtorch-and-building-the-application).
+To work with libtorch, the C++ library of PyTorch, Intel® Extension for PyTorch\* provides its C++ dynamic library as well. The C++ library is supposed to handle inference workload only, such as service deployment. For regular development, use the Python interface. Unlike using libtorch, no specific code changes are required, except for converting input data into channels last data format. Compilation follows the recommended methodology with CMake. Detailed instructions can be found in [PyTorch tutorial](https://pytorch.org/tutorials/advanced/cpp_export.html#depending-on-libtorch-and-building-the-application).
During compilation, Intel optimizations will be activated automatically once the C++ dynamic library of Intel® Extension for PyTorch\* is linked.
@@ -578,4 +639,4 @@ $ ldd example-app ## Model Zoo -Use cases that had already been optimized by Intel engineers are available at [Model Zoo for Intel® Architecture](https://github.com/IntelAI/models/tree/pytorch-r1.10-models). A bunch of PyTorch use cases for benchmarking are also available on the [Github page](https://github.com/IntelAI/models/tree/pytorch-r1.10-models/benchmarks#pytorch-use-cases). You can get performance benefits out-of-box by simply running scipts in the Model Zoo. +Use cases that had already been optimized by Intel engineers are available at [Model Zoo for Intel® Architecture](https://github.com/IntelAI/models/tree/pytorch-r1.12-models). A bunch of PyTorch use cases for benchmarking are also available on the [GitHub page](https://github.com/IntelAI/models/tree/pytorch-r1.12-models/benchmarks#pytorch-use-cases). You can get performance benefits out-of-box by simply running scipts in the Model Zoo. diff --git a/docs/tutorials/features.rst b/docs/tutorials/features.rst index 4fdcebf4e..99eddaf9e 100644 --- a/docs/tutorials/features.rst +++ b/docs/tutorials/features.rst @@ -4,32 +4,39 @@ Features Ease-of-use Python API ---------------------- -Intel® Extension for PyTorch\* provides simple frontend Python APIs and utilities for users to get performance optimizations such as graph optimization and operator optimization with minor code changes. Typically, only two to three clauses are required to be added to the original code. +With only two or three clauses added to your original code, Intel® Extension for PyTorch\* provides simple frontend Python APIs and utilities to get performance optimizations such as graph optimization and operator optimization. -Please check `API Documentation `_ page for details of API functions. Examples are available in `Examples `_ page. +Check the `API Documentation`_ for details of API functions. `Examples `_ are also available. .. note:: - Please check the following table for package name of Intel® Extension for PyTorch\* from version to version when you do the package importing in Python scripts. + The package name used when you import Intel® Extension for PyTorch\* changed + from ``intel_pytorch_extension`` (for versions 1.2.0 through 1.9.0) to + ``intel_extension_for_pytorch`` (for versions 1.10.0 and later). Use the + correct package name depending on the version you are using. - .. list-table:: - :widths: auto - :align: center - :header-rows: 1 - - * - version - - package name - * - 1.2.0 ~ 1.9.0 - - intel_pytorch_extension - * - 1.10.0 ~ - - intel_extension_for_pytorch +Here are detailed discussions of specific feature topics, summarized in the rest +of this document: + +ISA Dynamic Dispatching +----------------------- + +Intel® Extension for PyTorch\* features dynamic dispatching functionality to automatically adapt execution binaries to the most advanced instruction set available on your machine. + +For more detailed information, check `ISA Dynamic Dispatching `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/isa_dynamic_dispatch Channels Last ------------- -Comparing to the default NCHW memory format, channels_last (NHWC) memory format could further accelerate convolutional neural networks. In Intel® Extension for PyTorch\*, NHWC memory format has been enabled for most key CPU operators, though not all of them have been merged to PyTorch master branch yet. They are expected to be fully landed in PyTorch upstream soon. 
+Compared with the default NCHW memory format, using channels_last (NHWC) memory format could further accelerate convolutional neural networks. In Intel® Extension for PyTorch\*, NHWC memory format has been enabled for most key CPU operators, though not all of them have been accepted and merged into the PyTorch master branch yet. -Check more detailed information for `Channels Last `_. +For more detailed information, check `Channels Last `_. .. toctree:: :hidden: @@ -37,14 +44,15 @@ Check more detailed information for `Channels Last `_. features/nhwc + Auto Mixed Precision (AMP) -------------------------- -Low precision data type BFloat16 has been natively supported on the 3rd Generation Xeon® Scalable Processors (aka Cooper Lake) with AVX512 instruction set and will be supported on the next generation of Intel® Xeon® Scalable Processors with Intel® Advanced Matrix Extensions (Intel® AMX) instruction set with further boosted performance. The support of Auto Mixed Precision (AMP) with BFloat16 for CPU and BFloat16 optimization of operators have been massively enabled in Intel® Extension for PyTorch\*, and partially upstreamed to PyTorch master branch. Most of these optimizations will be landed in PyTorch master through PRs that are being submitted and reviewed. +Low precision data type BFloat16 has been natively supported on 3rd Generation Xeon® Scalable Processors (aka Cooper Lake) with AVX512 instruction set. It will also be supported on the next generation of Intel® Xeon® Scalable Processors with Intel® Advanced Matrix Extensions (Intel® AMX) instruction set providing further boosted performance. The support of Auto Mixed Precision (AMP) with BFloat16 for CPU and BFloat16 optimization of operators has been enabled in Intel® Extension for PyTorch\*, and partially upstreamed to PyTorch master branch. These optimizations will be landed in PyTorch master through PRs that are being submitted and reviewed. -Check more detailed information for `Auto Mixed Precision (AMP) `_. +For more detailed information, check `Auto Mixed Precision (AMP) `_. -Bfloat16 computation can be conducted on platforms with AVX512 instruction set. On platforms with `AVX512 BFloat16 instruction `_, there will be further performance boost. +Bfloat16 computation can be conducted on platforms with AVX512 instruction set. On platforms with `AVX512 BFloat16 instruction `_, there will be an additional performance boost. .. toctree:: :hidden: @@ -55,9 +63,10 @@ Bfloat16 computation can be conducted on platforms with AVX512 instruction set. Graph Optimization ------------------ -To optimize performance further with torchscript, Intel® Extension for PyTorch\* supports fusion of frequently used operator patterns, like Conv2D+ReLU, Linear+ReLU, etc. The benefit of the fusions are delivered to users in a transparant fashion. +To further optimize TorchScript performance, Intel® Extension for PyTorch\* supports transparent fusion of frequently used operator patterns such as Conv2D+ReLU and Linear+ReLU. +For more detailed information, check `Graph Optimization `_. -Check more detailed information for `Graph Optimization `_. +Compared to eager mode, graph mode in PyTorch normally yields better performance from optimization methodologies such as operator fusion. Intel® Extension for PyTorch* provides further optimizations in graph mode. We recommend you take advantage of Intel® Extension for PyTorch* with `TorchScript `_. 
You may wish to run with the `torch.jit.trace()` function first, since it generally works better with Intel® Extension for PyTorch* than using the `torch.jit.script()` function. More detailed information can be found at the `pytorch.org website `_. .. toctree:: :hidden: @@ -68,7 +77,7 @@ Check more detailed information for `Graph Optimization `_ and `Optimizer Fusion `_. +For more detailed information, check `Optimizer Fusion `_ and `Split SGD `_ .. toctree:: :hidden: :maxdepth: 1 - features/split_sgd features/optimizer_fusion + features/split_sgd + -Runtime Extension (Experimental) +Runtime Extension -------------------------------- -Intel® Extension for PyTorch* Runtime Extension provides a couple of PyTorch frontend APIs for users to get finer-grained control of the thread runtime. It provides +Intel® Extension for PyTorch* Runtime Extension provides PyTorch frontend APIs for users to get finer-grained control of the thread runtime and provides: -1. Multi-stream inference via the Python frontend module MultiStreamModule. -2. Spawn asynchronous tasks from both Python and C++ frontend. -3. Configure core bindings for OpenMP threads from both Python and C++ frontend. +- Multi-stream inference via the Python frontend module MultiStreamModule. +- Spawn asynchronous tasks from both Python and C++ frontend. +- Program core bindings for OpenMP threads from both Python and C++ frontend. -Please **note**: Intel® Extension for PyTorch* Runtime extension is still in the **POC** stage. The API is subject to change. More detailed descriptions are available at `API Documentation page `_. +.. note:: Intel® Extension for PyTorch* Runtime extension is still in the experimental stage. The API is subject to change. More detailed descriptions are available in the `API Documentation `_. -Check more detailed information for `Runtime Extension `_. +For more detailed information, check `Runtime Extension `_. .. toctree:: :hidden: @@ -112,14 +122,12 @@ Check more detailed information for `Runtime Extension `_ fusion pass to deliver good performance. - -Check more detailed information for `INT8 `_. +Intel® Extension for PyTorch* has built-in quantization recipes to deliver good statistical accuracy for most popular DL workloads including CNN, NLP and recommendation models. -oneDNN provides an evaluation feature called `oneDNN Graph Compiler `_. Please refer to `oneDNN build instruction `_ to try this feature. You can find oneDNN in `third_party/llga`. +Check more detailed information for `INT8 Quantization `_. .. toctree:: :hidden: diff --git a/docs/tutorials/features/amp.md b/docs/tutorials/features/amp.md index 7fe5d1060..ff695363d 100644 --- a/docs/tutorials/features/amp.md +++ b/docs/tutorials/features/amp.md @@ -3,9 +3,9 @@ Auto Mixed Precision (AMP) ## Introduction -`torch.cpu.amp` provides convenience for auto data type conversion at runtime. Deep learning workloads could benefit from lower precision floating point data types like `torch.float16` or `torch.bfloat16`, because of its lighter calculation workload and less memory usage. However, because of the nature character of lower precision floating point data types, accuracy is sacrificed. Using lower precision floating point data types is a trade-off between accuracy and performance. Thus, some operations need to keep in `torch.float32`, while others can be converted to lower precision floating point data types. The Auto Mixed Precision (AMP) feature automates the tuning of data type conversions over all operators. 
+`torch.cpu.amp` provides convenience for auto data type conversion at runtime. Deep learning workloads can benefit from lower-precision floating point data types such as `torch.float16` or `torch.bfloat16`, because of its lighter calculation workload and smaller memory usage. Accuracy is sacrificed when using lower-precision floating point data types so there's a trade-off between accuracy and performance. Thus, some operations should use the slower but more accurate`torch.float32`, while others can be converted to use the faster but less accurate `torch.float16` data type. The Auto Mixed Precision (AMP) feature automates the tuning of data type conversions over all operators. -Currently, `torch.cpu.amp` only supports `torch.bfloat16`. It is the default lower precision floating point data type when `torch.cpu.amp` is enabled. `torch.cpu.amp` primarily benefits on Intel CPU with BFloat16 instruction set support. +`torch.cpu.amp` only supports `torch.bfloat16`. It is the default lower precision floating point data type when `torch.cpu.amp` is enabled. `torch.cpu.amp` primarily benefits when running on Intel CPU with BFloat16 instruction set support. ## Use Case @@ -32,7 +32,7 @@ y = model(x) ### Inference with Imperative Path -`torch.cpu.amp.autocast` is designed to be context managers that allow scopes of your script to run in mixed precision. In these scopes, operations run in a data type chosen by the `autocast` class to improve performance while maintaining accuracy. See the operations category section for details on what precision the `autocast` class chooses for each operator, and under what circumstances. +`torch.cpu.amp.autocast` is designed to be a context manager that allow scopes of your script to run with mixed precision. In these scopes, operations run in a data type chosen by the `autocast` class to improve performance while maintaining accuracy. See the operations category section for details on what precision the `autocast` class chooses for each operator, and under what circumstances. ``` model = SimpleNet().eval() @@ -72,10 +72,10 @@ for images, label in train_loader(): ### Op Eligibility -Ops that run in `float64` or non-floating-point dtypes are not eligible, and will run in these types whether or not autocast is enabled. +Ops that run in `float64` or non-floating-point dtypes are not eligible for mixed precision, and will run in these types whether or not autocast is enabled. -Only out-of-place ops and Tensor methods are eligible. In-place variants and calls that explicitly supply an `out=...` Tensor -are allowed in autocast-enabled regions, but won't go through autocasting. For example, in an autocast-enabled region `a.addmm(b, c)` can autocast, but `a.addmm_(b, c)` and `a.addmm(b, c, out=d)` cannot. For best performance and stability, prefer out-of-place ops in autocast-enabled regions. +Only out-of-place ops and Tensor methods are eligible for mixed precision. In-place variants and calls that explicitly supply an `out=...` Tensor +are allowed in autocast-enabled regions, but won't go through autocasting. For example, in an autocast-enabled region `a.addmm(b, c)` can autocast, but `a.addmm_(b, c)` and `a.addmm(b, c, out=d)` cannot. For best performance and stability, use out-of-place ops in autocast-enabled regions. ### Op-Specific Behavior @@ -83,15 +83,15 @@ The following lists describe the behavior of eligible ops in autocast-enabled re Ops not listed below do not go through autocasting. They run in the type defined by their inputs. 
However, autocasting may still change the type in which unlisted ops run if they're downstream from autocasted ops. -If an op is unlisted, we assume it's numerically stable in `bfloat16`. If you believe an unlisted op is numerically unstable in `bfloat16`, please file an issue. +If an op is unlisted, we assume it's numerically stable in `bfloat16`. If you believe that an unlisted op is numerically unstable in `bfloat16`, file a [GitHub issue](https://github.com/intel/intel-extension-for-pytorch/issues). #### Ops that can autocast to `bfloat16` -`conv1d`, `conv2d`, `conv3d`, `bmm`, `mm`, `baddbmm`, `addmm`, `addbmm`, `conv_transpose1d`, `conv_transpose2d`, `conv_transpose3d`, `linear`, `matmul`, `conv_tbc`, `group_norm` +`conv1d`, `conv2d`, `conv3d`, `bmm`, `mm`, `baddbmm`, `addmm`, `addbmm`, `linear`, `matmul`, `conv_tbc`, `group_norm`, `_native_multi_head_attention` #### Ops that can autocast to `float32` -`mish`, `avg_pool3d`, `binary_cross_entropy`, `grid_sampler`, `polar`, `fmod`, `prod`, `quantile`, `nanquantile`, `stft`, `cdist`, `trace`, `view_as_complex`, `cholesky`, `cholesky_inverse`, `cholesky_solve`, `inverse`, `lu_solve`, `matrix_rank`, `orgqr`, `ormqr`, `pinverse`, `max_pool3d`, `max_unpool2d`, `max_unpool3d`, `adaptive_avg_pool3d`, `reflection_pad1d`, `reflection_pad2d`, `replication_pad1d`, `replication_pad2d`, `replication_pad3d`, `mse_loss`, `ctc_loss`, `kl_div`, `multilabel_margin_loss`, `fft_fft`, `fft_ifft`, `fft_fft2`, `fft_ifft2`, `fft_fftn`, `fft_ifftn`, `fft_rfft`, `fft_irfft`, `fft_rfft2`, `fft_irfft2`, `fft_rfftn`, `fft_irfftn`, `fft_hfft`, `fft_ihfft`, `linalg_matrix_norm`, `linalg_cond`, `linalg_matrix_rank`, `linalg_solve`, `linalg_cholesky`, `linalg_svdvals`, `linalg_eigvals`, `linalg_eigvalsh`, `linalg_inv`, `linalg_householder_product`, `linalg_tensorinv`, `linalg_tensorsolve`, `fake_quantize_per_tensor_affine`, `eig`, `geqrf`, `lstsq`, `_lu_with_info`, `qr`, `svd`, `symeig`, `triangular_solve`, `fractional_max_pool2d`, `fractional_max_pool3d`, `adaptive_max_pool3d`, `multilabel_margin_loss_forward`, `linalg_qr`, `linalg_cholesky_ex`, `linalg_svd`, `linalg_eig`, `linalg_eigh`, `linalg_lstsq`, `linalg_inv_ex` +`conv_transpose1d`, `conv_transpose2d`, `conv_transpose3d`, `mish`, `avg_pool3d`, `binary_cross_entropy`, `grid_sampler`, `polar`, `prod`, `quantile`, `nanquantile`, `stft`, `cdist`, `trace`, `view_as_complex`, `cholesky`, `cholesky_inverse`, `cholesky_solve`, `inverse`, `lu_solve`, `matrix_rank`, `orgqr`, `ormqr`, `pinverse`, `max_pool3d`, `max_unpool2d`, `max_unpool3d`, `adaptive_avg_pool3d`, `reflection_pad1d`, `reflection_pad2d`, `replication_pad1d`, `replication_pad2d`, `replication_pad3d`, `mse_loss`, `ctc_loss`, `kl_div`, `multilabel_margin_loss`, `fft_fft`, `fft_ifft`, `fft_fft2`, `fft_ifft2`, `fft_fftn`, `fft_ifftn`, `fft_rfft`, `fft_irfft`, `fft_rfft2`, `fft_irfft2`, `fft_rfftn`, `fft_irfftn`, `fft_hfft`, `fft_ihfft`, `linalg_matrix_norm`, `linalg_cond`, `linalg_matrix_rank`, `linalg_solve`, `linalg_cholesky`, `linalg_svdvals`, `linalg_eigvals`, `linalg_eigvalsh`, `linalg_inv`, `linalg_householder_product`, `linalg_tensorinv`, `linalg_tensorsolve`, `fake_quantize_per_tensor_affine`, `eig`, `geqrf`, `lstsq`, `_lu_with_info`, `qr`, `svd`, `symeig`, `triangular_solve`, `fractional_max_pool2d`, `fractional_max_pool3d`, `adaptive_max_pool3d`, `multilabel_margin_loss_forward`, `linalg_qr`, `linalg_cholesky_ex`, `linalg_svd`, `linalg_eig`, `linalg_eigh`, `linalg_lstsq`, `linalg_inv_ex` #### Ops that promote to the widest input type @@ -100,20 +100,3 
@@ These ops don't require a particular dtype for stability, but take multiple inpu `cat`, `stack`, `index_copy` Some ops not listed here (e.g., binary ops like `add`) natively promote inputs without autocasting's intervention. If inputs are a mixture of `bfloat16` and `float32`, these ops run in `float32` and produce `float32` output, regardless of whether autocast is enabled. - -## Design Details - -### Frontend API Design - -`torch.cpu.amp` is designed to be context managers that allow scopes of your script to run in mixed precision. It takes input parameter `dtype`, which is `torch.bfloat16` by default. - -### Dedicated Dispatch Key - -`torch.cpu.amp` extends the design of the original pytorch `Auto Mixed Precision` using the dedicated dispatch key of `AutocastCPU`. Each tensor during creation will have an `Autocast` Dispatchkey corresponding to the device (`CUDA` or `CPU`). Thus, for every tensor on CPU, `AutocastCPU` exists along with the tensor. During the dispatch phase, operators with input tensors of `AutocastCPU` are dispatched to the `Autocast` layers. The `Autocast` layer decides what precision to chooses for each operator. `AutocastCPU` has higher dispatch priority comparing to `Autograd` which makes sure the `Autocast` layer runs before `Autograd`. - -### Operations category - -The operations are generally divided into 3 major categories and registered under Dispatch Key `AutocastCPU`: -* `lower_precision_fp` category: Computation bound operators that could get performance boost with BFloat16 data type through acceleration by Intel CPU BFloat16 instruction set. Inputs of them are casted into `torch.bfloat16` before execution. `convolutions` and `linear` are examples of this category. -* `fallthrough` category: Operators that support running with both Float32 and BFloat16 data types, but could not get performance boost with BFloat16 data type. `relu` and `max_pool2d` are examples of this category. -* `fp32` category: Operators that are not enabled with BFloat16 support yet. Inputs of them are casted into `float32` before execution. `max_pool3d` and `group_norm` are examples of this category. diff --git a/docs/tutorials/features/graph_optimization.md b/docs/tutorials/features/graph_optimization.md index 194491821..6bb0f3d73 100644 --- a/docs/tutorials/features/graph_optimization.md +++ b/docs/tutorials/features/graph_optimization.md @@ -1,89 +1,12 @@ Graph Optimization ================== -Most Deep Learning models could be described as DAG(directed acyclic graph). Therefore, how to optimize a deep learning model from graph perspective is a nature thinking. Compared to the operator optimization and algorithm optimization, the graph optimization is at more high level. It convers not only the graph self but also the runtime. From the operator perspective, the graph optimization contains the operator fusing, the constant folding. From the runtime perspective, the graph optimization contains the operator scheduling, the computation resources management, the memory mangement. - -Currently, the Intel Extension for PyTorch focuses on the operator related graph optimizations. Regarding the runtime related optimization, the extension also provides some experiment features. Please refer to the runtime extension for more details about runtime optimization. 
- - -## Fusion -### FP32 and BF16 fusion patterns -- Conv2D + ReLU -- Conv2D + SUM -- Conv2D + SUM + ReLU -- Conv2D + Sigmoid -- Conv2D + Sigmoid + MUL -- Conv2D + HardTanh -- Conv2D + SiLU -- Conv2D + ELU -- Conv3D + ReLU -- Conv3D + SUM -- Conv3D + SUM + ReLU -- Conv3D + SiLU -- Linear + ReLU -- Linear + GELU -- Add + LayerNorm -- Div + Add + Softmax -- Linear + Linear + Linear -- View + Transpose + Contiguous + View - -### INT8 fusion patterns -The `ipex.quantization.convert(model, conf, inputs)` API will convert an FP32 `torch.nn.Module` to a quantized JIT ScriptModule according to the given quantization recipes. - -For example, for a FP32 model of one single convolution, the graph before and after conversion will be: -![image](../../../images/graph_optimization/int8_pattern.png) - -The oneDNN graph backend will select `dequantize` and `convolution` into one partition. During execution, this partition will execute a convolution with int8 as input and fp32 as output. - -Here listed all the currently supported int8 patterns in Intel® Extension for PyTorch\* using oneDNN graph backend: -1. Patterns with int8 as input and fp32 as output: -- dequant -> conv -- dequant -> linear -- dequant -> conv -> relu -- dequant -> conv -> sum -- dequant -> conv -> sum -> relu -- dequant -> linear -> relu -- dequant -> linear -> gelu -- dequant -> linear -> sigmoid -- dequant -> linear -> sum -- dequant -> bmm -- dequant -> bmm -> div - -2. Patterns with int8 as input and int8 as output: -- dequant -> conv -> quant -- dequant -> linear -> quant -- dequant -> conv -> relu -> quant -- dequant -> conv -> sum -> dequant -- dequant -> conv -> sum -> relu -> quant -- dequant -> linear -> relu -> quant -- dequant -> linear -> gelu -> quant -- dequant -> linear -> sigmoid -> quant -- dequant -> linear -> sum -> quant -- dequant -> bmm -> quant -- dequant -> bmm -> div -> quant -- dequant -> max_pool2d -> quant - - -## Folding -Stock PyTorch has provided the constant propagation and BatchNormalization folding. And these optimizations will be automatically applied to the jit model by invoking `torch.jit.freeze`. Take the Resnet50 as the example: -``` -import torch -import torchvision.models as models -model = models.__dict__["resnet50 "](pretrained=True) -model.eval() -x = torch.randn(args.batch_size, 3, 224, 224) -with torch.no_grad(): - model = torch.jit.trace(model, x, check_trace=False).eval() - # Fold the BatchNormalization and propagate constant - torch.jit.freeze(model) - # Print the graph - print(model.graph_for(x)) -``` -If the model owner does not invoke the `torch.jit.freeze`, the `BatchNormalization` still exists on the graph. Otheriwse, the `BatchNormalization` will be folded on the graph to save the compuation and then improve the performance. Please refer to the [Constant Folding Wikipedia page](https://en.wikipedia.org/wiki/Constant_folding) for more details. +Most Deep Learning models could be described as a DAG (directed acyclic graph). Optimizing a deep learning model from a graph perspective is straight forward. Compared to the operator optimization and algorithm optimization, the graph optimization is at a higher level. It covers not only the graph but also the runtime. From the operator perspective, the graph optimization contains the operator fusing and constant folding. From the runtime perspective, the graph optimization contains the operator scheduling, computation resources management, and memory management. 
+The Intel® Extension for PyTorch\* focuses on operator related graph optimizations. The extension also provides some experimental features for the related runtime optimizations. Refer to the runtime extension for more details about runtime optimization. ## Ease-of-use graph optimization API -The graph optimizations of Intel® Extension for PyTorch\* are enabled by default. Users could disable it by calling: +The graph optimizations of Intel® Extension for PyTorch\* are enabled by default. Users can disable it by calling: ``` ipex.enable_onednn_fusion(False) ``` @@ -110,7 +33,7 @@ with torch.no_grad(): # Print the graph print(model.graph_for(x)) ``` -Compared the original code, the model launcher just needs to add few lines of code, the extension will automatically acceletate the model. Regarding the RN50, the extension will automatically fuse the Conv + ReLU and Conv + Sum + ReLU as ConvReLU and ConvSumReLU. If you check the output of `graph_for`, you will observe the fused operators. +Compared to the original code, the model launcher needs to add a few lines of code and the extension will automatically accelerate the model. Regarding the RN50, the extension will automatically fuse the Conv + ReLU and Conv + Sum + ReLU as ConvReLU and ConvSumReLU. If you check the output of `graph_for`, you will observe the fused operators. ### INT8 models ``` @@ -118,27 +41,151 @@ import torch import intel_extension_for_pytorch as ipex -# First-time quantization flow -# define the model -def MyModel(torch.nn.Module): - … +# First-time quantization flow +# define the model +def MyModel(torch.nn.Module): + ... -# construct the model -model = MyModel(…) -conf = ipex.QuantConf(dtype=torch.int8) -model, conf = ipex.quantization.prepare(model, conf) -for images in calibration_data_loader(): - with ipex.quantization.calibrate(conf): # here, conf is in/out, populated with observed statistics - model(images) -conf.save(‘int8_conf.json’, default_recipe=True) # optional: save the configuration for later use -model = ipex.quantization.convert(model, conf, sample_image) +# construct the model +model = MyModel(...) +qconfig = ipex.quantization.default_static_qconfig +model.eval() +example_inputs = .. 
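+# Illustration only (not part of the original snippet): `example_inputs` should be
+# a representative input (or tuple of inputs) for MyModel, e.g. for an image model:
+# example_inputs = torch.randn(1, 3, 224, 224)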
+prepared_model = prepare(user_model, qconfig, example_inputs=example_inputs, inplace=False) +with torch.no_grad(): + for images in calibration_data_loader(): + prepared_model(images) -# run the model -output = model(images) +convert_model = convert(prepared_model) +with torch.no_grad(): + traced_model = torch.jit.trace(convert_model, example_input) + traced_model = torch.jit.freeze(traced_model) -# Deployment +traced_model.save("quantized_model.pt") +# Deployment import intel_extension_for_pytorch as ipex -conf = ipex.QuantConf(‘int8_conf.json’) -model = ipex.quantization.convert(model, conf, sample_image) -output = model(images) +quantized_model = torch.jit.load("quantized_model.pt") +quantized_model = torch.jit.freeze(quantized_model.eval()) +with torch.no_grad(): + output = quantized_model(images) +``` + +## Methodology +### Fusion +#### FP32 and BF16 fusion patterns +- Conv1D/Conv2D/Conv3D/Linear/ConvTranspose2D/ConvTranspose3D + Abs/Clamp/Elu/Exp/GELU/HardTanh/HardSwish/Log/Mish/Sigmoid/Pow/ReLU/Round/Sqrt/Square/Tanh/Leaky_ReLU/SiLU +- Conv1D/Conv2D/Conv3D/Linear/ConvTranspose2D/ConvTranspose3D + Sigmoid + MUL +- Conv1D/Conv2D/Conv3D/Linear + SUM +- Conv1D/Conv2D/Conv3D + SUM + ReLU +- Add + LayerNorm +- Div + Add + Softmax +- Linear + Linear + Linear +- View + Transpose + Contiguous + View + +#### INT8 fusion patterns +The `ipex.quantization.convert(model, conf, inputs)` API will convert an FP32 `torch.nn.Module` to a quantized JIT ScriptModule according to the given quantization recipes. + +For example, for a FP32 model of one single convolution, the graph before and after conversion will be: +![image](../../../images/graph_optimization/int8_pattern.png) + +The oneDNN graph backend will select `dequantize` and `convolution` into one partition. During execution, this partition will execute a convolution with int8 as input and fp32 as output. + +Here listed all the currently supported int8 patterns in Intel® Extension for PyTorch\* using oneDNN graph backend: + +1. Conv/Linear/Matmul related fusion patterns + ``` + | + [Quantize]* + | | + Dequantize Dequantize + \ / + Conv1D/Conv2D/Conv3D/Linear/MatMul + | + [Abs/Elu/GELU/HardTanh/Leaky_ReLU/Sigmoid/ + ReLU/Sqrt/Square/Tanh/[Dequantize+Add]*[0,1] ]*[0,3] + | + [Quantize]* + | + ``` + + ``` + | | + Dequantize Dequantize + \___ ___/ + MatMul + \ / + Divide + \ / + [Add]* + | + ``` + +2. Non-Conv/Linear/Matmul related fusion patterns + ``` + | + Dequantize + | + MaxPool2D + | + Quantize + ``` +3. INT8-BF16 mixed-precision fusion patterns + ``` + | | + Dequantize Dequantize + | | + To To + \___ ___/ + MatMul + \ / + [Divide]* + \ / + [Add]* + | + ``` + + ``` + | | + Dequantize Dequantize + | | + To To + \___ ___/ + MatMul + | + [GeLU]* + | + To + | + Quantize + | + ``` + + ``` + | | + Dequantize Dequantize + | | + To To Dequantize + \___ ___/ | + MatMul To + \_____ ___/ + [Add]* + | + ``` + + +### Folding +Stock PyTorch provids constant propagation and BatchNormalization folding. These optimizations are automatically applied to the jit model by invoking `torch.jit.freeze`. 
Take the Resnet50 as an example:
+```
+import torch
+import torchvision.models as models
+model = models.__dict__["resnet50"](pretrained=True)
+model.eval()
+x = torch.randn(args.batch_size, 3, 224, 224)
+with torch.no_grad():
+    model = torch.jit.trace(model, x, check_trace=False).eval()
+    # Fold the BatchNormalization and propagate constant
+    model = torch.jit.freeze(model)
+    # Print the graph
+    print(model.graph_for(x))
+```
+If the model owner does not invoke `torch.jit.freeze`, the `BatchNormalization` still exists on the graph. Otherwise, the `BatchNormalization` will be folded on the graph to save the computation and improve performance. Refer to the [Constant Folding Wikipedia page](https://en.wikipedia.org/wiki/Constant_folding) for more details.
diff --git a/docs/tutorials/features/int8.md b/docs/tutorials/features/int8.md
index f6193ca15..0ca36ed56 100644
--- a/docs/tutorials/features/int8.md
+++ b/docs/tutorials/features/int8.md
@@ -1,196 +1,146 @@
-Intel® Extension for PyTorch\* optimizations for quantization (Experimental)
-============================================================================
+Intel® Extension for PyTorch\* optimizations for quantization
+=============================================================
 
-The quantization functionality in Intel® Extension for PyTorch\* currently only supports post-training static quantization. This tutorial introduces how the static quantization works in the Intel® Extension for PyTorch\* side.
+The quantization functionality in Intel® Extension for PyTorch\* currently only supports post-training quantization. This tutorial introduces how quantization works on the Intel® Extension for PyTorch\* side.
 
-Suppose there is a model as below:
+We reuse PyTorch quantization components as much as possible, such as the PyTorch [Observer method](https://pytorch.org/docs/1.11/quantization-support.html#torch-quantization-observer). To make the quantization API easy for PyTorch users, the API in Intel® Extension for PyTorch\* is very similar to the one in PyTorch. Intel® Extension for PyTorch\* quantization supports a default recipe that automatically decides which operators should be quantized or not. This brings a satisfactory performance and accuracy tradeoff.
 
-```
-import torch
-import torch.nn as nn
-import intel_extension_for_pytorch as ipex
+## Static Quantization
 
-class MyModel(nn.Module):
-    def __init__(self):
-        super(MyModel, self).__init__()
-        self.conv = nn.Conv2d(10, 10, 3)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return x
-
-model = MyModel().eval()
-
-# user dataset for calibration.
-xx_c = [torch.randn(1, 10, 28, 28) for i in range(2))
-# user dataset for validation.
-xx_v = [torch.randn(1, 10, 28, 28) for i in range(20))
+```python
+import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
 ```
 
-## Calibration Step
-
-Similar to the steps at PyTorch side, the first step is to perform calibration step to collect distributions of different activations. The distributions is then used to divide the entire range of activations into 256 levels.
+### Define qconfig
 
-At first, we need to define the quantization configuration determining which quantization scheme to be used for activation. Two values are supported: ``torch.per_tensor_affine`` and ``torch.per_tensor_symmetric``. The default qscheme is ``torch.per_tensor_affine``.
+Using the default qconfig(recommended): -``` -conf = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine) +```python +qconfig = ipex.quantization.default_static_qconfig +# equal to +# QConfig(activation=HistogramObserver.with_args(reduce_range=False), +# weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) ``` -then perform calibration using the calibration dataset: +or define your own qconfig as: +```python +from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig +qconfig = QConfig(activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8), + weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) ``` -with torch.no_grad(): - for x in xx_c: - with ipex.quantization.calibrate(conf): - y = model(x) -conf.save('configure.json') -``` +Note: we fully use PyTorch [observer methonds](https://pytorch.org/docs/stable/quantization-support.html#torch-quantization-observer), so you can use a different PyTorch obsever methond to define the [QConfig](https://pytorch.org/docs/1.11/generated/torch.quantization.qconfig.QConfig.html). For weight observer, we only support **torch.qint8** dtype now. -In the last line, a ``.json`` file is saved. The file contains info of quantization, such as observer algorithm, activations, and weights scales: - -```json -[ - { - "id": 0, - "name": "conv2d", - "algorithm": "min_max", - "weight_granularity": "per_channel", - "input_scales": [ - 0.02742583677172661 - ], - "input_zero_points": [ - 125 - ], - "output_scales": [ - 0.01582648977637291 - ], - "output_zero_points": [ - 120 - ], - "weight_scales": [ - [ - 0.0008243077900260687, - 0.0008239267044700682, - 0.0008076696540229023, - 0.000826483650598675, - 0.0008274353458546102, - 0.0008290993282571435, - 0.0007878943579271436, - 0.0008173943497240543, - 0.0008244941127486527, - 0.0008231988758780062 - ] - ], - "input_quantized_dtypes": [ - "uint8" - ], - "output_quantized_dtypes": [ - "uint8" - ], - "inputs_quantized": [ - true - ], - "outputs_quantized": [ - false - ], - "inputs_flow": [ - "conv2d0.input0" - ], - "outputs_flow": [ - "conv2d0.output0" - ] - } -] -``` +**Suggestion**: -Description of the json file can be found at [conf.py](https://github.com/intel/intel-extension-for-pytorch/blob/master/intel_extension_for_pytorch/quantization/conf.py). +1. For activation observer, if using **qscheme** as **torch.per_tensor_affine**, **torch.quint8** is preferred. If using **qscheme** as **torch.per_tensor_symmetric**, **torch.qint8** is preferred. For weight observer, setting **qscheme** to **torch.per_channel_symmetric** can get a better accuracy. +2. If your CPU device doesn't support VNNI, seting the observer's **reduce_range** to **True** can get a better accuracy, such as skylake. -## Model Conversion +### Prepare Model and Do Calibration -After doing calibration steps, distributions of activations and weights are collected. The model can be converted to a quantized model with these info. Quantization in Intel® Extension for PyTorch\* takes advantage of [oneDNN graph API](https://spec.oneapi.io/onednn-graph/latest/introduction.html). This requires to be executed with TorchScript graph, thus, we need to convert the eager model to Torchscript model: +```python +# prepare model, do conv+bn folding, and init model quant_state. +user_model = ... +user_model.eval() +example_inputs = .. 
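+# Illustration only (not part of the original snippet): `example_inputs` should be a
+# tensor, or tuple of tensors, with the same shape as the real inference inputs, e.g.
+# example_inputs = torch.randn(1, 3, 224, 224)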
+prepared_model = prepare(user_model, qconfig, example_inputs=example_inputs, inplace=False) -``` -conf = ipex.quantization.QuantConf('configure.json') +for x in calibration_data_set: + prepared_model(x) -with torch.no_grad(): - trace_model = ipex.quantization.convert(model, conf, example_input) +# Optional, if you want to tuning(performance or accuracy), you can save the qparams as json file which +# including the quantization state, such as scales, zero points and inference dtype. +# And then you can achange the json file's settings, loading the changed json file +# to model which will override the model's original quantization's settings. +# +# prepared_model.save_qconf_summary(qconf_summary = "configure.json") +# prepared_model.load_qconf_summary(qconf_summary = "configure.json") ``` -This step inserts some quantizer(``aten::quantize_per_tensor`` or ``aten::dequantize``) in the model. Meanwhile, [oneDNN graph API](https://spec.oneapi.io/onednn-graph/latest/introduction.html) will do graph optimization to replace some quantization pattens with quantization operators. More details can be found at [graph_optimization.md](./graph_optimization.md). - -## Evaluate - -After doing model conversion, we can do the evaluation step with your dataset by using the converted model: +### Convert to Static Quantized Model and Deploy -``` +```python +# make sure the example_inputs's size is same as the real input's size +convert_model = convert(prepared_model) with torch.no_grad(): - for x in xx_v: - y = trace_model(x) -``` - -## Deploy the Converted Model + traced_model = torch.jit.trace(convert_model, example_input) + traced_model = torch.jit.freeze(traced_model) +# for inference +y = traced_model(x) -If you want to deploy your model on another device, you need to save the converted model: +# or save the model to deploy -``` - trace_model.save('quantization_model.pt') +# traced_model.save("quantized_model.pt") +# quantized_model = torch.jit.load("quantized_model.pt") +# quantized_model = torch.jit.freeze(quantized_model.eval()) +# ... ``` -and then load the saved model on your target device: +## Dynamic Quantization -``` +```python import intel_extension_for_pytorch as ipex -loaded = torch.jit.load('quantization_model.pt') -# running the model using your dataset +from intel_extension_for_pytorch.quantization import prepare, convert ``` -## Additional context - -### Integration with oneDNN graph API -The quantization in Intel® Extension for PyTorch\* integrates [oneDNN graph API](https://spec.oneapi.io/onednn-graph/latest/introduction.html) with TorchScript graph of PyTorch. - -The integration is mainly composed of the Graph Optimization part and the Graph Executor part: - -#### Graph Optimization -We have registered quantization-related optimization passes in the Custom Pre-passes set of PyTorch: - -1. Alias and mutation reduction - - Operators of oneDNN graph are pure functional, while PyTorch has operators in in-place forms or create views for buffer sharing. - Due to the semantic gaps between the backend operators and the PyTorch operators, we have a pass to reduce mutation with best effort at the beginning. +### Define QConfig -2. Graph passing +Using the default qconfig(recommended): - With a PyTorch TorchScript graph, the integration maps PyTorch operators in the graph to the corresponding backend operators to form a backend graph. - -3. Partitioning - - The backend selects regions to be fused in the graph and return a list of partitions. Each partition corresponds to a fusion operator. - -4. 
Graph rewriting +```python +dynamic_qconfig = ipex.quantization.default_dynamic_qconfig +# equal to +# QConfig(activation=PlaceholderObserver.with_args(dtype=torch.float, compute_dtype=torch.quint8), +# weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) +``` - The original PyTorch graph will be re-written based on the partitions returned from the backend. The operators in one partition will be grouped together to form a JIT operator. +or define your own qconfig as: -The below diagram demonstrates the process of `Graph passing - Partitioning - Graph rewriting`: +```python +from torch.ao.quantization import MinMaxObserver, PlaceholderObserver, QConfig +dynamic_qconfig = QConfig(activation = PlaceholderObserver.with_args(dtype=torch.float, compute_dtype=torch.quint8), + weight = MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)) +``` -![image](../../../images/int8/integration_diagram.PNG) +Note: For weight observer, it only supports dtype **torch.qint8**, and the qscheme can only be **torch.per_tensor_symmetric** or **torch.per_channel_symmetric**. For activation observer, it only supports dtype **torch.float**, and the **compute_dtype** can be **torch.quint8** or **torch.qint8**. +**Suggestion**: -5. Layout propagation +1. For weight observer, setting **qscheme** to **torch.per_channel_symmetric** can get a better accuracy. +2. If your CPU device doesn't support VNNI, seeting the observer's **reduce_range** to **True** can get a better accuracy, such as skylake. - This pass is to eliminate unnecessary layout conversions at boundaries. We set different formats to the output of a partition so that the backend could perform layout conversion internally. When `ANY` is set, the layout at boundaries will be fully decided by the backend. Otherwise, the backend should follow the layout set by the Framework. - -![image](../../../images/int8/layout_propagation.png) +### Prepare Model -#### Graph Executor -During runtime execution of a PyTorch TorchScript graph, oneDNN graph partition will be dispatched to the oneDNN graph JIT variadic Operator. +```python +prepared_model = prepare(user_model, dynamic_qconfig, example_inputs=example_inputs) +``` -Inside the oneDNN graph JIT Op, input PyTorch tensors of each partition will be mapped to oneDNN graph tensors. The partition will then be [compiled](https://spec.oneapi.io/onednn-graph/latest/programming_model.html#partition) and [executed](https://spec.oneapi.io/onednn-graph/latest/programming_model.html#compiled-partition). The output oneDNN graph tensor will be mapped back to PyTorch tensors to be fed to the next operator on the TorchScript graph. +## Convert to Dynamic Quantized Model and Deploy + +```python +# make sure the example_inputs's size is same as the real input's size +convert_model = convert(prepared_model) +# Optional: convert the model to traced model +#with torch.no_grad(): +# traced_model = torch.jit.trace(convert_model, example_input) +# traced_model = torch.jit.freeze(traced_model) + +# or save the model to deploy +# traced_model.save("quantized_model.pt") +# quantized_model = torch.jit.load("quantized_model.pt") +# quantized_model = torch.jit.freeze(quantized_model.eval()) +# ... +# for inference +y = convert_model(x) +``` -### Limitations -#### Support for dynamic shapes -The support for dynamic shapes in Intel® Extension for PyTorch\* int8 integration is still working in progress. 
+Note: we only support the following ops to do dynamic quantization: -For the use cases where the input shapes are dynamic, for example inputs of variable image sizes in an object detection task or of variable sequence lengths in NLP tasks, the Intel® Extension for PyTorch\* int8 path may slow down the model inference. +- torch.nn.Linear +- torch.nn.LSTM +- torch.nn.GRU +- torch.nn.LSTMCell +- torch.nn.RNNCell +- torch.nn.GRUCell diff --git a/docs/tutorials/features/isa_dynamic_dispatch.md b/docs/tutorials/features/isa_dynamic_dispatch.md new file mode 100644 index 000000000..5e6aacc81 --- /dev/null +++ b/docs/tutorials/features/isa_dynamic_dispatch.md @@ -0,0 +1,122 @@ +ISA Dynamic Dispatching +======================= + +This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch\* (Intel® Extension for PyTorch\*) based on CPU ISA. It is an extension to the similar mechanism in PyTorch. + +## Overview + +Forked from PyTorch, Intel® Extension for PyTorch\* adds additional CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16` and `AMX`. + +PyTorch & Intel® Extension for PyTorch\* CPU ISA support statement: + + | | DEFAULT | AVX2 | AVX2_VNNI | AVX512 | AVX512_VNNI | AVX512_BF16 | AMX | + | ---- | :----: | :----: | :----: | :----: | :----: | :----: | :----: | + | PyTorch | ✔ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | + | Intel® Extension for PyTorch\* 1.11 | ✘ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | + | Intel® Extension for PyTorch\* 1.12 | ✘ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | + +\* `DEFAULT` in Intel® Extension for PyTorch\* 1.12 implies `AVX2`. + +### CPU ISA build compiler requirement + + | ISA Level | GCC requirement | + | ---- | :----: | + | AVX2 | Any | + | AVX512 | GCC 9.2+ | + | AVX512_VNNI | GCC 9.2+ | + | AVX512_BF16 | GCC 10.3+ | + | AVX2_VNNI | GCC 11.2+ | + | AMX | GCC 11.2+ | + +\* Check with `cmake/Modules/FindAVX.cmake` for detailed compiler checks. + +## Select ISA Level + +By default, Intel® Extension for PyTorch\* dispatches to kernels with the maximum ISA level supported on the underlying CPU hardware. This ISA level can be overridden by an environment variable `ATEN_CPU_CAPABILITY` (same environment variable as PyTorch). Available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`}. The effective ISA level would be the minimal level between `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware. + +### Example: + +```bash +$ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())' +AMX +$ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())' +AVX2 +``` +>**Note:** +> +>`core._get_current_isa_level()` is an Intel® Extension for PyTorch\* internal function used for checking the current effective ISA level. It is used for debugging purpose only and subject to change. + +## CPU feature check + +An addtional CPU feature check tool in the subfolder: `tests/cpu/isa` + +```bash +$ cmake . 
+-- The C compiler identification is GNU 11.2.1 +-- The CXX compiler identification is GNU 11.2.1 +-- Detecting C compiler ABI info +-- Detecting C compiler ABI info - done +-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/cc - skipped +-- Detecting C compile features +-- Detecting C compile features - done +-- Detecting CXX compiler ABI info +-- Detecting CXX compiler ABI info - done +-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/c++ - skipped +-- Detecting CXX compile features +-- Detecting CXX compile features - done +-- Configuring done +-- Generating done +-- Build files have been written to: tests/cpu/isa + +$ make +[ 33%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature.cpp.o +[ 66%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature_main.cpp.o +[100%] Linking CXX executable cpu_features +[100%] Built target cpu_features + +$ ./cpu_features +XCR0: 00000000000602e7 +os --> avx: true +os --> avx2: true +os --> avx512: true +os --> amx: true +mmx: true +sse: true +sse2: true +sse3: true +ssse3: true +sse4_1: true +sse4_2: true +aes_ni: true +sha: true +xsave: true +fma: true +f16c: true +avx: true +avx2: true +avx_vnni: true +avx512_f: true +avx512_cd: true +avx512_pf: false +avx512_er: false +avx512_vl: true +avx512_bw: true +avx512_dq: true +avx512_ifma: true +avx512_vbmi: true +avx512_vpopcntdq: true +avx512_4fmaps: false +avx512_4vnniw: false +avx512_vbmi2: true +avx512_vpclmul: true +avx512_vnni: true +avx512_bitalg: true +avx512_fp16: true +avx512_bf16: true +avx512_vp2intersect: true +amx_bf16: true +amx_tile: true +amx_int8: true +prefetchw: true +prefetchwt1: false +``` diff --git a/docs/tutorials/features/nhwc.md b/docs/tutorials/features/nhwc.md index f1a70c914..a85d26df3 100644 --- a/docs/tutorials/features/nhwc.md +++ b/docs/tutorials/features/nhwc.md @@ -3,23 +3,23 @@ Channels Last ## What is Channels Last -**NB**: **Memory format** refers to data representation that describes how multidimensional arrays (nD) are stored in linear (1D) memory address space. **Memory format** has the same semantic with **layout** in oneDNN. **Layout** in PyTorch has other semantic ofdescribing **dense** or **sparse** with the attributes: 'torch.strided', 'torch.sparse_coo'. +**Note**: In PyTorch, **memory format** refers to data representation that describes how multidimensional arrays (nD) are stored in linear (1D) memory address space. **Memory format** has the same semantic meaning as **layout** in oneDNN. **Layout** in PyTorch has other semantic of describing **dense** or **sparse** with the attributes: 'torch.strided', 'torch.sparse_coo'. -On CNN models, the canonical order of tensor dimensions are assigned with semantic meaning. For example the input tensor of 2D convolution is of NCHW by default on PyTorch - . NHWC is an alternative way of describing the tensor dimensions - . +On CNN models, the canonical order of tensor dimensions is assigned with semantic meaning. For example the input tensor of 2D convolution is of NCHW by default on PyTorch - . NHWC is an alternative way of describing the tensor dimensions - . -Take a look at the following image of illustrating NCHW and NHWC when N=1. Actually when N=1, NHWC has the same format with BMP file image. +Look at the following image of illustrating NCHW and NHWC when N=1. Actually when N=1, NHWC has the same format with BMP file image. 
![fig-1-memory-layout](../../../images/channels_last/figure1_memory_layout.png) -PyTorch refers NCHW as `torch.contiguous_format` which is the default memory format and NHWC as `torch.channels_last` which is an new feature from 1.5 release. +PyTorch refers to NCHW as `torch.contiguous_format` (the default memory format) and to NHWC as `torch.channels_last`, which is a new feature as of the 1.5 release. -TensorFlow takes NHWC as the default memory format and from the performance point of view NHWC has advantage over NCHW. On CPU platforms, we propose to optimize Channels Last memory path out of the following reasones: -* **Performance** - NHWC performance is not as good as blocked memory format (nChw16c), but it is close and much better than NCHW. -* **User Experience** - Operator coverage of NHWC would be higher than blocked memory format (`to_mkldnn()` method) so user experience is better. To be specific, it would be very difficult to enable operator that manipulates `dim` on blocked format such as `sum(dim=?)` so you need to convert tensor from blocked memory format back to NHWC by `to_dense()` before feeding it into `sum()`. But it is naturally supported on Channels Last memory format already. +TensorFlow uses NHWC as the default memory format because NHWC has a performance advantage over NCHW. On CPU platforms, we propose to optimize Channels Last memory path for ihe following reasons: +* **Performance** - NHWC performance is not as good as blocked memory format (nChw16c), but it is close, and much better performance than NCHW. +* **User Experience** - Operator coverage of NHWC would be higher than blocked memory format (`to_mkldnn()` method), so user experience is better. To be specific, it is difficult to enable operators that manipulates `dim` on blocked format such as `sum(dim=?)`. You would need to convert tensor from blocked memory format back to NHWC using `to_dense()`, before feeding it into `sum()`. This is naturally supported on Channels Last memory format already. * **Upstream** - Will be easier since CPU doesn't hold secret ingredient and both inference and training will be covered. ## Memory Format Is All That Matters -On CNN models, memory format is all most the foundation of any upper level design. One imporant fact is that converting memory format could be very expensive. Thus, in case that multiple CNN operators are performed in sequence, e.g. `Conv2d -> ReLU -> Conv2d`, it's beneficial to transform them from different memory formats once, do computation and reorder them back. +On CNN models, memory format is almost the foundation of any upper level design. One important fact is that converting memory format could be very expensive. Thus, in case that multiple CNN operators are performed in sequence, e.g. `Conv2d -> ReLU -> Conv2d`, it's beneficial to transform them from different memory formats once, do computation and reorder them back. On PyTorch, you can use 3 types of memory formats on CNN models: @@ -68,7 +68,7 @@ output = model(input) Better to explain the concepts here with a diagram, the **dotted lines** indicate simple memory view, no hard copy. ![fig-2(1)-pt-conv-layout-path-dispatch](../../../images/channels_last/figure2_dispatch.png) -**Conclusion** is that NHWC path saves the reorders from feature maps compared with NCHW path, but still weight reorder is necessary since oneDNN requires weights to be in blocked memory format. From performance perspective, when `batch_size=N`, weight reorder is minimum compared to feature map reorder. 
But when `batch_size=1`, weight reoder is usually not negligible. So whether to enable weight prepacking on channels last memory format needs further discussion. +**Conclusion** is that NHWC path saves the reorders from feature maps compared with NCHW path, but still weight reorder is necessary since oneDNN requires weights to be in blocked memory format. From performance perspective, when `batch_size=N`, weight reorder is minimum compared to feature map reorder. But when `batch_size=1`, weight reorder is usually not negligible. So whether to enable weight prepacking on channels last memory format needs further discussion. ## PyTorch Strided Layout @@ -83,11 +83,11 @@ offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w = CHW * n + HW * c + W * h + 1 * w ``` -One merit of introducing **stride** is it will be able to express noncontiguous tensors, e.g. a slice of big tensor. For example, the 'Xs' in the following image has a stride of . +One merit of introducing **stride** is that it can express noncontiguous tensors, e.g. a slice of big tensor. For example, the 'Xs' in the following image have a stride of . ![fig-3-pytorch-strided-layout](../../../images/channels_last/figure3_strided_layout.png) -Keep in mind that PyTorch Tensor does not have an attribute so called 'memory_format' or something else. The memory format expression completely relies on **size** and **stride**, design principle can be found at reference: [RFC: Memory format (aka layout aka NHWC) support](https://github.com/pytorch/pytorch/issues/19092). So no matter what the tensor's memory format is, we need a logical canonical order for the dimensions - that is **NCHW** on PyTorch. Thus, **size** and **stride** are ALWAYS described in the order of **NCHW**. OK let's take a look at the Channels Last case of the previous question: +Keep in mind that PyTorch Tensor does not have an attribute called 'memory_format' or something else. The memory format expression completely relies on **size** and **stride**. The design principle can be found at reference: [RFC: Memory format (aka layout aka NHWC) support](https://github.com/pytorch/pytorch/issues/19092). No matter what the tensor's memory format is, we need a logical canonical order for the dimensions - that is **NCHW** on PyTorch. Thus, **size** and **stride** are ALWAYS described in the order of **NCHW**. Let's now look at the Channels Last case of the previous question: ``` tensor: index: @@ -129,7 +129,7 @@ input = input.to(memory_format=torch.channels_last) Detailed operator coverage information has been listed at reference [Operators-with-Channels-Last-support](https://github.com/pytorch/pytorch/wiki/Operators-with-Channels-Last-support). In brief, ImageNet training topologies on GPU already have full support on Channels Last memory format, while CPU doesn't. Some spontaneous questions: -* **How to tell whether this model or operator support Channels Last?** - This requires mannual memory format check, aka. 'torch.channels_last' input and weight shall NOT generate 'torch.contiguous_format' output. +* **How to tell whether this model or operator support Channels Last?** - This requires manual memory format check, aka. 'torch.channels_last' input and weight shall NOT generate 'torch.contiguous_format' output. 
* **What if the model comprises of operator not supported Channels Last?** - No errors messages will be shown, the NHWC tensor will be handled by the operator as a non-contiguous NCHW tensor, so result might not be correct depending on the algorithm of this operator. ## Writing Channels Last Kernels @@ -146,11 +146,11 @@ The general guideline has been listed under reference [Writing-memory-format-awa ### c. Register oneDNN Kernel on Channels Last -Essence of registering an oneDNN kernel under Channels Last memory format on CPU is no differenct from [cuDNN](https://github.com/pytorch/pytorch/pull/23861): Only very few upper level change is needed such as accommodate 'contiguous()' to 'contiguous(suggested_memory_format)'. The automatic reorder of oneDNN weight shall been hided in ideep. +Registering a oneDNN kernel under Channels Last memory format on CPU is no different from [cuDNN](https://github.com/pytorch/pytorch/pull/23861): Only very few upper level changes are needed, such as accommodate 'contiguous()' to 'contiguous(suggested_memory_format)'. The automatic reorder of oneDNN weight shall been hidden in ideep. ## oneDNN NHWC APIs -Compared to NCHW interfaces, 2 parts need to be addressed on NHWC inferfaces: +Compared to NCHW interfaces, 2 parts need to be addressed on NHWC interfaces: ### a. Create NHWC Memory @@ -185,6 +185,6 @@ auto src_mem = memory(src_md, src_data_ptr, engine); * **Scenarios** - cover both training and inference; * **Models** - ResNet50 and ResNext101, extended targets: torchvision models, detectron2; * **Performance Targets** - training >0.8x blocked; inference throughput > 0.8x blocked; inference latency? (need further discussion) -* **Operator Converage** - No less than GPU path; -* **BFloat16** - This part shall align with big picture of BFloat16 integration (need further discussion); +* **Operator Coverage** - No less than GPU path; +* **BFloat16** - This part shall align with BFloat16 integration (need further discussion); * **int8** - Need further discussion. diff --git a/docs/tutorials/features/optimizer_fusion.md b/docs/tutorials/features/optimizer_fusion.md index 2aac46e73..6f8f722e2 100644 --- a/docs/tutorials/features/optimizer_fusion.md +++ b/docs/tutorials/features/optimizer_fusion.md @@ -2,7 +2,7 @@ Optimizer Fusion ================ ## Introduction -As the idea of TorchScript, operation fusion reduces number of operators that will be executed, and reduces overhead time. This methodology is also applied in ipex optimizer Optimization. We support Lamb/Adagrad/SGD fusion for both FP32/BF16(Split) at current stage. +As with TorchScript, operation fusion reduces the number of operators that will be executed, and reduces overhead time. This methodology is also applied in ipex optimizer Optimization. We support Lamb/Adagrad/SGD fusion for both FP32/BF16(Split) at current stage. Let's use [adagrad update](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html?highlight=adagrad#torch.optim.Adagrad) as an example. @@ -17,15 +17,15 @@ Let's use [adagrad update](https://pytorch.org/docs/stable/generated/torch.optim ## Operation Fusion -One problem of the native implementation above is that we need to access the whole storage of "grad", "parameters" and "state sum" several times. For example, we need to access the whole storage of "parameters" and "grad" at the first clause. For large topologies, it is highly possible that the "grad" and "parameters" cannot be stored on CPU onboard cache. 
Thus when we need to access the storage of "grad" again when executing the third clause, processors need to read data out from memory again, rather than highly effeciently using CPU onboard high speed cache. This is a memory-bound bottle neck preventing us to get a good performance. +One problem of the native implementation above is that we need to access the whole storage of "grad", "parameters", and "state sum" several times. For example, we need to access the whole storage of "parameters" and "grad" at the first clause. For large topologies, it is possible that the "grad" and "parameters" cannot be stored on the onboard CPU cache. When we need to access the storage of "grad" again when executing the third clause, the processor must read data out from memory again instead of the more efficient onboard high speed CPU cache. This is a memory-bound bottle neck preventing good performance. -Fusion is the methodology to solve this problem. Since the 5 clauses in the pseudo code are all element-wise operations. We can fused them into a single one, like the pseudo code below. +Fusion is the methodology to solve this problem. Since the 5 clauses in the pseudo code are all element-wise operations. We can fuse them into a single one, like the pseudo code below. ```python adagrad_fused_step(param, grad, state_sum, ...(other args)) ``` - In our fused opertors, we can seperate the storage of "grad", "paramerters" and "state sum" in several groups and ensure each groups are small enough to be stored at cache. The pseudo code below illustrates our execution process. + In our fused operators, we can separate the storage of "grad", "parameters" and "state sum" in several groups and ensure each group is small enough to be stored in the cache. The pseudo code below illustrates our execution process. ```python grad = (grad0, grad1, ..., grad_n) diff --git a/docs/tutorials/features/runtime_extension.md b/docs/tutorials/features/runtime_extension.md index 2be2dd128..d1cea79e1 100644 --- a/docs/tutorials/features/runtime_extension.md +++ b/docs/tutorials/features/runtime_extension.md @@ -1,31 +1,31 @@ -Runtime Extension (Experimental) -================================ +Runtime Extension +================= -Intel® Extension for PyTorch\* Runtime Extension provides a couple of PyTorch frontend APIs for users to get finer-grained control of the thread runtime. It provides +Intel® Extension for PyTorch\* Runtime Extension provides a couple of PyTorch frontend APIs for users to get finer-grained control of the thread runtime. It provides: 1. Multi-stream inference via the Python frontend module `intel_extension_for_pytorch.cpu.runtime.MultiStreamModule`. 2. Spawn asynchronous tasks via the Python frontend module `intel_extension_for_pytorch.cpu.runtime.Task`. -3. Configure core bindings for OpenMP threads via the Python frontend `intel_extension_for_pytorch.cpu.runtime.pin`. +3. Program core bindings for OpenMP threads via the Python frontend `intel_extension_for_pytorch.cpu.runtime.pin`. -Please **note**: Intel® Extension for PyTorch\* Runtime extension is still in the **Experimental** stage. The API is subject to change. More detailed descriptions are available at [API Documentation page](../api_doc.html). +**note**: Intel® Extension for PyTorch\* Runtime extension is in the **experimental** stage. The API is subject to change. More detailed descriptions are available at [API Documentation page](../api_doc.rst). 
## Requirements -Intel® Extension for PyTorch\* Runtime Extension relies on `intel omp` to bind threads to cores. If you want to use it in your application, please start model script with extra flag: `LD_PRELOAD=$LD_PRELOAD:$PATH/libiomp5.so python model_script.py`. +Intel® Extension for PyTorch\* Runtime Extension relies on `intel omp` to bind threads to cores. If you want to use it in your application, start model script with an extra flag: `LD_PRELOAD=$LD_PRELOAD:$PATH/libiomp5.so python model_script.py`. ## Use Cases -### Example of Multi Stream Module +### Example of MultiStream Module -Runtime extension supports weight-sharing multi-stream inference for throughput mode on CPU. You just need to convert the original model into multi stream model and run the new multi stream model as normal. The detailed description of parameters to create `MultiStreamModule` is available at [API Documentation page](../api_doc.html). +Runtime extension supports weight-sharing multi-stream inference for throughput mode on CPU. You need to convert the original model into multi-stream model and run the new multi-stream model as normal. The detailed description of parameters to create `MultiStreamModule` is available at [API Documentation page](../api_doc.rst). -`MultiStreamModule` targets to improve performance of inference in throughput mode. We recommend creating a `MultiStreamModule` object with the `num_streams` parameter set to "AUTO" to heuristically decide the number of streams. Usually, it provides reasonable performance. However, it may still not be optimal for some cases (refer to the section [Performance recipes](#performance-recipes) for details) where manual tuning for the number of streams is needed. +`MultiStreamModule` can improve performance for inference in throughput mode. We suggest creating `MultiStreamModule` with `num_streams` of "AUTO", which heuristically decides the number of streams. Usually, it provides a reasonable performance. However, it may not be optimal for some cases (refer to the section [Performance recipes](#performance-recipes) for details). Manual tuning for number of streams is needed. The `MultiStreamModule` creates number of streams based on input parameter `num_streams` and bind cores to stream based on input parameter `cpu_pool`. If the number of cores inside `cpu_pool` is divisible by `num_streams`, the cores will be allocated equally to each stream. If the number of cores inside `cpu_pool` is not divisible by `num_streams` with remainder N, one extra core will be allocated to the first N streams. We suggest to set the `num_streams` as divisor of core number inside `cpu_pool`. If the inputs' batchsize is larger than and divisible by ``num_streams``, the batchsize will be allocated equally to each stream. If batchsize is not divisible by ``num_streams`` with remainder N, one extra piece will be allocated to the first N streams. If the inputs' batchsize is less than ``num_streams``, only the first batchsize's streams are used with mini batch as one. We suggest to set inputs' batchsize larger than and divisible by ``num_streams``. When creating `MultiStreamModule`, if you leave num of streams as "AUTO", we suggest to set inputs' batchsize larger than and divisible by number of cores. 
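+
+To make the core and batch split above concrete, here is a minimal sketch (the 28-core socket, the toy model, and the batch size of 64 are illustrative assumptions, not values required by the API):
+
+```
+import torch
+import intel_extension_for_pytorch
+
+# A toy model, traced and frozen so each stream runs a torch.jit.ScriptModule.
+model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
+traced_model = torch.jit.freeze(torch.jit.trace(model, torch.randn(64, 3, 224, 224)))
+
+# Illustrative assumption: node 0 exposes 28 physical cores.
+cpu_pool = intel_extension_for_pytorch.cpu.runtime.CPUPool(node_id=0)
+multi_stream_model = intel_extension_for_pytorch.cpu.runtime.MultiStreamModule(
+    traced_model,
+    num_streams=4,   # 28 cores / 4 streams -> 7 OpenMP threads per stream
+    cpu_pool=cpu_pool)
+
+with torch.no_grad():
+    # A global batch of 64 is split into 64 / 4 = 16 samples per stream.
+    y = multi_stream_model(torch.randn(64, 3, 224, 224))
+```
+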
-Firstly, creating some ExampleNets which will be used by below examples: +Let's create some ExampleNets that will be used by further examples: ``` class ExampleNet1(torch.nn.Module): def __init__(self): @@ -95,9 +95,9 @@ cpu_pool = intel_extension_for_pytorch.cpu.runtime.CPUPool(node_id=0) # Create the input hint object input_hint = intel_extension_for_pytorch.cpu.runtime.MultiStreamModuleHint(0, 0) # Create the output hint object -# When python module has multi output tensors, it will be auto pack into a tuple, So we pass a tuple(0, 0) to create the output_hint +# When Python module has multi output tensors, it will be auto pack into a tuple, So we pass a tuple(0, 0) to create the output_hint output_hint = intel_extension_for_pytorch.cpu.runtime.MultiStreamModuleHint((0, 0)) -multi_Stream_model = intel_extension_for_pytorch.cpu.runtime.MultiStreamModule(traced_model2, +multi_Stream_model = intel_extension_for_pytorch.cpu.runtime.MultiStreamModule(traced_model2, num_streams=2, cpu_pool=cpu_pool, input_split_hint=input_hint, @@ -108,29 +108,29 @@ with torch.no_grad(): ``` #### Performance recipes -There are 2 motivations to use the `MultiStreamModule`: +There are two motivations to use the `MultiStreamModule`: 1. Better cache locality: With `MultiStreamModule`, the activations will be limited in the CPU cores allocated to this stream instead of the whole cpu_pool. 2. Reduce the OMP sync overhead: if one CPU core allocated to one stream, the whole execution needs to do OMP sync once after all streams finish execution instead of sync per layer. -Thus, `MultiStreamModule` may benefit performance for inference in throughput mode. However, the end-to-end performance is still subject to: -1. The kernels' efficiency which are different under different OMP threads' number. +Thus, `MultiStreamModule` may benefit performance for inference in throughput mode. However, the end-to-end performance is impacted by these issues: +1. The kernels' efficiency, which are different under different OMP threads' number. 2. The overhead of inputs' auto split and outputs' auto concat for each stream. 3. The overhead of pthread (stream async execution) wakes up and threads' synchronization after stream execution. -Below are some performance receipts we suggest to use for better multi stream performance. +Here are some performance receipes that we recommend for better multi-stream performance. -* When creating `MultiStreamModule` with `torch.nn.Module` as imperative path module, each stream inside `MultiStreamModule` suffers the GIL issue when do inference together which hurts end-to-end performance. As the results, we suggest to create `MultiStreamModule` with the `torch.jit.ScriptModule`. +* When creating `MultiStreamModule` with `torch.nn.Module` as imperative path module, each stream inside `MultiStreamModule` suffers the GIL issue when doing inference together. This hurts end-to-end performance. We recommend creating `MultiStreamModule` with the `torch.jit.ScriptModule`. -* For convolution network, `intel_extension_for_pytorch` has the quick path getting convolution primitive to mitigate overhead when `OMP_NUM_THREADS` is same between the phase of `torch.jit.trace` and model execution. To use this quick path for better performance, we suggest to set the `OMP_NUM_THREADS` environment before launch the model script. The suggested value of `OMP_NUM_THREADS` should equal to the threads number used by each stream. 
For example, creating `MultiStreamModule` as stream number of `s1`, CPUPool with core number `c1`, each stream will allocate threads number as `c1/s1`. Then we should set `OMP_NUM_THREADS` as this value. +* For convolution network, `intel_extension_for_pytorch` has the quick path getting convolution primitive to mitigate overhead when `OMP_NUM_THREADS` is the same between the `torch.jit.trace` and model execution phases. To use this quick path for better performance, we recommend setting the `OMP_NUM_THREADS` environment before launching the model script. The recommended value of `OMP_NUM_THREADS` should equal the threads number used by each stream. For example, creating `MultiStreamModule` as stream number `s1` and CPUPool with core number `c1`, each stream will allocate threads number as `c1/s1`. We recommend setting `OMP_NUM_THREADS` as this value. -* `Numactl` and the threads management in `MultiStreamModule` works in different levels. `MultiStreamModule` has the thread affinity setting for each stream which works in the thread level. However, for the python modules outside the stream, such as the dataloader, are out of radar for `MultiStreamModule`. As the result, we suggest to use `numactl -C core_ids -m node_id` for the process level core and memory resource management. For the core resource setting by `numactl`, suggest to set the same or superset of the core resource to create `CPUPool`. Otherwise, the behavior is undefined in current implementation. +* `Numactl` and the threads management in `MultiStreamModule` work at different levels. `MultiStreamModule` has the thread affinity setting for each stream, which works in the thread level. However, for the Python modules outside the stream, such as the dataloader, are out of view for `MultiStreamModule`. As the result, we recommend using `numactl -C core_ids -m node_id` for the process level core and memory resource management. For the core resource setting by `numactl`, set it the same or superset of the core resource to create `CPUPool`. Otherwise, the behavior is undefined in current implementation. #### Known issues -* Int8 data type does not support dynamic shape well. To avoid the performance issue, we suggest setting the batchsize to do `jit.trace` with same mini batchsize used by each stream. For example, creating `MultiStreamModule` as stream number of `s1`, input global batchsize as `gb`, each stream will inference with mini-batchsize as `gb/s1`. Then we should use this mini-batchsize value to do `jit.trace`. To be aware of the `num_streams` value, we suggest creating `MultiStreamModule` with `num_streams` setting explicitly instead of "AUTO". Due to the same limitation, the behavior that each stream inference with different mini batchsize of int8 data type is undefined and not supported. +* Intel® Extension for PyTorch\* runtime extension feature with Int8 data type does not support dynamic shape well. To avoid performance issues, we recommend setting the batchsize to do `jit.trace` with same mini batchsize used by each stream. For example, creating `MultiStreamModule` as stream number of `s1` and input global batchsize as `gb`, each stream will inference with mini-batchsize of `gb/s1`. We should use this mini-batchsize value to do `jit.trace`. To be aware of the `num_streams` value, we recommend creating `MultiStreamModule` with `num_streams` setting explicitly instead of "AUTO". 
Due to the same limitation, running each stream with a different int8 mini-batchsize is undefined behavior and not supported.

 ### Example of asynchronous task
-Here is an example about how to use the asynchronous task. With the support of runtime API, you can run 2 modules simultaneously. Each module runs on the corresponding cpu pool.
+Here is an example of using asynchronous tasks. With the support of the runtime API, you can run 2 modules simultaneously. Each module runs on its corresponding cpu pool.

 ```
 # Create the cpu pool and numa aware memory allocator
@@ -161,12 +161,12 @@ with intel_extension_for_pytorch.cpu.runtime.pin(cpu_pool):

 ### How the core binding is implemented
-The Runtime Extension relies on the `kmp_*` API inside `iomp` share library to fulfill the core binding. The idea is that during the initialization of async threads, `kmp_*` API functions are invoked internally to start up an openmp group with specified number of worker threads. Each worker thread is then bound to the designated physical core(s) inside this openmp group. After initialization, any time you submit a task, the openmp group will serve the requested task.
+The Runtime Extension relies on the `kmp_*` API inside the `iomp` shared library to fulfill the core binding. During the initialization of async threads, `kmp_*` API functions are invoked internally to start up an OpenMP group with a specified number of worker threads. Each worker thread is then bound to the designated physical core(s) inside this OpenMP group. After initialization, when you submit a task, the OpenMP group will serve the requested task.

 ### Design of Task
-Task is an abstraction of computation based on PyTorch module and is scheduled asynchronously. When a task created with specific `nn.Module` or `jit module`, a sub-thread which is bound to this task initialized. During the initialization, an openmp worker group is created and bound to this sub-thread. After initialization, the sub-thread spins to wait input. When the main thread submits an input to this task, the sub-thread will wake up and execute the input. The main thread returns a `FutureTensor` and not block until an explicit `FutureTensor.get()` invoking to get the results executed in sub-thread.
+Task is an abstraction of computation based on a PyTorch module and is scheduled asynchronously. When a task is created with a specific `nn.Module` or `jit module`, a sub-thread is initialized and bound to this task. During the initialization, an OpenMP worker group is created and bound to this sub-thread. After initialization, the sub-thread waits for input. When the main thread submits an input to this task, the sub-thread will wake up and execute the input. The main thread returns a `FutureTensor` and does not block until an explicit `FutureTensor.get()` is invoked to get the results executed in the sub-thread.

 ### IOMP preload or load during the runtime
-Since Runtime Extension rely on the APIs from IOMP, we need to preload IOMP before executing the application. And we want Intel® Extension for PyTorch\* default build with Runtime API enabled, which means it should work fine w/o loading IOMP if user didn't use the runtime API. Here we choose to `dlopen` IOMP library during runtime. And we ensure the IOMP symbols initialized once globally.
+Since the Runtime Extension relies on the APIs from IOMP, we need to preload IOMP before executing the application. We want Intel® Extension for PyTorch\* to be built with the Runtime API enabled by default. 
This means it should work fine without loading IOMP if the user doesn't use the runtime API. Here we choose to `dlopen` the IOMP library at runtime and we ensure the IOMP symbols are initialized once globally.
diff --git a/docs/tutorials/features/split_sgd.rst b/docs/tutorials/features/split_sgd.rst
index d8098d2bc..e9576611b 100644
--- a/docs/tutorials/features/split_sgd.rst
+++ b/docs/tutorials/features/split_sgd.rst
@@ -1,19 +1,19 @@
 Split SGD
 =========

-Not only optimizations for inference workloads are Intel's focus, training workloads are also within Intel's optimization scope. As part of it, optimizations for train optimizer functions are an important perspective. The optimizations as implemented as a mechanism called **Split SGD**, taking advantage of BFloat16 data type and operator fusion. Optimizer **adagrad**, **lamb** and **sgd** are supported.
+Both inference workloads and training workloads are within Intel's optimization scope. Optimizations for the training optimizer functions are an important part of this. These optimizations use a mechanism called **Split SGD** and take advantage of the BFloat16 data type and operator fusion. The **adagrad**, **lamb**, and **sgd** optimizers are supported.

 BFloat16
 --------

-The figure below shows definition of Float32 (top) and `BFloat16 `_ (bottom) data types. Comparing to Float32, BFloat16 is only half-long, and thus saves half memory. It is supported natively at instruction set level to boost deep learning workloads from the 3rd Generation of Xeon Scalable Processors. It is highly compatible to Float32, both have the same bit length for "sign" and "exponent" part. Though, BFloat16 only has 7-bit "mantissa" part while Float32 has 23 bits. This makes BFloat16 has the same capacity to represent "digit ranges" with that of Float32, but has shorter "precision" part.
+The figure below shows the definition of the Float32 (top) and `BFloat16 `_ (bottom) data types. Compared to Float32, BFloat16 is only half as long, and thus saves half the memory. It is supported natively at the instruction set level to boost deep learning workloads starting from the 3rd Generation of Intel® Xeon® Scalable Processors. It is compatible with Float32 since both have the same bit length for the "sign" and "exponent" parts. BFloat16 only has a 7-bit "mantissa" part while Float32 has 23 bits. BFloat16 has the same capacity to represent "digit ranges" as Float32, but has a shorter "precision" part.

 .. image:: https://user-images.githubusercontent.com/33838455/86600181-00f5c200-bfa0-11ea-93f0-95af3f0bff08.png
    :width: 1200
    :align: center
    :alt: Data types

-Advantage of BFloat16 is that it saves memory and reduces computation workload, but the less mantissa bits brings negative effects as well. Let's use an "ADD" operation as an example to explain the disadvantage. To perform addition of 2 floating point numbers, we need to shift the mantissa part of them left or right to align their exponent parts. Since BFloat16 has shorter mantissa part, it is much easier than Float32 to lose its mantissa part after the shifting, and thus cause accuracy issue.
+An advantage of BFloat16 is that it saves memory and reduces computation workload, but the fewer mantissa bits bring negative effects as well. Let's use an "ADD" operation as an example to explain the disadvantage. To perform addition of 2 floating point numbers, we need to shift the mantissa part of the numbers left or right to align their exponent parts. 
Since BFloat16 has a shorter mantissa part, it is much easier than Float32 to lose its mantissa part after the shifting, which causes accuracy loss.

 Let's use the following two decimal numbers **x** and **y** as an example. We first do the calculation in a high precision data type (10 valid numbers after decimal point).

@@ -27,7 +27,7 @@ Let's use the following two decimal numbers **x** and **y** as an example. We fi

 This makes sense because after shifting **y** right by 5 digits, the fraction part is still there.

-Then, let's do the calculation in a low precision data type (5 valid numbers after decimal point)
+Let's do the calculation using a low precision data type (5 valid numbers after decimal point):

 .. math::

@@ -37,7 +37,7 @@ Then, let's do the calculation in a low precision data type (5 valid numbers aft
    &= 0.12345*10^{10} + 0.00000*10^{10} \\
    &= 0.12345*10^{10}

-Since the data type has only 5 digits for the fraction part, after shifting y by 5 digits, its fraction part is fully removed. This brings accuracy loss. This is a drawback of lower precision data types form their nature.
+Since the data type has only 5 digits for the fraction part, after shifting **y** by 5 digits, its fraction part is fully removed. This causes significant accuracy loss and, by their nature, is a drawback of lower-precision data types.

 Stochastic Gradient Descent (SGD)
 ---------------------------------
@@ -48,31 +48,31 @@ Basically, training involves 3 steps:
 2. Backward propagation: Utilize chain rule to calculate gradients of parameters based on the loss number.
 3. Parameter update: Update value of parameters by gradients along with calculated loss values.

-The training is actually a loop of these 3 steps in sequence untill the loss number meets requirements or after a determine timeout duration. The Stochastic Gradient Descent (SGD) is most widely used at the 3rd step to update parameter values. To make it easy to understand, the 3rd step is described as the following formula:
+The training is actually a loop of these 3 steps in sequence until the loss meets requirements or a timeout duration is reached. Stochastic Gradient Descent (SGD) is most widely used at the 3rd step to update parameter values. To make it easy to understand, the 3rd step is described as the following formula:

 .. math::

    W = W + α * gW

-Where :math:`W` denotes parameters to be updated. :math:`gW` denotes gradient got during backward propagation and :math:`α` denotes learning rate.
+Where :math:`W` denotes the parameters to be updated, :math:`gW` denotes the gradient computed during backward propagation, and :math:`α` denotes the learning rate.

 Split SGD
 ---------

-Since the addition applied in SGD is repeated again and again, according to the drawback that we mentioned before of low precision data types, if both the :math:`W` and :math:`gW` are stored in BFloat16 data type, we will most likely lose valid bits and make the training results not accurate. Using FP32 master parameters is a common practice of avoiding the round-off errors at parameter update step.
+Since the addition applied in SGD is repeated over and over, because of the low precision loss mentioned earlier, if both :math:`W` and :math:`gW` are stored in the BFloat16 data type, we will most likely lose valid bits and make the training results inaccurate. Using FP32 master parameters is a common practice for avoiding round-off errors at the parameter update step. 
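As a minimal illustration (an editorial sketch using plain PyTorch tensors, not part of the patched API surface), the snippet below shows a small SGD-style update being rounded away in BFloat16 while Float32 keeps it, which is exactly why FP32 master parameters are needed:

.. code-block:: python

   import torch

   # A small SGD-style update: w = w + lr * gW
   w = torch.tensor(1.0)
   update = torch.tensor(0.001)

   # Float32 keeps the small update
   print((w + update).item())                    # ~1.001

   # BFloat16 rounds it away because of the short mantissa
   w_bf16 = w.to(torch.bfloat16)
   update_bf16 = update.to(torch.bfloat16)
   print((w_bf16 + update_bf16).float().item())  # 1.0 -- the update is lost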
To keep FP32 master parameters, we have 3 design choices:
-(1) Only save FP32 parameters: For this choice, we need introduce additional FP32->BF16 cast at each iter to get benefit from BF16 at forward and backward propagation steps.
-(2) Save both FP32 and BF16 parameters: BF16 parameter are used at forward and backward propagation steps. And use FP32 master parameters at update steps. For this choice we introduce more memory footprint.
-(3) "Split" choice: In order to get performance benefits with BFloat16 at forward and backward propagation steps, while avoiding increase the memory footprint, we propose the mechanism **"Split SGD"**.
+1. Only save FP32 parameters: For this choice, we need to introduce an additional FP32->BF16 cast at each iteration to benefit from BF16 at the forward and backward propagation steps.
+2. Save both FP32 and BF16 parameters: The BF16 parameter is used at the forward and backward propagation steps, and the FP32 master parameter is used at the update step. This choice introduces a larger memory footprint.
+3. "Split" choice: In order to get performance benefits with BFloat16 at the forward and backward propagation steps, while avoiding an increase in the memory footprint, we propose the mechanism **"Split SGD"**.

 The idea is to "split" a 32-bit floating point number into 2 parts:

 1. Top half: First 16 bits can be viewed as exactly a BFloat16 number.
 2. Bottom half: Last 16 bits are still kept to avoid accuracy loss.

-FP32 parameters are split into "Top half" and "Bottom half". When performing forward and backward propagations, the Top halfs are used to benefit from Intel BFloat16 support. When performing paramter update with SGD, we concatenate the Top half and the Bottom half to recover the parameters back to FP32 and then perform regular SGD operations.
+FP32 parameters are split into "Top half" and "Bottom half". When performing forward and backward propagations, the Top halves are used to take advantage of Intel BFloat16 support. When performing the parameter update with SGD, we concatenate the Top half and the Bottom half to recover the parameters back to FP32 and then perform regular SGD operations.

-It is a common pratice to use FP32 for master parameters in order to avoid round-off errors with BF16 parameter update. **SplitSGD** is an optimization of storing FP32 master parameters with reduced memory footprint.
+It is a common practice to use FP32 for master parameters in order to avoid round-off errors with the BF16 parameter update. **SplitSGD** is an optimization that stores FP32 master parameters with a reduced memory footprint.

 .. image:: ../../../images/split_sgd/split_sgd.png
    :width: 800
diff --git a/docs/tutorials/installation.md b/docs/tutorials/installation.md
index dc6e15d16..1aa6b36c9 100644
--- a/docs/tutorials/installation.md
+++ b/docs/tutorials/installation.md
@@ -5,17 +5,18 @@ Installation Guide

 |Category|Content|
 |--|--|
-|Compiler|Recommend to use GCC 9|
+|Compiler|Recommend using GCC newer than 11.2|
 |Operating System|CentOS 7, RHEL 8, Rocky Linux 8.5, Ubuntu newer than 18.04|
 |Python|See prebuilt wheel files availability matrix below|

 ## Install PyTorch

-You need to make sure PyTorch is installed in order to get the extension working properly. For each PyTorch release, we have a corresponding release of the extension. Here is the PyTorch versions that we support and the mapping relationship:
+Make sure PyTorch is installed so that the extension will work properly. For each PyTorch release, we have a corresponding release of the extension. 
Here are the PyTorch versions that we support and the mapping relationship: |PyTorch Version|Extension Version| |--|--| -|[v1.11.\*](https://github.com/pytorch/pytorch/tree/v1.11.0 "v1.11.0")|[v1.11.\*](https://github.com/intel/intel-extension-for-pytorch/tree/v1.11.0)| +|[v1.12.\*](https://github.com/pytorch/pytorch/tree/v1.12.0 "v1.12.0")|[v1.12.\*](https://github.com/intel/intel-extension-for-pytorch/tree/v1.12.0)| +|[v1.11.\*](https://github.com/pytorch/pytorch/tree/v1.11.0 "v1.11.0")|[v1.11.\*](https://github.com/intel/intel-extension-for-pytorch/tree/v1.11.200)| |[v1.10.\*](https://github.com/pytorch/pytorch/tree/v1.10.0 "v1.10.0")|[v1.10.\*](https://github.com/intel/intel-extension-for-pytorch/tree/v1.10.100)| |[v1.9.0](https://github.com/pytorch/pytorch/tree/v1.9.0 "v1.9.0")|[v1.9.0](https://github.com/intel/intel-extension-for-pytorch/tree/v1.9.0)| |[v1.8.0](https://github.com/pytorch/pytorch/tree/v1.8.0 "v1.8.0")|[v1.8.0](https://github.com/intel/intel-extension-for-pytorch/tree/v1.8.0)| @@ -25,15 +26,15 @@ You need to make sure PyTorch is installed in order to get the extension working |[v1.5.0-rc3](https://github.com/pytorch/pytorch/tree/v1.5.0-rc3 "v1.5.0-rc3")|[v1.0.1](https://github.com/intel/intel-extension-for-pytorch/tree/v1.0.1)| |[v1.5.0-rc3](https://github.com/pytorch/pytorch/tree/v1.5.0-rc3 "v1.5.0-rc3")|[v1.0.0](https://github.com/intel/intel-extension-for-pytorch/tree/v1.0.0)| -Here is an example showing how to install PyTorch. For more details, please refer to [pytorch.org](https://pytorch.org/get-started/locally/) +Here is an example showing how to install PyTorch. For more details, refer to [pytorch.org](https://pytorch.org/get-started/locally/). --- **Note:** -For the extension version earlier than 1.8.0, a patch has to be manually applied to PyTorch source code. Please check previous installation guide. +For the extension version earlier than 1.8.0, a patch has to be manually applied to PyTorch source code. Check that version's installation guide. -From 1.8.0, compiling PyTorch from source is not required. If you still want to compile PyTorch, please follow instructions [here](https://github.com/pytorch/pytorch#installation). Please make sure to checkout the correct PyTorch version according to the table above. +From 1.8.0, compiling PyTorch from source is not required. If you still want to compile PyTorch, follow these [installation instructions](https://github.com/pytorch/pytorch#installation). Make sure to check out the correct PyTorch version according to the table above. --- @@ -43,13 +44,15 @@ Prebuilt wheel files availability matrix for Python versions | Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | | :--: | :--: | :--: | :--: | :--: | :--: | +| 1.12.0 | | ✔️ | ✔️ | ✔️ | ✔️ | +| 1.11.200 | | ✔️ | ✔️ | ✔️ | ✔️ | | 1.11.0 | | ✔️ | ✔️ | ✔️ | ✔️ | | 1.10.100 | ✔️ | ✔️ | ✔️ | ✔️ | | | 1.10.0 | ✔️ | ✔️ | ✔️ | ✔️ | | | 1.9.0 | ✔️ | ✔️ | ✔️ | ✔️ | | | 1.8.0 | | ✔️ | | | | -**Note:** Intel® Extension for PyTorch\* has PyTorch version requirement. Please check the mapping table above. +**Note:** Intel® Extension for PyTorch\* has PyTorch version requirement. Check the mapping table above. Starting from 1.11.0, you can use normal pip command to install the package. 
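After installation, a quick sanity check (assuming both packages are already installed in the active environment) is to print the two versions and confirm they line up with the mapping table above:

```
import torch
import intel_extension_for_pytorch as ipex

# The two reported versions should correspond to one row of the mapping table above.
print("torch:", torch.__version__)
print("ipex :", ipex.__version__)
```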
@@ -60,15 +63,21 @@ python -m pip install intel_extension_for_pytorch Alternatively, you can also install the latest version with the following commands: ``` -python -m pip install intel_extension_for_pytorch -f https://software.intel.com/ipex-whl-stable +python -m pip install intel_extension_for_pytorch -f https://developer.intel.com/ipex-whl-stable ``` -**Note:** For version prior to 1.10.0, please use package name `torch_ipex`, rather than `intel_extension_for_pytorch`. +For pre-built wheel files with oneDNN Graph Compiler, use the following command to perform the installation. + +``` +python -m pip install intel_extension_for_pytorch -f https://developer.intel.com/ipex-whl-dev +``` -**Note:** To install a package with a specific version, please run with the following command. +**Note:** For versions before 1.10.0, use package name `torch_ipex`, rather than `intel_extension_for_pytorch`. + +**Note:** To install a package with a specific version, run with the following command: ``` -python -m pip install == -f https://software.intel.com/ipex-whl-stable +python -m pip install == -f https://developer.intel.com/ipex-whl-stable ``` ## Install via source compilation @@ -76,7 +85,7 @@ python -m pip install == -f https://software.intel.c ```bash git clone --recursive https://github.com/intel/intel-extension-for-pytorch cd intel-extension-for-pytorch -git checkout v1.11.0 +git checkout v1.12.0 # if you are updating an existing checkout git submodule sync @@ -85,15 +94,47 @@ git submodule update --init --recursive python setup.py install ``` +## Install via Docker container + +### Build Docker container from Dockerfile + +Run the following commands to build the `pip` based deployment container: + +```console +$ cd docker +$ DOCKER_BUILDKIT=1 docker build -f Dockerfile.pip -t intel-extension-for-pytorch:pip . +$ docker run --rm intel-extension-for-pytorch:pip python -c "import torch; import intel_extension_for_pytorch as ipex; print('torch:', torch.__version__,' ipex:',ipex.__version__)" +``` + +Run the following commands to build the `conda` based development container: + +```console +$ cd docker +$ DOCKER_BUILDKIT=1 docker build -f Dockerfile.conda -t intel-extension-for-pytorch:conda . +$ docker run --rm intel-extension-for-pytorch:conda python -c "import torch; import intel_extension_for_pytorch as ipex; print('torch:', torch.__version__,' ipex:',ipex.__version__)" +``` + +### Get docker container from dockerhub + +Pre-built docker images are available at [DockerHub](https://hub.docker.com/r/intel/intel-optimized-pytorch/tags). + +Run the following command to pull the image to your local machine. 
+ +```console +docker pull intel/intel-optimized-pytorch:latest +``` + ## Install C++ SDK |Version|Pre-cxx11 ABI|cxx11 ABI| |--|--|--| +| 1.12.0 | [libintel-ext-pt-1.12.0+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-1.12.0%2Bcpu.run) | [libintel-ext-pt-cxx11-abi-1.12.0+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-cxx11-abi-1.12.0%2Bcpu.run) | +| 1.11.200 | [libintel-ext-pt-1.11.200+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-shared-with-deps-1.11.200%2Bcpu.run) | [libintel-ext-pt-cxx11-abi-1.11.200+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-cxx11-abi-shared-with-deps-1.11.200%2Bcpu.run) | | 1.11.0 | [libintel-ext-pt-1.11.0+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-1.11.0%2Bcpu.run) | [libintel-ext-pt-cxx11-abi-1.11.0+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-cxx11-abi-1.11.0%2Bcpu.run) | | 1.10.100 | [libtorch-shared-with-deps-1.10.0%2Bcpu-intel-ext-pt-cpu-1.10.100.zip](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/wheels/v1.10/libtorch-shared-with-deps-1.10.0%2Bcpu-intel-ext-pt-cpu-1.10.100.zip) | [libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu-intel-ext-pt-cpu-1.10.100.zip](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/wheels/v1.10/libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu-intel-ext-pt-cpu-1.10.100.zip) | | 1.10.0 | [intel-ext-pt-cpu-libtorch-shared-with-deps-1.10.0+cpu.zip](https://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/wheels/v1.10/intel-ext-pt-cpu-libtorch-shared-with-deps-1.10.0%2Bcpu.zip) | [intel-ext-pt-cpu-libtorch-cxx11-abi-shared-with-deps-1.10.0+cpu.zip](https://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/wheels/v1.10/intel-ext-pt-cpu-libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu.zip) | -**Usage:** For version newer than 1.11.0, donwload one run file above according to your scenario, run the following command to install it and follow the [C++ example](./examples.html#c). +**Usage:** For version newer than 1.11.0, download one run file above according to your scenario, run the following command to install it and follow the [C++ example](./examples.md#c). ``` bash .run install ``` @@ -104,4 +145,4 @@ You can get full usage help message by running the run file alone, as the follow bash .run ``` -**Usage:** For version prior to 1.11.0, donwload one zip file above according to your scenario, unzip it and follow the [C++ example](./examples.html#c). +**Usage:** For version before 1.11.0, download one zip file above according to your scenario, unzip it and follow the [C++ example](./examples.md#c). diff --git a/docs/tutorials/performance.md b/docs/tutorials/performance.md index 7f76fa1d9..080dd3256 100644 --- a/docs/tutorials/performance.md +++ b/docs/tutorials/performance.md @@ -5,7 +5,9 @@ Performance This page shows performance boost with Intel® Extension for PyTorch\* on several popular topologies. -## Performance Numbers +## INT8 with v1.11 + +### Performance Numbers @@ -25,7 +27,256 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Realtime Inference3 Model Type DatasetMisc.Input Data ShapeTunable Parameters
Batch SizeBoost RatioBatch SizeBoost Ratio
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHzResNet50INT8801.83x11.44xComputer VisionImageNetInput shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
SSD-ResNet34INT8802.16x11.83xComputer VisionCOCOInput shape
[3, 1200, 1200]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
ResNext 32x16dINT8801.81x11.21xComputer VisionImageNetInput shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
VGG-11INT8801.75x11.19xComputer VisionImageNetInput shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
ShuffleNetv2_x1.0INT8802.07x11.47xComputer VisionImageNetInput shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
BERT-LargeINT8802.78x12.04xNLPSquadmax_seq_len=384
Task: Question Answering
Jemalloc;
Intel(R) OpenMP;
inference scripts
Bert-BaseINT8802.05x11.96xNLPMRPCmax_seq_len=128
Task: Text Classification
Jemalloc;
Intel(R) OpenMP;
inference scripts
DistilBERT-BaseINT8802.12x11.57xNLPSquadmax_seq_len=384
Task: Question Answering
Jemalloc;
Intel(R) OpenMP;
inference scripts
+ +
+1. Model Zoo for Intel® Architecture +
+2. Throughput inference runs with single instance per socket. +
+3. Realtime inference runs with multiple instances, 4 cores per instance. +
+ +*Note:* Performance numbers with stock PyTorch are measured with its most performant configuration. + +*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*. + +### Accuracy + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
WorkloadMetricFP32INT8INT8/FP32
BERT-base_text_classificationf10.810.8199.79%
BERT-Largef193.1693.0299.85%
Distilbert-basef186.8486.1399.19%
ResNet50Top176.1575.9899.78%
ResNext 32x16dTop184.1784.0599.86%
SSD-ResNet34mAP0.2000.19999.48%
VGG11Top169.0467.9698.44%
Shufflenetv2_x1.0Top169.3667.9297.93%1
+ +
+1. ShuffleNet INT8 accuracy is expected to improve w/o performance trade-off via histogram calibration algorithm. +
+ +### Configuration + +#### Software Version + +| Software | Version | +| :-: | :-: | +| PyTorch | [v1.11.0](https://pytorch.org/get-started/locally/) | +| Intel® Extension for PyTorch\* | [v1.11.0](https://github.com/intel/intel-extension-for-pytorch/releases) | + +#### Hardware Configuration + +| | 3rd Generation Intel® Xeon® Scalable Processors | +| :-: | :-: | +| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | +| Number of nodes | 1 | +| Number of sockets | 2 | +| Cores/Socket | 40 | +| Threads/Core | 2 | +| uCode | 0xd0002a0 | +| Hyper-Threading | ON | +| TurboBoost | ON | +| BIOS version | 04.12.02 | +| Number of DDR Memory slots | 16 | +| Capacity of DDR memory per slot | 16GB | +| DDR frequency | 3200 | +| Total Memory/Node (DDR+DCPMM) | 256GB | +| Host OS | CentOS Linux release 8.4.2105 | +| Host Kernel | 4.18.0-305.10.2.el8\_4.x86\_64 | +| Docker OS | Ubuntu 18.04.5 LTS | +| [Spectre-Meltdown Mitigation](https://github.com/speed47/spectre-meltdown-checker) | Mitigated | + +## FP32 and BFloat16 with v1.10 + +### Performance Numbers + + + + + + + + + + + + + + + + + + + + + @@ -44,6 +295,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -55,6 +307,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -66,6 +319,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -77,6 +331,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -88,6 +343,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -99,6 +355,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -110,6 +367,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -121,6 +379,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -132,6 +391,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -143,6 +403,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -155,6 +416,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa + @@ -166,6 +428,7 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa +
HardwareWorkload1PrecisionThroughput Inference2Real-time Inference3Model TypeDatasetInput Data ShapeTunable Parameters
Batch SizeComputer Vision ImageNet Input shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
SSD-ResNet34Computer Vision COCO Input shape
[3, 1200, 1200]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
ResNext 32x16dComputer Vision ImageNet Input shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
Faster R-CNN ResNet50 FPNComputer Vision COCO Input shape
[3, 1200, 1200]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
VGG-11Computer Vision ImageNet Input shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
inference scripts
ShuffleNetv2_x1.0Computer Vision ImageNet Input shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
MobileNet v2Computer Vision ImageNet Input shape
[3, 224, 224]
Default memory allocator;
Intel(R) OpenMP;
DLRMRecommendation Terabyte -Default memory allocator;
Intel(R) OpenMP;
inference scripts
BERT-LargeNLP Squad max_seq_len=384
Task: Question Answering
Default memory allocator;
Intel(R) OpenMP;
inference scripts;
Recommend to set auto_kernel_selection to ON when seq_len exceeds 64
Bert-BaseNLP MRPC max_seq_len=128
Task: Text Classification
Jemalloc;
Intel(R) OpenMP;
inference scripts;
Recommend to set auto_kernel_selection to ON when seq_len exceeds 128
Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHzNLP Squad max_seq_len=384
Task: Question Answering
Jemalloc;
Intel(R) OpenMP;
inference scripts
Bert-BaseNLP MRPC max_seq_len=128
Task: Text Classification
Jemalloc;
Intel(R) OpenMP;
inference scripts
@@ -180,16 +443,18 @@ This page shows performance boost with Intel® Extension for PyTorch\* on severa

 *Note:* Performance numbers with stock PyTorch are measured with its most performant configuration.

-## Configuration
+*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*.
+
+### Configuration

-### Software Version
+#### Software Version

 | Software | Version |
 | :-: | :-: |
 | PyTorch | [v1.10.1](https://pytorch.org/get-started/locally/) |
 | Intel® Extension for PyTorch\* | [v1.10.100](https://github.com/intel/intel-extension-for-pytorch/releases) |

-### Hardware Configuration
+#### Hardware Configuration

 | | 3rd Generation Intel® Xeon® Scalable Processors | Products formerly Cooper Lake |
 | :-: | :-: | :-: |
diff --git a/docs/tutorials/performance_tuning.rst b/docs/tutorials/performance_tuning.rst
index cb70605dc..b2f0aa99b 100644
--- a/docs/tutorials/performance_tuning.rst
+++ b/docs/tutorials/performance_tuning.rst
@@ -1,7 +1,7 @@
 Performance Tuning Guide
 ========================

-Intel® Extension for PyTorch\* should yield a satisfying performance with its default configuration for general usage cases. To squeeze usage of hardware resources further, there are still several configurations that users can tune with. This page shows tutorials for performance tuning guides, as well as introduction of an easy-to-use tool.
+Intel® Extension for PyTorch\* should yield satisfying performance with its default configuration for general use cases. To squeeze usage of hardware resources further, there are still several configurations that users can tune. This page shows tutorials for performance tuning guides, as well as an introduction to an easy-to-use tool.

 - `Performance Tuning Guide `_
 - `Launch Script Usage Guide `_
diff --git a/docs/tutorials/performance_tuning/known_issues.md b/docs/tutorials/performance_tuning/known_issues.md
index f3f60bfe2..0b2bf45ba 100644
--- a/docs/tutorials/performance_tuning/known_issues.md
+++ b/docs/tutorials/performance_tuning/known_issues.md
@@ -1,21 +1,45 @@
 Known Issues
 ============

+- `RuntimeError: Overflow when unpacking long` occurs when a tensor's min/max value exceeds the int range while performing int8 calibration. Customize the QConfig to use the min-max calibration method.
+
+- For models with dynamic control flow, try dynamic quantization. Users are likely to get a performance gain for GEMM models.
+
+- When calibrating with quantize_per_tensor and benchmarking with 1 OpenMP\* thread, results might be incorrect with large tensors (find more detailed info [here](https://github.com/pytorch/pytorch/issues/80501)). Editing your code following the pseudocode below can work around this issue if you do need to explicitly set OMP_NUM_THREADS=1 for benchmarking. However, there could be a performance regression if the oneDNN graph compiler prototype feature is utilized. 
+
+  Workaround pseudocode:
+  ```
+  # perform convert/trace/freeze with omp_num_threads > 1(N)
+  torch.set_num_threads(N)
+  prepared_model = prepare(model, input)
+  converted_model = convert(prepared_model)
+  traced_model = torch.jit.trace(converted_model, input)
+  freezed_model = torch.jit.freeze(traced_model)
+  # run freezed model to apply optimization pass
+  freezed_model(input)
+
+  # benchmarking with omp_num_threads = 1
+  torch.set_num_threads(1)
+  run_benchmark(freezed_model, input)
+  ```
+
- BF16 AMP(auto-mixed-precision) runs abnormally with the extension on the AVX2-only machine if the topology contains `Conv`, `Matmul`, `Linear`, and `BatchNormalization`

-- Runtime extension does not support the scenario that the BS is not divisible by the stream number
+- Runtime extension of MultiStreamModule doesn't support DLRM inference, since the input of DLRM (EmbeddingBag specifically) can't simply be batch split.
+
+- Runtime extension of MultiStreamModule has poor performance for RNNT inference compared with native throughput mode. Only part of the RNNT model (joint_net specifically) can be jit traced into a graph. However, in one batch inference, `joint_net` is invoked multiple times. This increases the overhead of MultiStreamModule due to input batch splitting, thread synchronization, and output concatenation.

 - Incorrect Conv and Linear result if the number of OMP threads is changed at runtime

   The oneDNN memory layout depends on the number of OMP threads, which requires the caller to detect the changes for the # of OMP threads while this release has not implemented it yet.

-- INT8 performance of EfficientNet and DenseNet with Intel® Extension for PyTorch\* is slower than that of FP32
-
- Low performance with INT8 support for dynamic shapes

-  The support for dynamic shapes in Intel® Extension for PyTorch\* INT8 integration is still working in progress. For the use cases where the input shapes are dynamic, for example inputs of variable image sizes in an object detection task or of variable sequence lengths in NLP tasks, the Intel® Extension for PyTorch\* INT8 path may slow down the model inference. In this case, please utilize stock PyTorch INT8 functionality.
+  The support for dynamic shapes in Intel® Extension for PyTorch\* INT8 integration is still work in progress. When the input shapes are dynamic, for example inputs of variable image sizes in an object detection task or of variable sequence lengths in NLP tasks, the Intel® Extension for PyTorch\* INT8 path may slow down the model inference. In this case, use stock PyTorch INT8 functionality.

-- Low throughtput with DLRM FP32 Train
+  **Note**: When using the Runtime Extension feature, if the batch size cannot be divided evenly by the number of streams, the mini batch sizes on the streams are not equal and scripts run into this issue.
+
+- Low throughput with DLRM FP32 Train

   A 'Sparse Add' [PR](https://github.com/pytorch/pytorch/pull/23057) is pending on review. The issue will be fixed when the PR is merged. 
@@ -24,23 +48,23 @@ Known Issues ``` import torch import intel_pytorch_extension as ipex - + class Module(torch.nn.Module): def __init__(self): super(Module, self).__init__() self.conv = torch.nn.Conv2d(1, 10, 5, 1) self.bn = torch.nn.BatchNorm2d(10) self.relu = torch.nn.ReLU() - + def forward(self, x): x = self.conv(x) x = self.bn(x) x = self.relu(x) return x - + def inference(self, x): return self.forward(x) - + if __name__ == '__main__': m = Module() m.eval() @@ -50,4 +74,4 @@ Known Issues m.inference(d) ``` - This is PyTorch FX limitation, user can avoid this error by calling `m = ipex.optimize(m, level="O0")`, which doesn't apply ipex optimization, or disable `conv+bn` folding by calling `m = ipex.optimize(m, level="O1", conv_bn_folding=False)`. + This is a PyTorch FX limitation. You can avoid this error by calling `m = ipex.optimize(m, level="O0")`, which doesn't apply ipex optimization, or disable `conv+bn` folding by calling `m = ipex.optimize(m, level="O1", conv_bn_folding=False)`. diff --git a/docs/tutorials/performance_tuning/launch_script.md b/docs/tutorials/performance_tuning/launch_script.md index 2227a0d8b..c24ef8618 100644 --- a/docs/tutorials/performance_tuning/launch_script.md +++ b/docs/tutorials/performance_tuning/launch_script.md @@ -3,12 +3,12 @@ Launch Script Usage Guide ## Overview -As introduced in [Performance Tuning Guide](tuning_guide.md), there are several factors that influence performance very much. Setting those configurations properly contributes to performance boost. However, there is no unified configuration that is optimal to all topologies. Users need to try different combinations by themselves. A *launch* script is provided to automate these configuration settings to free users from this complicated work. This guide helps you to learn some most frequent usage examples. They covers optimized configurations in most cases. +As introduced in the [Performance Tuning Guide](tuning_guide.md), there are several factors that influence performance. Setting configuration options properly contributes to a performance boost. However, there is no unified configuration that is optimal to all topologies. Users need to try different combinations by themselves. A *launch* script is provided to automate these configuration settings to free users from this complicated work. This guide helps you to learn some common usage examples that cover many optimized configuration cases. -The configurations are mainly around the following perspectives. Italic values are default if applicable. -1. OpenMP library: [*Intel OpenMP library* | GNU OpenMP library] -2. Memory allocator: [PyTorch default memory allocator | Jemalloc | *TCMalloc*] -3. Number of instances: [*Single instance* | Multiple instances] +The configurations are mainly around the following perspectives. +1. OpenMP library: [**Intel OpenMP library** (default) | GNU OpenMP library] +2. Memory allocator: [PyTorch default memory allocator | Jemalloc | **TCMalloc** (default)] +3. Number of instances: [**Single instance** (default) | Multiple instances] ## Usage of launch script @@ -17,7 +17,7 @@ The *launch* script is provided as a module of *intel_extension_for_pytorch*. 
Yo
python -m intel_extension_for_pytorch.cpu.launch [knobs] [args]
```

-Available knobs are listed below:
+Available option settings (knobs) are listed below:

| knob | type | default value | help |
| :-- | :--: | :--: | :-- |
@@ -43,13 +43,15 @@ Available knobs are listed below:

**Note:** ```--latency_mode``` and ```--throughput_mode``` are exclusive knobs to ```--ninstances```, ```--ncore_per_instance```, ```--node_id``` and ```--use_logical_core```. I.e., setting either of ```--latency_mode``` or ```--throughput_mode``` overwrites settings of ```--ninstances```, ```--ncore_per_instance```, ```--node_id``` and ```--use_logical_core``` if they are explicitly set in command line. ```--latency_mode``` and ```--throughput_mode``` are mutually exclusive.

-```--skip_cross_node_cores``` is exclusive knob to ```--ninstances```. I.e., setting ```--skip_cross_node_cores``` overwrites setting of ```--ninstances``` if it is explicitly set in command line.
+```--skip_cross_node_cores``` is an exclusive knob to ```--ninstances```. Setting ```--skip_cross_node_cores``` overwrites the setting of ```--ninstances``` if it is explicitly set on the command line.

-The *launch* script respects existing environment variables when it get launched, expect for *LD_PRELOAD*. If you have your favorite values for certain environment variables, you can set them before running the *launch* script. A typical usage scenario is as the following. Intel OpenMP library uses an environment variable *KMP_AFFINITY* to control its behavior. Different settings result in different performance numbers. By default, if you enable Intel OpenMP library, the *launch* script will set *KMP_AFFINITY* to "granularity=fine,compact,1,0". If you want to try with other values, you can use *export* command on Linux to set *KMP_AFFINITY* before you run the *launch* script. In this case, the script will not set the default value but take the existing value of *KMP_AFFINITY*, and print a message to stdout.
+The *launch* script respects existing environment variables when it gets launched, except for *LD_PRELOAD*. If you have your favorite values for certain environment variables, you can set them before running the *launch* script. Intel OpenMP library uses an environment variable *KMP_AFFINITY* to control its behavior. Different settings result in different performance numbers. By default, if you enable Intel OpenMP library, the *launch* script will set *KMP_AFFINITY* to "granularity=fine,compact,1,0". If you want to try other values, you can use the *export* command on Linux to set *KMP_AFFINITY* before you run the *launch* script. In this case, the script will not set the default value but take the existing value of *KMP_AFFINITY*, and print a message to stdout.

-Execution via the *launch* script can dump logs into files under a designated log directory so that it will be convenient to do some investigations afterward. By default, it is disabled to avoid undesired log files. You can enable logging by setting knob ```--log_path``` to be 1 directory to save log files. It can be absolute path or relative path. 2 types of files will be generated. One file (```_timestamp_instances.log```) contains command and information when the script was launched. Another type of files (```_timestamp_instance_N_core#-core#....log```) contain stdout print of each instance.
+Execution via the *launch* script can dump logs into files under a designated log directory so you can do some investigations afterward. By default, it is disabled to avoid undesired log files. 
You can enable logging by setting knob ```--log_path``` to be: +* directory to save log files. It can be an absolute path or relative path. +* types of log files to generate. One file (```_timestamp_instances.log```) contains command and information when the script was launched. Another type of file (```_timestamp_instance_N_core#-core#....log```) contain stdout print of each instance. -E.g. +For example: ``` run_20210712212258_instances.log run_20210712212258_instance_0_cores_0-43.log @@ -77,7 +79,7 @@ Example script [resnet50.py](../examples/resnet50.py) will be used in this guide - [Intel OpenMP library](#intel-openmp-library) - [GNU OpenMP library](#gnu-openmp-library) -__Note:__ GIF files below intend to show CPU usage ONLY. Please do NOT induct performance numbers. +__Note:__ GIF files below illustrate CPU usage ONLY. Do NOT infer performance numbers. ### Single instance for inference @@ -165,7 +167,7 @@ If you check your log directory, you will find directory structure as below. └── logs ├── run_20210712214504_instances.log └── run_20210712214504_instance_0_cores_22-43.log - + ``` The ```run_20210712214504_instances.log``` contains information and command that were used for this execution launch. @@ -338,9 +340,9 @@ $ cat logs/run_20210712221305_instances.log 2021-07-12 22:13:05,476 - __main__ - INFO - numactl -C 22-32 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_2_cores_22-32.log 2021-07-12 22:13:05,479 - __main__ - INFO - numactl -C 33-43 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_3_cores_33-43.log ``` -#### VIII. Your designated number of instances and instance index +#### VIII. Your designated number of instances and instance index -Launcher by default runs all `ninstances` for multi-instance inference/training as shown above. You can specify `instance_idx` to independely run that instance only among `ninstances` +Launcher by default runs all `ninstances` for multi-instance inference/training as shown above. You can specify `instance_idx` to independently run that instance only among `ninstances` ``` python -m intel_extension_for_pytorch.cpu.launch --ninstances 4 --instance_idx 0 --log_path ./logs resnet50.py diff --git a/docs/tutorials/performance_tuning/torchserve.md b/docs/tutorials/performance_tuning/torchserve.md index dca7daceb..ce6f94b2b 100644 --- a/docs/tutorials/performance_tuning/torchserve.md +++ b/docs/tutorials/performance_tuning/torchserve.md @@ -1,11 +1,11 @@ # TorchServe with Intel® Extension for PyTorch* -TorchServe can be used with Intel® Extension for PyTorch* (IPEX) to give performance boost on Intel hardware1. +TorchServe can be used with Intel® Extension for PyTorch* (IPEX) to give a performance boost on Intel hardware1. Here we show how to use TorchServe with IPEX. -1. While IPEX benefits all platforms, platforms with AVX512 benefit the most. +1. While Intel® Extension for PyTorch\* benefits all platforms, those with AVX512 benefit the most. -## Contents of this Document +## Contents of this Document * [Install Intel Extension for PyTorch](#install-intel-extension-for-pytorch) * [Serving model with Intel Extension for PyTorch](#serving-model-with-intel-extension-for-pytorch) * [TorchServe with Launcher](#torchserve-with-launcher) @@ -14,35 +14,35 @@ Here we show how to use TorchServe with IPEX. * [Performance Boost with IPEX and Launcher](#performance-boost-with-ipex-and-launcher) -## Install Intel Extension for PyTorch -Refer to the documentation [here](../installation.html). 
+## Install Intel Extension for PyTorch +Refer to the [installation documentation](../installation.md). -## Serving model with Intel Extension for PyTorch -After installation, all it needs to be done to use TorchServe with IPEX is to enable it in `config.properties`. +## Serving model with Intel Extension for PyTorch +After installation, use TorchServe with IPEX by enabling it in `config.properties`. ``` ipex_enable=true ``` -Once IPEX is enabled, deploying PyTorch model follows the same procedure shown [here](https://pytorch.org/serve/use_cases.html). TorchServe with IPEX can deploy any model and do inference. +Once IPEX is enabled, deploying PyTorch model follows the same procedure shown in the [PyTorch use cases documentation](https://pytorch.org/serve/use_cases.html). TorchServe with IPEX can deploy any model and do inference. ## TorchServe with Launcher -Launcher is a script to automate the process of tunining configuration setting on intel hardware to boost performance. Tuning configurations such as OMP_NUM_THREADS, thread affininty, memory allocator can have a dramatic effect on performance. Please refer to [here](https://github.com/intel/intel-extension-for-pytorch/blob/master/docs/tutorials/performance_tuning/tuning_guide.md) and [here](https://github.com/intel/intel-extension-for-pytorch/blob/master/docs/tutorials/performance_tuning/launch_script.md) for details on performance tuning with launcher. +Launcher is a script to automate tuning configuration setting on Intel hardware to boost performance. Tuning configurations such as OMP_NUM_THREADS, thread affinity, and memory allocator can have a dramatic effect on performance. Refer to the [Performance Tuning Guide](tuning_guide.md) and [performance tuning launch script](launch_script.md) documentation for details. -All it needs to be done to use TorchServe with launcher is to set its configuration in `config.properties`. +Enable TorchServe with launcher by setting its configuration in `config.properties`. -Add the following lines in `config.properties` to use launcher with its default configuration. +Add the following lines in `config.properties` to use launcher with its default configuration. ``` ipex_enable=true cpu_launcher_enable=true ``` -Launcher by default uses `numactl` if its installed to ensure socket is pinned and thus memory is allocated from local numa node. To use launcher without numactl, add the following lines in `config.properties`. +Launcher uses `numactl` if it's installed to ensure a socket is pinned and thus memory is allocated from local numa node. To use launcher without numactl, add the following lines in `config.properties`. ``` ipex_enable=true cpu_launcher_enable=true cpu_launcher_args=--disable_numactl ``` -Launcher by default uses only non-hyperthreaded cores if hyperthreading is present to avoid core compute resource sharing. To use launcher with all cores, both physical and logical, add the following lines in `config.properties`. +Launcher by default uses only non-hyperthreaded cores to avoid core compute resource sharing. To use launcher with all cores, both physical and logical (hyperthreaded), add the following lines in `config.properties`. ``` ipex_enable=true cpu_launcher_enable=true @@ -53,30 +53,30 @@ Below is an example of passing multiple args to `cpu_launcher_args`. ``` ipex_enable=true cpu_launcher_enable=true -cpu_launcher_args=--use_logical_core --disable_numactl +cpu_launcher_args=--use_logical_core --disable_numactl ``` Some useful `cpu_launcher_args` to note are: 1. 
Memory Allocator: [ PTMalloc `--use_default_allocator` | *TCMalloc `--enable_tcmalloc`* | JeMalloc `--enable_jemalloc`] - * PyTorch by defualt uses PTMalloc. TCMalloc/JeMalloc generally gives better performance. + * PyTorch by default uses PTMalloc. TCMalloc/JeMalloc generally gives better performance. 2. OpenMP library: [GNU OpenMP `--disable_iomp` | *Intel OpenMP*] * PyTorch by default uses GNU OpenMP. Launcher by default uses Intel OpenMP. Intel OpenMP library generally gives better performance. 3. Node id: [`--node_id`] * Launcher by default uses all physical cores. Limit memory access to local memories on the Nth socket to avoid Non-Uniform Memory Access (NUMA). -Please refer to [here](https://github.com/intel/intel-extension-for-pytorch/blob/master/docs/tutorials/performance_tuning/launch_script.md) for a full list of tunable configuration of launcher. +Refer to the [performance tuning launch script](launch_script.md) for a full list of tunable configuration of launcher. Some notable launcher configurations are: -1. `--ninstances`: Number of instances for multi-instance inference/training. -2. `--instance_idx`: Launcher by default runs all `ninstances` when running multiple instances. Specifying `instance_idx` runs a single instance among `ninstances`. This is useful when running each instance independently. +1. `--ninstances`: Number of instances for multi-instance inference/training. +2. `--instance_idx`: Launcher by default runs all `ninstances` when running multiple instances. Specifying `instance_idx` runs a single instance among `ninstances`. This is useful when running each instance independently. -Please refer to [here](https://github.com/intel/intel-extension-for-pytorch/blob/master/docs/tutorials/performance_tuning/launch_script.md) for more details. +Refer to the [performance tuning launch script](launch_script.md) for more details. ## Creating and Exporting INT8 model for IPEX -Intel Extension for PyTorch supports both eager and torchscript mode. In this section, we show how to deploy INT8 model for IPEX. +Intel® Extension for PyTorch\* supports both eager and torchscript mode. In this section, we show how to deploy INT8 model for IPEX. -### 1. Creating a serialized file -First create `.pt` serialized file using IPEX INT8 inference. Here we show two examples with BERT and ResNet50. +### 1. Creating a serialized file +First create `.pt` serialized file using IPEX INT8 inference. Here we show two examples with BERT and ResNet50. 
#### BERT @@ -86,7 +86,7 @@ import intel_extension_for_pytorch as ipex import transformers from transformers import AutoModelForSequenceClassification, AutoConfig -# load the model +# load the model config = AutoConfig.from_pretrained( "bert-base-uncased", return_dict=False, torchscript=True, num_labels=2) model = AutoModelForSequenceClassification.from_pretrained( @@ -94,10 +94,10 @@ model = AutoModelForSequenceClassification.from_pretrained( model = model.eval() # define dummy input tensor to use for the model's forward call to record operations in the model for tracing -N, max_length = 1, 384 +N, max_length = 1, 384 dummy_tensor = torch.ones((N, max_length), dtype=torch.long) -# calibration +# calibration # ipex supports two quantization schemes to be used for activation: torch.per_tensor_affine and torch.per_tensor_symmetric # default qscheme is torch.per_tensor_affine conf = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine) @@ -107,21 +107,21 @@ with torch.no_grad(): with ipex.quantization.calibrate(conf): model(dummy_tensor, dummy_tensor, dummy_tensor) -# optionally save the configuraiton for later use +# optionally save the configuration for later use # save: # conf.save("model_conf.json") # load: # conf = ipex.quantization.QuantConf("model_conf.json") -# conversion +# conversion jit_inputs = (dummy_tensor, dummy_tensor, dummy_tensor) model = ipex.quantization.convert(model, conf, jit_inputs) -# save to .pt +# save to .pt torch.jit.save(model, 'bert_int8_jit.pt') ``` -#### ResNet50 +#### ResNet50 ``` import torch @@ -148,7 +148,7 @@ with torch.no_grad(): with ipex.quantization.calibrate(conf): model(dummy_tensor) -# optionally save the configuraiton for later use +# optionally save the configuration for later use # save: # conf.save("model_conf.json") # load: @@ -162,37 +162,37 @@ model = ipex.quantization.convert(model, conf, jit_inputs) torch.jit.save(model, 'rn50_int8_jit.pt') ``` -### 2. Creating a Model Archive -Once the serialized file ( `.pt`) is created, it can be used with `torch-model-archiver` as ususal. Use the following command to package the model. +### 2. Creating a Model Archive +Once the serialized file ( `.pt`) is created, it can be used with `torch-model-archiver` as usual. Use the following command to package the model. ``` -torch-model-archiver --model-name rn50_ipex_int8 --version 1.0 --serialized-file rn50_int8_jit.pt --handler image_classifier +torch-model-archiver --model-name rn50_ipex_int8 --version 1.0 --serialized-file rn50_int8_jit.pt --handler image_classifier ``` -### 3. Start TorchServe to serve the model -Make sure to set `ipex_enable=true` in `config.properties`. Use the following command to start TorchServe with IPEX. +### 3. Start TorchServe to serve the model +Make sure to set `ipex_enable=true` in `config.properties`. Use the following command to start TorchServe with IPEX. ``` torchserve --start --ncs --model-store model_store --ts-config config.properties ``` -### 4. Registering and Deploying model -Registering and deploying the model follows the same steps shown [here](https://pytorch.org/serve/use_cases.html). +### 4. Registering and Deploying model +Registering and deploying the model follows the same steps shown in the [PyTorch Use Case documentation](https://pytorch.org/serve/use_cases.html). 
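Before packaging, it can also help to sanity-check the serialized TorchScript file directly in Python. This is a minimal sketch, assuming the `rn50_int8_jit.pt` file produced in the ResNet50 example above:

```
import torch
import intel_extension_for_pytorch as ipex  # importing the extension registers the custom ops used by the traced model

# load the serialized INT8 TorchScript module and run a dummy batch through it
model = torch.jit.load('rn50_int8_jit.pt')
model.eval()

dummy_tensor = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    output = model(dummy_tensor)
print(output.shape)
```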
-## Benchmarking with Launcher +## Benchmarking with Launcher Launcher can be used with TorchServe official [benchmark](https://github.com/pytorch/serve/tree/master/benchmarks) to launch server and benchmark requests with optimal configuration on Intel hardware. In this section we provide examples of benchmarking with launcher with its default configuration. -Add the following lines to `config.properties` in the benchmark directory to use launcher with its default setting. +Add the following lines to `config.properties` in the benchmark directory to use launcher with its default setting. ``` ipex_enable=true cpu_launcher_enable=true ``` -The rest of the steps for benchmarking follows the same steps shown [here](https://github.com/pytorch/serve/tree/master/benchmarks). +The rest of the steps for benchmarking follows the same steps shown in the [benchmark documentation](https://github.com/pytorch/serve/tree/master/benchmarks). -`model_log.log` contains information and command that were used for this execution launch. +`model_log.log` contains information and command that were used for this execution launch. -CPU usage on a machine with Intel(R) Xeon(R) Platinum 8180 CPU, 2 sockets, 28 cores per socket, 2 threads per core is shown as below: +CPU usage on a machine with Intel(R) Xeon(R) Platinum 8180 CPU, 2 sockets, 28 cores per socket, 2 threads per core is shown as below: ![launcher_default_2sockets](https://user-images.githubusercontent.com/93151422/144373537-07787510-039d-44c4-8cfd-6afeeb64ac78.gif) ``` @@ -206,7 +206,7 @@ $ cat logs/model_log.log 2021-12-01 21:22:40,096 - __main__ - WARNING - Numa Aware: cores:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55] in different NUMA node ``` -CPU usage on a machine with Intel(R) Xeon(R) Platinum 8375C CPU, 1 socket, 2 cores per socket, 2 threads per socket is shown as below: +CPU usage on a machine with Intel(R) Xeon(R) Platinum 8375C CPU, 1 socket, 2 cores per socket, 2 threads per socket is shown as below: ![launcher_default_1socket](https://user-images.githubusercontent.com/93151422/144372993-92b2ca96-f309-41e2-a5c8-bf2143815c93.gif) ``` @@ -224,24 +224,24 @@ $ cat logs/model_log.log ![pdt_perf](https://github.com/min-jean-cho/frameworks.ai.pytorch.ipex-cpu-1/assets/93151422/a158ba6c-a151-4115-befb-39acb7545936) -Above shows performance improvement of Torchserve with IPEX and launcher on ResNet50 and BERT-base-uncased. Torchserve official [apache-bench benchmark](https://github.com/pytorch/serve/tree/master/benchmarks#benchmarking-with-apache-bench) on Amazon EC2 m6i.24xlarge was used to collect the results. Add the following lines in ```config.properties``` to reproduce the results. Notice that launcher is configured such that a single instance uses all physical cores on a single socket to avoid cross socket communication and core overlap. +Above shows performance improvement of Torchserve with IPEX and launcher on ResNet50 and BERT-base-uncased. Torchserve official [apache-bench benchmark](https://github.com/pytorch/serve/tree/master/benchmarks#benchmarking-with-apache-bench) on Amazon EC2 m6i.24xlarge was used to collect the results. Add the following lines in ```config.properties``` to reproduce the results. Notice that launcher is configured such that a single instance uses all physical cores on a single socket to avoid cross socket communication and core overlap. 
``` ipex_enable=true cpu_launcher_enable=true cpu_launcher_args=--node_id 0 --ninstance 1 --enable_jemalloc ``` -Use the following command to reproduce the results. +Use the following command to reproduce the results. ``` -python benchmark-ab.py --url {modelUrl} --input {inputPath} --concurrency 1 +python benchmark-ab.py --url {modelUrl} --input {inputPath} --concurrency 1 ``` -For example, run the following command to reproduce latency performance of ResNet50 with data type of IPEX int8 and batch size of 1. +For example, run the following command to reproduce latency performance of ResNet50 with data type of IPEX int8 and batch size of 1. ``` python benchmark-ab.py --url 'file:///model_store/rn50_ipex_int8.mar' --concurrency 1 ``` -For example, run the following command to reproduce latency performance of BERT with data type of IPEX int8 and batch size of 1. +For example, run the following command to reproduce latency performance of BERT with data type of IPEX int8 and batch size of 1. ``` python benchmark-ab.py --url 'file:///model_store/bert_ipex_int8.mar' --input '../examples/Huggingface_Transformers/Seq_classification_artifacts/sample_text_captum_input.txt' --concurrency 1 ``` diff --git a/docs/tutorials/performance_tuning/tuning_guide.md b/docs/tutorials/performance_tuning/tuning_guide.md index 8e5af4890..7fec0896a 100644 --- a/docs/tutorials/performance_tuning/tuning_guide.md +++ b/docs/tutorials/performance_tuning/tuning_guide.md @@ -3,62 +3,64 @@ Performance Tuning Guide ## Overview -Intel Extension for PyTorch (IPEX) is a Python package to extend official PyTorch. It is designed to make the Out-of-Box user experience of PyTorch CPU better while achieving good performance. To fully utilize the power of Intel® architecture and thus yield high performance, PyTorch, as well as IPEX, are powered by [oneAPI Deep Neural Network Library (oneDNN)](https://github.com/oneapi-src/oneDNN), an open-source cross-platform performance library of basic building blocks for deep learning applications. It is developed and optimized for Intel Architecture Processors, Intel Processor Graphics and Xe architecture-based Graphics. - -Although by default primitives of PyTorch and IPEX are highly optimized, there are still something that users can do to optimize for performance further more. Most optimized configurations can be automatically set by the launcher script. This article introduces common methods that Intel developers recommend to take. - -- Hardware Configuration - - Intel CPU Structure - - Non-Uniform Memory Access (NUMA) -- Software Configuration - - Numactl - - OpenMP - - OMP_NUM_THREADS - - GNU OpenMP - - GOMP_CPU_AFFINITY - - OMP_PROC_BIND - - OMP_SCHEDULE - - Intel OpenMP - - KMP_AFFINITY - - KMP_BLOCKTIME - - Memory Allocator - - Jemalloc - - TCMalloc - - Denormal Number +Intel® Extension for PyTorch\* (IPEX) is a Python package to extend official PyTorch. It makes the out-of-box user experience of PyTorch CPU better while achieving good performance. To fully utilize the power of Intel® architecture and thus yield high performance, PyTorch, as well as IPEX, are powered by [oneAPI Deep Neural Network Library (oneDNN)](https://github.com/oneapi-src/oneDNN), an open-source cross-platform performance library of basic building blocks for deep learning applications. It is developed and optimized for Intel Architecture Processors, Intel Processor Graphics, and Xe architecture-based Graphics. 
+
+Although default primitives of PyTorch and IPEX are highly optimized, there are things users can do to improve performance. Most optimized configurations can be automatically set by the launcher script. This article introduces common methods recommended by Intel developers.
+
+## Contents of this Document
+* [Hardware Configuration](#hardware-configuration)
+  * [Intel CPU Structure](#intel-cpu-structure)
+  * [Non-Uniform Memory Access (NUMA)](#non-uniform-memory-access-numa)
+* [Software Configuration](#software-configuration)
+  * [Channels Last](#channels-last)
+  * [Numactl](#numactl)
+  * [OpenMP](#openmp)
+    * [OMP_NUM_THREADS](#omp-num-threads)
+    * [GNU OpenMP](#gnu-openmp)
+    * [Intel OpenMP](#intel-openmp)
+  * [Memory Allocator](#memory-allocator)
+    * [Jemalloc](#jemalloc)
+    * [TCMalloc](#tcmalloc)
+  * [Denormal Number](#denormal-number)
+  * [OneDNN primitive cache](#onednn-primitive-cache)

## Hardware Configuration

-This section briefly instroduces structure of Intel CPUs, as well as concept of Non-Uniform Memory Access (NUMA), as background knowledges.
+This section briefly introduces the structure of Intel CPUs, as well as the concept of Non-Uniform Memory Access (NUMA).

### Intel CPU Structure

-There are a bunch of SKUs or families of Intel CPUs. In this article, Intel® Xeon® processor Scalable family is used as an example to show briefly what is Intel CPU, and how it works. Understanding these background knowledge is helpful to understand the optimization methodologies that Intel engineers recommend to use.
+There are many families of Intel CPUs. We'll use the Intel® Xeon® processor Scalable family as an example to discuss an Intel CPU and how it works. Understanding this background knowledge is helpful for understanding the PyTorch optimization methodologies that Intel engineers recommend.

-![Intel® Xeon® processor Scalable family](https://www.trentonsystems.com/hs-fs/hubfs/Intel-Xeon-Scalable-1.jpg?width=2520&name=Intel-Xeon-Scalable-1.jpg)
+On the Intel® Xeon® Scalable Processors with Intel® C620 Series Chipsets (formerly Purley) platform, each chip provides up to 28 cores. Each core has a non-inclusive last-level cache and a 1MB L2 cache. The CPU features fast 2666 MHz DDR4 memory, six memory channels per CPU, Intel Ultra Path Interconnect (UPI) high speed point-to-point processor interconnect, and more. Figure 1 shows the microarchitecture of the Intel® Xeon® processor Scalable family chips. Each CPU chip consists of a number of cores, along with core-specific cache. 6 channels of DDR4 memory are connected to the chip directly. Meanwhile, chips communicate through the Intel UPI interconnect, which features a transfer speed of up to 10.4 GT/s.

-Figure 1.1 Intel® Xeon® processor Scalable family
-
-Figure 1.1 shows a series of Intel Xeon processor Scalable family CPU chips. On the Purley platform each chip provides up to 28 cores. Each core has a non-inclusive last-level cache and an 1MB L2 cache. The CPU features fast 2666 MHz DDR4 memory, six memory channels per CPU, Intel Ultra Path Interconnect (UPI) high speed point-to-point processor interconnect, and more. Figure 1.2 shows microarchitecture of the Intel® Xeon® processor Scalable family chips. Each CPU chip consists of a number of cores, along with core-specific cache. 6 channels of DDR4 memory are connected to the chip directly. Meanwhile, chips communicates through the Intel UPI interconnect, which features a transfer speed of up to 10.4 GT/s.
+
![Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture](https://software.intel.com/content/dam/develop/external/us/en/images/xeon-processor-scalable-family-tech-overview-fig03-737410.png) -Figure 1.2 Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture +Figure 1: Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture. + +
-Usually, a CPU chip is called a socket. A typical two-socket configuration is illustrated as Figure 1.3. Two CPU chips, or say two sockets, are equipped on one motherboard. Each socket is connected to up to 6 channels of memory, which is called its local memory, from socket perspective. Sockets are connected to each other via Intel UPI. It is possible for each socket to access memories attached on other sockets, usually called remote memory access. Local memory access is always faster than remote memory access. Meanwhile, cores on one socket share a space of high speed cache memory, which is much faster than communication via Intel UPI. Figure 1.4 shows an ASUS Z11PA-D8 Intel® Xeon® server motherboard, equipping with two sockets for Intel® Xeon® processor Scalable family CPUs.
+Usually, a CPU chip is called a socket. A typical two-socket configuration is illustrated in Figure 2. Two CPU sockets are equipped on one motherboard. Each socket is connected to up to 6 channels of memory, called its local memory from the socket's perspective. Sockets are connected to each other via Intel UPI. It is possible for each socket to access memory attached to other sockets, usually called remote memory access. Local memory access is always faster than remote memory access. Meanwhile, cores on one socket share a space of high-speed cache memory, which is much faster than communication via Intel UPI. Figure 3 shows an ASUS Z11PA-D8 Intel® Xeon® server motherboard, equipped with two sockets for Intel® Xeon® processor Scalable family CPUs.
+
+
![Typical two-socket configuration](https://software.intel.com/content/dam/develop/external/us/en/images/xeon-processor-scalable-family-tech-overview-fig06-737410.png) -Figure 1.3 Typical two-socket configuration +Figure 2: Typical two-socket configuration. ![ASUS Z11PA-D8 Intel® Xeon® server motherboard](https://dlcdnimgs.asus.com/websites/global/products/MCCApMgGOdr9WJxN/MB-Z11PAD8-overview-01-s.jpg) -Figure 1.4 An ASUS Z11PA-D8 Intel® Xeon® server motherboard. It contains two sockets for Intel® Xeon® processor Scalable family CPUs. +Figure 3: An ASUS Z11PA-D8 Intel® Xeon® server motherboard. It contains two sockets for Intel® Xeon® processor Scalable family CPUs. + +
### Non-Uniform Memory Access (NUMA)

It is a good thing that more and more CPU cores are provided to users in one socket, because this brings more computation resources. However, this also brings memory access competitions. Program can stall because memory is busy to visit. To address this problem, Non-Uniform Memory Access (NUMA) was introduced. Comparing to Uniform Memory Access (UMA), in which scenario all memories are connected to all cores equally, NUMA tells memories into multiple groups. Certain number of memories are directly attached to one socket's integrated memory controller to become local memory of this socket. As described in the previous section, local memory access is much faster than remote memory access.

-Usrs can get CPU information with ```lscpu``` command on Linux to learn how many cores, sockets there on the machine. Also, NUMA information like how CPU cores are distributed can also be retrieved. The following is an example of ```lscpu``` execution on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. 2 sockets were detected. Each socket has 28 physical cores onboard. Since Hyper-Threading is enabled, each core can run 2 threads. I.e. each socket has another 28 logical cores. Thus, there are 112 CPU cores on service. When indexing CPU cores, usually physical cores are indexed prior to logical core. In this case, the first 28 cores (0-27) are physical cores on the first NUMA socket (node), the second 28 cores (28-55) are physical cores on the second NUMA socket (node). Logical cores are indexed afterward. 56-83 are 28 logical cores on the first NUMA socket (node), 84-111 are the second 28 logical cores on the second NUMA socket (node). Typically, running IPEX should avoid using logical cores to get a good performance.
+Users can get CPU information with the ```lscpu``` command on Linux to learn how many cores and sockets are on the machine. NUMA information, such as how CPU cores are distributed, can also be retrieved. The following is an example of ```lscpu``` execution on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. 2 sockets were detected. Each socket has 28 physical cores onboard. Since Hyper-Threading is enabled, each core can run 2 threads. That is, each socket has another 28 logical cores. Thus, there are 112 CPU cores in service. When indexing CPU cores, usually physical cores are indexed before logical cores. In this case, the first 28 cores (0-27) are physical cores on the first NUMA socket (node), the second 28 cores (28-55) are physical cores on the second NUMA socket (node). Logical cores are indexed afterward. 56-83 are 28 logical cores on the first NUMA socket (node), 84-111 are the second 28 logical cores on the second NUMA socket (node). Typically, running IPEX should avoid using logical cores to get good performance.
```
$ lscpu
...
@@ -82,17 +84,17 @@ This section introduces software configurations that helps to boost performance.

### Channels Last

-Please take advantage of **Channels Last** memory format for image processing tasks. Comparing to PyTorch default NCHW (`torch.contiguous_format`) memory format, NHWC (`torch.channels_last`) is more friendly to Intel platforms, and thus generally yields better performance. More detailed introduction can be found at [Channels Last page](../features/nhwc.html). You can get sample codes with Resnet50 at [Example page](../examples.html).
+Take advantage of **Channels Last** memory format for image processing tasks.
Compared to the PyTorch default NCHW (`torch.contiguous_format`) memory format, NHWC (`torch.channels_last`) is more friendly to Intel platforms, and thus generally yields better performance. A more detailed introduction can be found on the [Channels Last page](../features/nhwc.md). You can get sample code with ResNet50 on the [Example page](../examples.md).

### Numactl

Since NUMA largely influences memory access performance, this functionality should also be implemented in software side.

-During development of Linux kernels, more and more sophisticated implementations/optimizations/strategies had been brought out. Version 2.5 of the Linux kernel already contained basic NUMA support, which was further improved in subsequent kernel releases. Version 3.8 of the Linux kernel brought a new NUMA foundation that allowed development of more efficient NUMA policies in later kernel releases. Version 3.13 of the Linux kernel brought numerous policies that aim at putting a process near its memory, together with the handling of cases such as having memory pages shared between processes, or the use of transparent huge pages; new sysctl settings allow NUMA balancing to be enabled or disabled, as well as the configuration of various NUMA memory balancing parameters.[1] Behaviour of Linux kernels are thus different according to kernel version. Newer Linux kernels may contain further optimizations of NUMA strategies, and thus have better performances. For some workloads, NUMA strategy influences performance great.
+During development of Linux kernels, more and more sophisticated implementations, optimizations, and strategies have been introduced. Version 2.5 of the Linux kernel already contained basic NUMA support, which was further improved in subsequent kernel releases. Version 3.8 of the Linux kernel brought a new NUMA foundation that allowed development of more efficient NUMA policies in later kernel releases. Version 3.13 of the Linux kernel brought numerous policies that aim at putting a process near its memory, together with the handling of cases such as having memory pages shared between processes, or the use of transparent huge pages. New sysctl settings allow NUMA balancing to be enabled or disabled, as well as the configuration of various NUMA memory balancing parameters.[1] The behavior of the Linux kernel thus differs with kernel version. Newer Linux kernels may contain further optimizations of NUMA strategies, and thus deliver better performance. For some workloads, the NUMA strategy greatly influences performance.

-Linux provides a tool, ```numactl```, to allow users to control NUMA policy for processes or shared memory. It runs processes with a specific NUMA scheduling or memory placement policy. As described in previous section, cores share high-speed cache in one socket, thus it is a good idea to avoid cross socket computations. From memory access perspective, bounding memory access to local ones is much faster than accessing remote memories.
+Linux provides a tool, ```numactl```, that allows users to control the NUMA policy for processes or shared memory. It runs processes with a specific NUMA scheduling or memory placement policy. As described in the previous section, cores on one socket share high-speed cache, so it is a good idea to avoid cross-socket computation. From a memory access perspective, binding memory access to local memory is much faster than accessing remote memory.
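+Once a process has been launched under such a binding (for example, with the numactl command shown below), you can confirm from inside Python which cores the process is allowed to run on. This is a small illustrative sketch, not part of the original tuning guide; note that `os.sched_getaffinity` is available on Linux only.
+
+```
+import os
+
+# Set of logical core IDs the current process may be scheduled on;
+# under numactl or taskset this is a subset of all cores in the system.
+allowed_cores = sorted(os.sched_getaffinity(0))
+print(f"{len(allowed_cores)} of {os.cpu_count()} logical cores usable: {allowed_cores}")
+```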
-The following is an example of numactl usage to run a workload on the Nth socket, and limit memory access to its local memories on the Nth socket. More detailed description of numactl command can be found [here](https://linux.die.net/man/8/numactl).
+The following is an example of numactl usage to run a workload on the Nth socket and limit memory access to the local memory of the Nth socket. A more detailed description of the numactl command can be found [on the numactl man page](https://linux.die.net/man/8/numactl).

```numactl --cpunodebind N --membind N python