update tutorial for 1.12 release (#942)
update the runtime document

update known issue of runtime extension

[doc] update supported fusion patterns of fp32/bf16/int8 (#854)

* update supported fusion patterns of fp32/bf16/int8

* fix typo

doc: editor review of all tutorial docs (#863)

- Lots of edits across the tutorial documents for grammar, clarity,
  simplification, and spelling
- Fixed malformed md and rst causing layout issues (including indenting)
- Removed trailing whitespace
- Fixed UTF-8 characters in code examples (e.g., curly quotes vs.
  straight quotes)
- Changed pygments language (code highlight) to bash for unsupported cmd
- Changed absolute links to relative where appropriate.
- Added toctree items to make documents visible in navigation menu.

Signed-off-by: David B. Kinder <[email protected]>

update docs

update int8.md

update performance page with tunable parameters description

update int8 example

update torch-ccl package name

update version in README

update int8.md: change custom qconfig for dynamic quantization

Add performance tuning guide for OneDNN primitive cache (#905)

* Add performance tuning guide for OneDNN primitive cache

* Update docs/tutorials/performance_tuning/tuning_guide.md

Co-authored-by: Jiong Gong <[email protected]>

* Update tuning_guide.md

Co-authored-by: Jiong Gong <[email protected]>

update doc for autocast (#899)

add 2 known issues of MultiStreamModule

update known issues

update known issues

update int8 doc

add 1.12 release notes

correct intel_extension_for_pytorch_structure.png

update release notes, correct model zoo url in examples

update docs

update docs

update graph_optimization.md
jingxu10 authored Jul 6, 2022
1 parent 1f633c0 commit 2e4d306
Showing 26 changed files with 1,458 additions and 690 deletions.
70 changes: 37 additions & 33 deletions docs/design_doc/isa_dyndisp.md
# Intel® Extension for PyTorch\* CPU ISA Dynamic Dispatch Design Doc

This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch\* (IPEX) based on CPU ISA. It is an extension to the similar mechanism in PyTorch.

## Overview

IPEX dyndisp is forked from **PyTorch:** `ATen/native/DispatchStub.h` and `ATen/native/DispatchStub.cpp`. IPEX adds additional CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16` and `AMX`.

PyTorch & IPEX CPU ISA support statement:
| | DEFAULT | AVX2 | AVX512 | AVX512_VNNI | AVX512_BF16 | AMX |
| AVX512_BF16 | GCC 10.3+ |
| AMX | GCC 11.2+ |

\* Check with `cmake/Modules/FindAVX.cmake` for detailed compiler checks.

## Dynamic Dispatch Design

Dynamic dispatch copies the kernel implementation source files to multiple folders for each ISA level. It then builds each file using its ISA-specific parameters. Each generated object file will contain its function body (**Kernel Implementation**).

Kernel Implementation uses an anonymous namespace so that different CPU versions won't conflict.

**Kernel Stub** is a "virtual function" with polymorphic kernel implementations pertaining to ISA levels.

At runtime, the **Dispatch Stub implementation** checks CPUIDs and OS status to determine which ISA level pointer best matches the function body.

### Code Folder Struct
>#### **Kernel implementation:** `intel_extension_for_pytorch/csrc/aten/cpu/kernels/xyzKrnl.cpp`
>#### **Kernel Stub:** `intel_extension_for_pytorch/csrc/aten/cpu/xyz.cpp` and `intel_extension_for_pytorch/csrc/aten/cpu/xyz.h`
>#### **Dispatch Stub implementation:** `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.cpp` and `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h`
The IPEX build system generates code for each ISA level with ISA-specific compiler parameters. The CodeGen copies each cpp file from the **Kernel implementation** folder and adds the ISA level as a new file suffix.

> **Sample:**
>
> ----
>
> **Origin file:**
>
> `intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp`
>
> AVX512_BF16: `build/Release/intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_BF16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -DCPU_CAPABILITY=AVX512_BF16 -DCPU_CAPABILITY_AVX512_BF16`
>
> AMX: `build/Release/intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp.AMX.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -DCPU_CAPABILITY=AMX -DCPU_CAPABILITY_AMX`
---

>**Note:**
>1. DEFAULT level kernels are not fully implemented in IPEX. To align with PyTorch, the default build uses AVX2 parameters instead. So the minimum requirement for a machine to run IPEX is AVX2 support.
>2. `-D__AVX__` and `-D__AVX512F__` are defined for the dependency library [sleef](https://sleef.org/).
>5. A higher ISA level is compatible with lower ISA levels, so it needs to contain the lower-level ISA feature definitions. For example, AVX512_BF16 needs to contain `-DCPU_CAPABILITY_AVX512` and `-DCPU_CAPABILITY_AVX512_VNNI`. But AVX512 does not contain AVX2 definitions, because they have different vec register widths.
## Add Custom Kernel

If you want to add a new custom kernel, and the kernel uses CPU ISA instructions, refer to these tips:

1. Add the CPU ISA related kernel implementation to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/kernels/NewKernelKrnl.cpp`
2. Add the kernel stub to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/NewKernel.cpp`
3. Include the header file `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h`, and refer to the comments in the header file.
```c++
// Implements instruction set specific function dispatch.
//
// ...
```

>**Note:**
>
>1. Some kernels only call the **oneDNN** or **iDeep** implementation, or another backend implementation, and do not need a kernel implementation of their own. (Refer: `BatchNorm.cpp`)
>2. Vec related header files must be included in kernel implementation files, but cannot be included in the kernel stub. The kernel stub is common code for all ISA levels and can't pass ISA related compiler parameters.
>3. For more intrinsics, check the [Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
### ISA intrinsics specific kernel example:

```c++
void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len) {
  // ... (function body elided in this excerpt) ...
}
```
The macros `CPU_CAPABILITY_AVX512` and `CPU_CAPABILITY_AVX512_BF16` are defined by compiler checks. They mean that the current compiler has the capability to generate code for the corresponding ISA level.

Because `AVX512_BF16` is a higher level than `AVX512` and is compatible with it, the `CPU_CAPABILITY_AVX512_BF16` region can be contained in the `CPU_CAPABILITY_AVX512` region.
```c++
//csrc/aten/cpu/kernels/CvtFp32ToBf16Krnl.cpp

// ... (kernel implementation elided in this excerpt) ...

REGISTER_DISPATCH(cvt_fp32_to_bf16_kernel_stub, &cvt_fp32_to_bf16_kernel_impl);
```
### Vec specific kernel example:
This example shows how to get the data type size and its Vec size. Under different ISAs, Vec has a different register width and thus a different Vec size.
```c++
//csrc/aten/cpu/GetVecLength.h
// ... (implementation elided in this excerpt) ...
```
## Private Debug APIs

Here are three ISA-related private APIs that can help with debugging:

1. Query current ISA level.
2. Query max CPU supported ISA level.
3. Query max binary supported ISA level.
>**Note:**
>
>1. Max CPU supported ISA level only depends on CPU features.
>2. Max binary supported ISA level only depends on the compiler version used to build the binary.
>3. The current ISA level is the smaller of `max CPU ISA level` and `max binary ISA level`.

### Example:
```bash
python
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ...
```
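
Below is a sketch of the elided part of the session above. `core._get_current_isa_level()` is confirmed elsewhere in this document; the names of the other two query functions are assumptions used for illustration and may differ in your build.

```python
import intel_extension_for_pytorch._C as core

# Current effective ISA level (also used in the examples below).
print(core._get_current_isa_level())

# Max CPU supported ISA level and max binary supported ISA level.
# NOTE: these two function names are assumptions for illustration only;
# check intel_extension_for_pytorch._C in your build for the exact names.
print(core._get_highest_cpu_support_isa_level())
print(core._get_highest_binary_support_isa_level())
```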

## Select ISA level manually.

By default, IPEX dispatches to the kernels with the maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable `ATEN_CPU_CAPABILITY` (same environment variable as PyTorch). The available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`}. The effective ISA level would be the minimal level between `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware.
### Example:
```bash
$ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
AMX
$ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
AVX2
```
>**Note:**
>
>`core._get_current_isa_level()` is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purpose only and subject to change.
## CPU feature check
---

An additional CPU feature check tool is provided in the subfolder: `tests/cpu/isa`

```bash
$ cmake .
-- The C compiler identification is GNU 11.2.1
-- The CXX compiler identification is GNU 11.2.1
...
amx_tile: true
amx_int8: true
prefetchw: true
prefetchwt1: false
```
17 changes: 11 additions & 6 deletions docs/index.rst
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to Intel® Extension for PyTorch* Documentation
######################################################

Intel® Extension for PyTorch* extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware. Example optimizations use AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX). Over time, most of these optimizations will be included directly in stock PyTorch releases.

Intel® Extension for PyTorch* provides optimizations for both eager mode and graph mode. However, compared to eager mode, graph mode in PyTorch normally yields better performance from optimization techniques such as operation fusion, and Intel® Extension for PyTorch* amplifies them with more comprehensive graph optimizations. Therefore, we recommend taking advantage of Intel® Extension for PyTorch* with `TorchScript <https://pytorch.org/docs/stable/jit.html>`_ whenever your workload supports it. You could choose to run with the `torch.jit.trace()` function or the `torch.jit.script()` function, but based on our evaluation, `torch.jit.trace()` supports more workloads, so we recommend `torch.jit.trace()` as your first choice. More detailed information can be found at the `pytorch.org website <https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html#tracing-modules>`_.
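
A minimal sketch of this workflow is shown below. The `intel_extension_for_pytorch.optimize` and `torch.jit.trace` calls are the APIs discussed above; the torchvision model and the input shape are only illustrative placeholders.

.. code-block:: python

   import torch
   import torchvision.models as models
   import intel_extension_for_pytorch as ipex

   # Illustrative model and sample input; substitute your own workload.
   model = models.resnet50(pretrained=True).eval()
   data = torch.rand(1, 3, 224, 224)

   # Apply Intel® Extension for PyTorch* optimizations (eager mode).
   model = ipex.optimize(model)

   # Convert to graph mode with TorchScript; trace is the recommended entry point.
   with torch.no_grad():
       model = torch.jit.trace(model, data)
       model = torch.jit.freeze(model)
       output = model(data)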

The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts users can enable it dynamically by importing `intel_extension_for_pytorch`.

Intel® Extension for PyTorch* is structured as shown in the following figure:

.. figure:: ../images/intel_extension_for_pytorch_structure.png
  :width: 800
  :align: center
  :alt: Structure of Intel® Extension for PyTorch*

|

PyTorch components are depicted with white boxes, while Intel® Extension for PyTorch* components are depicted with blue boxes. Extra performance is delivered via both custom addons and overriding existing PyTorch components. In eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and an INT8 quantization API. Further performance boosting is available by converting the eager-mode model into graph mode via the extended graph fusion passes. Intel® Extension for PyTorch* dispatches operators to their underlying kernels automatically based on the ISA it detects, and leverages the vectorization and matrix acceleration units available in Intel hardware as much as possible. The oneDNN library is used for computation-intensive operations. The Intel® Extension for PyTorch* runtime extension brings better efficiency with finer-grained thread runtime control and weight sharing.

Intel® Extension for PyTorch* has been released as an open-source project on `GitHub <https://github.com/intel/intel-extension-for-pytorch>`_.

3 changes: 1 addition & 2 deletions docs/tutorials/api_doc.rst
Quantization
************

.. automodule:: intel_extension_for_pytorch.quantization
.. autofunction:: prepare
.. autofunction:: convert
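
A minimal sketch of how `prepare` and `convert` are typically used for static quantization is shown below. The model, the sample inputs, the calibration set, and the `default_static_qconfig` helper are illustrative assumptions; see the INT8 tutorial for the exact qconfig helpers available in your version.

.. code-block:: python

   import torch
   import intel_extension_for_pytorch as ipex
   from intel_extension_for_pytorch.quantization import prepare, convert

   model = MyModel().eval()                      # placeholder model
   example_inputs = torch.rand(1, 3, 224, 224)   # placeholder input

   # Assumed qconfig helper; consult the INT8 tutorial for the exact name.
   qconfig = ipex.quantization.default_static_qconfig

   # Insert observers.
   prepared_model = prepare(model, qconfig, example_inputs=example_inputs, inplace=False)

   # Calibrate with representative data, then convert to a quantized model.
   with torch.no_grad():
       for data in calibration_data:             # placeholder calibration set
           prepared_model(data)
   quantized_model = convert(prepared_model)

   # Optionally trace for graph-mode execution.
   with torch.no_grad():
       traced = torch.jit.trace(quantized_model, example_inputs)
       traced = torch.jit.freeze(traced)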

CPU Runtime
***********
1 change: 1 addition & 0 deletions docs/tutorials/blogs_publications.md
Blogs & Publications
====================

* [Accelerating PyTorch with Intel® Extension for PyTorch\*](https://medium.com/pytorch/accelerating-pytorch-with-intel-extension-for-pytorch-3aef51ea3722)
* [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intel-facebook-boost-bfloat16.html)
* [Accelerate PyTorch with the extension and oneDNN using Intel BF16 Technology](https://medium.com/pytorch/accelerate-pytorch-with-ipex-and-onednn-using-intel-bf16-technology-dca5b8e6b58f)
* *Note*: APIs mentioned in that post are deprecated.