From 5d07b1bbee2aefbb709849e4549e02e1e59e7b0f Mon Sep 17 00:00:00 2001
From: Jing Xu
Date: Thu, 19 May 2022 17:31:57 +0900
Subject: [PATCH] updated installation guide and blogs_publications (#773)

add wheel links
fine tune docs
fine tune
fine tune
fine tune
fine tune
---
 docs/tutorials/blogs_publications.md |  1 +
 docs/tutorials/examples.md           | 81 +++++++++++++++++++++++-----
 docs/tutorials/installation.md       |  9 +++-
 docs/tutorials/releases.md           | 15 ++++++
 4 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/docs/tutorials/blogs_publications.md b/docs/tutorials/blogs_publications.md
index fcc05ff5d..ec01860d9 100644
--- a/docs/tutorials/blogs_publications.md
+++ b/docs/tutorials/blogs_publications.md
@@ -1,6 +1,7 @@
 Blogs & Publications
 ====================
 
+* [Accelerating PyTorch with Intel® Extension for PyTorch\*](https://medium.com/pytorch/accelerating-pytorch-with-intel-extension-for-pytorch-3aef51ea3722)
 * [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intel-facebook-boost-bfloat16.html)
 * [Accelerate PyTorch with the extension and oneDNN using Intel BF16 Technology](https://medium.com/pytorch/accelerate-pytorch-with-ipex-and-onednn-using-intel-bf16-technology-dca5b8e6b58f)
   * *Note*: APIs mentioned in it are deprecated.
diff --git a/docs/tutorials/examples.md b/docs/tutorials/examples.md
index 7058537ec..9eaf84d65 100644
--- a/docs/tutorials/examples.md
+++ b/docs/tutorials/examples.md
@@ -7,12 +7,20 @@ Examples
 
 #### Code Changes Highlight
 
+There are only a few lines of code change required to use Intel® Extension for PyTorch\* for training.
+
+The recommended code changes are:
+1. `torch.channels_last` is recommended to be applied to both the model object and its input data to improve CPU resource usage efficiency.
+2. The `ipex.optimize` function applies optimizations to the model object, as well as to an optimizer object.
+
+
 ```
 ...
 import torch
 import intel_extension_for_pytorch as ipex
 ...
 model = Model()
+model = model.to(memory_format=torch.channels_last)
 criterion = ...
 optimizer = ...
 model.train()
@@ -56,6 +64,7 @@ train_loader = torch.utils.data.DataLoader(
 )
 
 model = torchvision.models.resnet50()
+model = model.to(memory_format=torch.channels_last)
 criterion = torch.nn.CrossEntropyLoss()
 optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
 model.train()
@@ -104,6 +113,7 @@ train_loader = torch.utils.data.DataLoader(
 )
 
 model = torchvision.models.resnet50()
+model = model.to(memory_format=torch.channels_last)
 criterion = torch.nn.CrossEntropyLoss()
 optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
 model.train()
@@ -116,7 +126,7 @@ for batch_idx, (data, target) in enumerate(train_loader):
         data = data.to(memory_format=torch.channels_last)
         output = model(data)
         loss = criterion(output, target)
-    loss.backward() 
+    loss.backward()
     optimizer.step()
     print(batch_idx)
 torch.save({
@@ -193,6 +203,10 @@ torch.save({
 
 ## Inference
 
+Channels last is a memory layout format that is more friendly to Intel Architecture. It is recommended for computer vision workloads, and using it is as simple as invoking the `to(memory_format=torch.channels_last)` function on the model object and input data.
+
+Moreover, the `optimize` function of Intel® Extension for PyTorch\* applies optimizations to the model and can bring additional performance boosts. For both computer vision and NLP workloads, it is recommended to apply the `optimize` function to the model object.
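+The snippet below is a minimal sketch that combines these two recommendations for inference; the `Model` class and the input shape are placeholders for illustration only and are not part of the examples that follow.
+
+```
+import torch
+import intel_extension_for_pytorch as ipex
+
+model = Model()                                       # placeholder model
+model.eval()
+model = model.to(memory_format=torch.channels_last)   # channels last memory format
+model = ipex.optimize(model)                          # apply extension optimizations
+
+data = torch.rand(1, 3, 224, 224).to(memory_format=torch.channels_last)  # placeholder input
+with torch.no_grad():
+    model(data)
+```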
+
 ### Float32
 
 #### Imperative Mode
 
@@ -244,7 +258,7 @@ with torch.no_grad():
 
 #### TorchScript Mode
 
-It is highly recommended for users to take advantage of Intel® Extension for PyTorch* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations.
+It is highly recommended for users to take advantage of Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations.
 
 ##### Resnet50
 
@@ -301,6 +315,10 @@ with torch.no_grad():
 
 ### BFloat16
 
+Similar to running with FP32, the `optimize` function also works for the BFloat16 data type. The only difference is setting the `dtype` parameter to `torch.bfloat16`.
+
+Auto Mixed Precision (AMP) is recommended to be used together with the BFloat16 data type.
+
 #### Imperative Mode
 
 ##### Resnet50
 
@@ -352,7 +370,7 @@ with torch.no_grad():
 
 #### TorchScript Mode
 
-It is highly recommended for users to take advantage of Intel® Extension for PyTorch* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations.
+It is highly recommended for users to take advantage of Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations.
 
 ##### Resnet50
 
@@ -412,6 +430,18 @@ with torch.no_grad():
 
 #### Calibration
 
+To calibrate a model for the INT8 data type, the required code changes are highlighted in the code snippet below.
+
+Please follow the steps below:
+
+1. Use the `torch.fx.experimental.optimization.fuse` function to perform op folding for better performance.
+2. Import `intel_extension_for_pytorch` as `ipex`.
+3. Instantiate a config object with the `ipex.quantization.QuantConf` function to save configuration data during calibration.
+4. Iterate through the calibration dataset under the `ipex.quantization.calibrate` scope to perform the calibration.
+5. Save the calibration data into a `json` file.
+6. Invoke the `ipex.quantization.convert` function to apply the calibration configuration object to the FP32 model object to get an INT8 model.
+7. Save the INT8 model into a `pt` file.
+
 ```
 import os
 import torch
@@ -420,31 +450,42 @@ model = Model()
 model.eval()
 data = torch.rand()
 
-# Applying torch.fx.experimental.optimization.fuse against model performs 
+# Applying torch.fx.experimental.optimization.fuse against model performs
 # conv-batchnorm folding for better performance.
 import torch.fx.experimental.optimization as optimization
 model = optimization.fuse(model, inplace=True)
 
 #################### code changes ####################
 import intel_extension_for_pytorch as ipex
-conf = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine) 
-######################################################
+conf = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine)
 
-for d in calibration_data_loader(): 
-  # conf will be updated with observed statistics during calibrating with the dataset 
+for d in calibration_data_loader():
+  # conf will be updated with observed statistics during calibrating with the dataset
   with ipex.quantization.calibrate(conf):
-    model(d) 
+    model(d)
 conf.save('int8_conf.json', default_recipe=True)
 
 with torch.no_grad():
-  model = ipex.quantization.convert(model, conf, torch.rand()) 
-  model.save('quantization_model.pt')
+  model = ipex.quantization.convert(model, conf, torch.rand())
+######################################################
+
+model.save('quantization_model.pt')
 ```
 
 #### Deployment
 
 ##### Imperative Mode
 
+In imperative mode, the INT8 model conversion is done on the fly.
+
+Please follow the steps below:
+
+1. Use the `torch.fx.experimental.optimization.fuse` function to perform op folding for better performance.
+2. Import `intel_extension_for_pytorch` as `ipex`.
+3. Load the calibration configuration object from the saved file.
+4. Invoke the `ipex.quantization.convert` function to apply the calibration configuration object to the FP32 model object to get an INT8 model.
+5. Run inference.
+
 ```
 import torch
 
@@ -452,7 +493,7 @@ model = Model()
 model.eval()
 data = torch.rand()
 
-# Applying torch.fx.experimental.optimization.fuse against model performs 
+# Applying torch.fx.experimental.optimization.fuse against model performs
 # conv-batchnorm folding for better performance.
 import torch.fx.experimental.optimization as optimization
 model = optimization.fuse(model, inplace=True)
@@ -463,15 +504,25 @@ conf = ipex.quantization.QuantConf('int8_conf.json')
 ######################################################
 
 with torch.no_grad():
-  model = ipex.quantization.convert(model, conf, torch.rand()) 
+  model = ipex.quantization.convert(model, conf, torch.rand())
   model(data)
 ```
 
 ##### Graph Mode
 
+In graph mode, the INT8 model is loaded from the local file and can be used directly for inference.
+
+Please follow the steps below:
+
+1. Import `intel_extension_for_pytorch` as `ipex`.
+2. Load the INT8 model from the saved file.
+3. Run inference.
+
 ```
 import torch
+#################### code changes ####################
 import intel_extension_for_pytorch as ipex
+######################################################
 
 model = torch.jit.load('quantization_model.pt')
 model.eval()
@@ -481,6 +532,8 @@ with torch.no_grad():
   model(data)
 ```
 
+oneDNN provides the [oneDNN Graph Compiler](https://github.com/oneapi-src/oneDNN/tree/dev-graph-preview4/doc#onednn-graph-compiler) as a prototype feature that could boost performance for selected topologies. No code change is required. Please install [a binary](https://intel.github.io/intel-extension-for-pytorch/1.11.200/tutorials/installation.html#installation_onednn_graph_compiler) with this feature enabled. We verified this feature with `Bert-large`, `bert-base-cased`, `roberta-base`, `xlm-roberta-base`, `google-electra-base-generator` and `google-electra-base-discriminator`.
+
 ## C++
 
 To work with libtorch, C++ library of PyTorch, Intel® Extension for PyTorch\* provides its C++ dynamic library as well.
 The C++ library is supposed to handle inference workload only, such as service deployment. For regular development, please use Python interface. Comparing to usage of libtorch, no specific code changes are required, except for converting input data into channels last data format. Compilation follows the recommended methodology with CMake. Detailed instructions can be found in [PyTorch tutorial](https://pytorch.org/tutorials/advanced/cpp_export.html#depending-on-libtorch-and-building-the-application).
@@ -582,4 +635,4 @@ $ ldd example-app
 
 ## Model Zoo
 
-Use cases that had already been optimized by Intel engineers are available at [Model Zoo for Intel® Architecture](https://github.com/IntelAI/models/tree/pytorch-r1.10-models). A bunch of PyTorch use cases for benchmarking are also available on the [Github page](https://github.com/IntelAI/models/tree/pytorch-r1.10-models/benchmarks#pytorch-use-cases). You can get performance benefits out-of-box by simply running scipts in the Model Zoo.
+Use cases that have already been optimized by Intel engineers are available at [Model Zoo for Intel® Architecture](https://github.com/IntelAI/models/tree/pytorch-r1.11-models). A bunch of PyTorch use cases for benchmarking are also available on the [Github page](https://github.com/IntelAI/models/tree/pytorch-r1.11-models/benchmarks#pytorch-use-cases). You can get performance benefits out of the box by simply running scripts in the Model Zoo.
diff --git a/docs/tutorials/installation.md b/docs/tutorials/installation.md
index 8e70d09c9..479b80b47 100644
--- a/docs/tutorials/installation.md
+++ b/docs/tutorials/installation.md
@@ -43,6 +43,7 @@ Prebuilt wheel files availability matrix for Python versions
 
 | Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 |
 | :--: | :--: | :--: | :--: | :--: | :--: |
+| 1.11.200 |  | ✔️ | ✔️ | ✔️ | ✔️ |
 | 1.11.0 |  | ✔️ | ✔️ | ✔️ | ✔️ |
 | 1.10.100 | ✔️ | ✔️ | ✔️ | ✔️ |  |
 | 1.10.0 | ✔️ | ✔️ | ✔️ | ✔️ |  |
@@ -63,6 +64,11 @@ Alternatively, you can also install the latest version with the following command:
 python -m pip install intel_extension_for_pytorch -f https://software.intel.com/ipex-whl-stable
 ```
 
+For pre-built wheel files with [oneDNN Graph Compiler](#installation_onednn_graph_compiler), please use the following command to perform the installation.
+```
+python -m pip install intel_extension_for_pytorch -f https://developer.intel.com/ipex-whl-dev
+```
+
 **Note:** For version prior to 1.10.0, please use package name `torch_ipex`, rather than `intel_extension_for_pytorch`.
 
 **Note:** To install a package with a specific version, please run with the following command.
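+The command below is an illustrative example only; it combines the package name, a version number, and the repository URL that appear elsewhere in this guide:
+```bash
+# illustrative example; substitute the version you actually need
+python -m pip install intel_extension_for_pytorch==1.11.200 -f https://software.intel.com/ipex-whl-stable
+```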
@@ -76,7 +82,7 @@ python -m pip install <package_name>==<version> -f https://software.intel.c
 ```bash
 git clone --recursive https://github.com/intel/intel-extension-for-pytorch
 cd intel-extension-for-pytorch
-git checkout v1.11.0
+git checkout v1.11.200
 
 # if you are updating an existing checkout
 git submodule sync
@@ -119,6 +125,7 @@ docker pull intel/intel-optimized-pytorch:latest
 
 |Version|Pre-cxx11 ABI|cxx11 ABI|
 |--|--|--|
+| 1.11.200 | [libintel-ext-pt-1.11.200+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-shared-with-deps-1.11.200%2Bcpu.run) | [libintel-ext-pt-cxx11-abi-1.11.200+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-cxx11-abi-shared-with-deps-1.11.200%2Bcpu.run) |
 | 1.11.0 | [libintel-ext-pt-1.11.0+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-1.11.0%2Bcpu.run) | [libintel-ext-pt-cxx11-abi-1.11.0+cpu.run](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/libtorch_zip/libintel-ext-pt-cxx11-abi-1.11.0%2Bcpu.run) |
 | 1.10.100 | [libtorch-shared-with-deps-1.10.0%2Bcpu-intel-ext-pt-cpu-1.10.100.zip](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/wheels/v1.10/libtorch-shared-with-deps-1.10.0%2Bcpu-intel-ext-pt-cpu-1.10.100.zip) | [libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu-intel-ext-pt-cpu-1.10.100.zip](http://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/wheels/v1.10/libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu-intel-ext-pt-cpu-1.10.100.zip) |
 | 1.10.0 | [intel-ext-pt-cpu-libtorch-shared-with-deps-1.10.0+cpu.zip](https://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/wheels/v1.10/intel-ext-pt-cpu-libtorch-shared-with-deps-1.10.0%2Bcpu.zip) | [intel-ext-pt-cpu-libtorch-cxx11-abi-shared-with-deps-1.10.0+cpu.zip](https://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/wheels/v1.10/intel-ext-pt-cpu-libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu.zip) |
diff --git a/docs/tutorials/releases.md b/docs/tutorials/releases.md
index 8d453c45e..df709f924 100644
--- a/docs/tutorials/releases.md
+++ b/docs/tutorials/releases.md
@@ -1,6 +1,21 @@
 Releases
 =============
 
+## 1.11.200
+
+### Highlights
+
+- Enable more fused operators to accelerate particular models.
+  - Fuse `Convolution` and `LeakyReLU` ([#648](https://github.com/intel/intel-extension-for-pytorch/commit/d7603133f37375b3aba7bf744f1095b923ba979e))
+  - Support [`torch.einsum`](https://pytorch.org/docs/stable/generated/torch.einsum.html) and fuse it with `add` ([#684](https://github.com/intel/intel-extension-for-pytorch/commit/b66d6d8d0c743db21e534d13be3ee75951a3771d))
+  - Fuse `Linear` and `Tanh` ([#685](https://github.com/intel/intel-extension-for-pytorch/commit/f0f2bae96162747ed2a0002b274fe7226a8eb200))
+- In addition to the original installation methods, this release provides Docker installation from [DockerHub](https://hub.docker.com/).
+- Provide [evaluation wheel packages](https://intel.github.io/intel-extension-for-pytorch/1.11.200/tutorials/installation.html#installation_onednn_graph_compiler) that could boost performance for selected topologies on top of the oneDNN Graph Compiler prototype feature.
+  ***NOTE***: This is still at an early development stage and not fully mature yet, but feel free to reach out through GitHub tickets if you have any suggestions.
+
+**[Full Changelog](https://github.com/intel/intel-extension-for-pytorch/compare/v1.11.0...v1.11.200)**
+
+
 ## 1.11.0
 
 We are excited to announce Intel® Extension for PyTorch\* 1.11.0-cpu release by tightly following PyTorch 1.11 release. Along with extension 1.11, we focused on continually improving OOB user experience and performance. Highlights include: