From 8bd78d683f480b99681a92e550f180010f1a2d72 Mon Sep 17 00:00:00 2001 From: ZhaoqiongZ <106125927+ZhaoqiongZ@users.noreply.github.com> Date: Tue, 30 Apr 2024 20:33:42 +0800 Subject: [PATCH] Doc content finetune (#4215) update known issues, profiler and torch.compile doc contents. --------- Co-authored-by: Ye Ting --- docs/index.rst | 2 +- docs/tutorials/features.rst | 22 +---- docs/tutorials/features/ipex_log.md | 89 +++++++++--------- docs/tutorials/features/profiler_kineto.md | 10 ++ docs/tutorials/features/profiler_legacy.md | 91 +------------------ docs/tutorials/features/torch_compile_gpu.md | 4 +- docs/tutorials/known_issues.md | 37 ++------ docs/tutorials/llm.rst | 4 +- .../llm/int4_weight_only_quantization.md | 2 +- docs/tutorials/releases.md | 18 ++++ docs/tutorials/technical_details.rst | 24 +++++ .../technical_details/ipex_optimize.md | 47 ++++++++++ examples/gpu/inference/python/llm/README.md | 15 ++- 13 files changed, 173 insertions(+), 192 deletions(-) create mode 100644 docs/tutorials/technical_details/ipex_optimize.md diff --git a/docs/index.rst b/docs/index.rst index 81b897a39..2325966a4 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -26,7 +26,7 @@ Intel® Extension for PyTorch* has been released as an open–source project at You can find more information about the product at: -- `Features `_ +- `Features `_ - `Performance `_ Architecture diff --git a/docs/tutorials/features.rst b/docs/tutorials/features.rst index f1c30b8ba..b4687da34 100644 --- a/docs/tutorials/features.rst +++ b/docs/tutorials/features.rst @@ -137,19 +137,6 @@ For more detailed information, check `torch.compile for GPU `_. - -.. toctree:: - :hidden: - :maxdepth: 1 - - features/profiler_legacy - Simple Trace Tool (Prototype) ----------------------------- @@ -191,14 +178,13 @@ For more detailed information, check `Compute Engine `_. +For more detailed information, check `IPEX_LOGGING `_. .. toctree:: :hidden: diff --git a/docs/tutorials/features/ipex_log.md b/docs/tutorials/features/ipex_log.md index e6691d97b..fbcf8d2b8 100644 --- a/docs/tutorials/features/ipex_log.md +++ b/docs/tutorials/features/ipex_log.md @@ -1,54 +1,57 @@ -IPEX Logging usage -=============================================== - - +`IPEX_LOGGING` (Prototype) +========================== ## Introduction -IPEX_LOGGING provides the capacity to log IPEX internal information. If you would like to use torch-style log, that is, the log/verbose is introduced by torch, and refer to cuda code, pls still use torch macro to show the log. For example, TORCH_CHECK, TORCH_ERROR. If the log is IPEX specific, or is going to trace IPEX execution, pls use IPEX_LOGGING. For some part of usage are still discussed with habana side, if has change some feature will update here. +`IPEX_LOGGING` provides the capability to log verbose information from Intel® Extension for PyTorch\* . Please use `IPEX_LOGGING` to get the log information or trace the execution from Intel® Extension for PyTorch\*. Please continue using PyTorch\* macros such as `TORCH_CHECK`, `TORCH_ERROR`, etc. to get the log information from PyTorch\*. 
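The snippet below is a minimal illustrative sketch of how the Python helpers documented in the "Usage in python" section of this file can be combined to enable this logging for a short XPU run; the chosen level, output path, and component filter are example values only.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the torch.xpu backend

# Route Intel® Extension for PyTorch* logs at INFO level (2) and above to a file,
# limited to the OPS and MEMORY components (see the tables and APIs below).
torch.xpu.set_log_level(2)
torch.xpu.set_log_output_file_path("./ipex.log")
torch.xpu.set_log_component("OPS;MEMORY")

# PyTorch-side diagnostics raised via TORCH_CHECK / TORCH_WARN are unaffected and
# still surface through the regular PyTorch error and warning paths.
x = torch.randn(1024, device="xpu")
y = torch.abs(x)  # XPU kernel launches are traced through IPEX_LOGGING
```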
-## Feature for IPEX Log -### Log level -Currently supported log level and usage are as follow, default using log level is `WARN`: +## `IPEX_LOGGING` Definition +### Log Level +The supported log levels are defined as follows, default log level is `DISABLED`: | log level | number | usage | | :----: | :----: | :----: | -| TRACE | 0 | reserve it for further usage extension| -| DEBUG | 1 | We would like to insert DEBUG inside each host function, when log level is debug, we can get the whole calling stack | -| INFO | 2 | Record calls to other library functions and environment variable settings, such as onemkl calling and set verbose level| -| WARN | 3 | On the second attempt of the program, such as memory reallocation | -| ERR | 4 | Found error in try catch | -| CRITICAL | 5 | reserve it for further usage extension | +| DISABLED | -1 | Disable the logging | +| TRACE | 0 | Reserve for further usage | +| DEBUG | 1 | Provide the whole calling stack info | +| INFO | 2 | Record calling info to other library functions and environment variable settings | +| WARN | 3 | Warn the second attempt of an action, such as memory reallocation | +| ERR | 4 | Report error in try catch | +| CRITICAL | 5 | Reserve for further usage | -### Log component -Log component is for specify which part of IPEX does this log belongs to, currently we have seprate IPEX into four parts, shown as table below. +### Log Component +Log component is used to specify which part from Intel® Extension for PyTorch\* does this log information belong to. The supported log components are defined as follows: | log component | description | | :----: | :----: -| OPS | Intergrate/Launch sycl onednn, onemkl ops | -| SYNGRAPH | Habana Syngraph related | +| OPS | Launch SYCL, oneDNN, oneMKL operators | +| SYNGRAPH | Syngraph related | | MEMORY | Allocate/Free memory, Allocate/Free cache | | RUNTIME | Device / Queue related | +| ALL | All output log | + +## Usage in C++ +All the usage are defined in `utils/LogUtils.h`. Currently Intel® Extension for PyTorch\* supports: -For `SYNGRAPH` you can also add log sub componment which is no restriction on categories. +### Simple Log +You can use `IPEX_XXX_LOG`, XXX represents the log level as mentioned above. There are four parameters defined for simple log: +- Log component, representing which part of Intel® Extension for PyTorch\* does this log belong to. +- Log sub component, input an empty string("") for general usages. For `SYNGRAPH` you can add any log sub componment. +- Log message template format string. +- Log name. +Below is an example for using simple log inside abs kernel: -## How to add log in IPEX -All the usage are inside file `utils/LogUtils.h`. IPEX Log support two types of log usage, the first one is simple log, you can use IPEX_XXX_LOG, XXX represents the log level, including six log level mentioned above. There are for params for simple log, the first one is log component, representing which part of IPEX does this log belongs to. The second on is log sub component, for most of the IPEX usage, just input a empty string("") here. For the third param, it is an log message template, you can use it as python format string, or you can also refer to fmt lib https://github.com/fmtlib/fmt. Here is an example for simple log, for add a log inside abs kernel: ``` c++ IPEX_INFO_LOG("OPS", "", "Add a log for inside ops {}", "abs"); ``` +### Event Log +Event log is used for recording a whole event, such as an operator calculation. The whole event is identified by an unique `event_id`. 
You can also mark each step by using `step_id`. Use `IPEX_XXX_EVENT_END()` to complete the logging of the whole event.

-For the second log level is event log, which is used for recording a whole event, such as a ops calculation. A whole event is identify through a event_id, for the whole event this event_id should be the same, but cannot be duplicate with other event_id, or it will met an undefined behaviour. You can also mark each step by using a step_id, there are no limitation for step_id. For the end of the whole event, should use IPEX_XXX_EVENT_END(), XXX is the log level mention aboved.
+Below is an example for using event log:

-Below is an example for ipex event log:
```c++
IPEX_EVENT_END("OPS", "", "record_avg_pool", "start", "Here record the time start with arg:{}", arg);
prepare_data();
IPEX_EVENT_END("OPS", "", "record_avg_pool", "data_prepare_finished", "Here record the data_prepare_finished with arg:{}", arg);
avg_pool();
@@ -58,23 +61,23 @@ IPEX_INFO_EVENT_END("OPS", "", "record_avg_pool", "finish conv", "Here record th
```
## Enviornment settings
-IPEX privide some enviornment setting used for log output settings, currently, we support below five settings.
+Intel® Extension for PyTorch\* provides five environment variables for configuring the log output:

-1. IPEX_LOGGING_LEVEL, accept int or string, default is 3 for `WARN`. Currently you can choose seven different log level within ipex, including -1 `DISABLED` under this setting, all the usage related with IPEX_LOGGING will be disabled. Another six log levels are we mentioned above.
-2. IPEX_LOG_COMPONENT, accept a string, sepreated by `/` first part is log component and the second part is for log sub component, used for state which component and sub log component you would like to log, default is "ALL". Currently we supoort 5 different log component,, `ALL` is for all the output in IPEX, the other four log component are we mentioned above. You could also specify several log component, sepreating using `,` such as "OPS;MEMORY".For log sub component, it still discussed with habana side, pls don't use it first.
-3. IPEX_LOG_OUTPUT, accept a string. If you are using IPEX_LOG_OUTPUT, than all the logs will log inside the file rather than log into the console, you can use it like export IPEX_LOG_OUTPUT="./ipex.log", all the log will log inside ipex.log in current work folder.
-4. IPEX_LOG_ROTATE_SIZE, accept a int, default =10. Only validate when export IPEX_LOG_OUTPUT, specifing how large file will be used when rotating this log, size is MB.
-5. IPEX_LOG_SPLIT_SIZE, accept a int, default = null. Only validate when export IPEX_LOG_OUTPUT, specifing how large file will be used when split this log, size is MB.
+- `IPEX_LOGGING_LEVEL`, accepts an integer or a string; the default is -1 (`DISABLED`).
+- `IPEX_LOG_COMPONENT`, accepts a string that specifies the log component and log sub component you would like to log; the default is "ALL". The log component and log sub component are separated by `/`. You can also specify several log components, such as "OPS;MEMORY".
+- `IPEX_LOG_OUTPUT`, accepts a string. When `IPEX_LOG_OUTPUT` is set, all the logs are recorded in the specified file instead of the console. Example: `export IPEX_LOG_OUTPUT="./ipex.log"`.
+- `IPEX_LOG_ROTATE_SIZE`, accepts an integer; the default is 10. Effective only together with `IPEX_LOG_OUTPUT`; specifies the file size (in MB) at which the log file is rotated.
+- `IPEX_LOG_SPLIT_SIZE`, accepts an integer; the default is null. Effective only together with `IPEX_LOG_OUTPUT`; specifies the file size (in MB) at which the log file is split.

## Usage in python
-1. 
torch.xpu.set_log_level(log_level) and torch.xpu.get_log_level(), these two functions are used for get and set the log level.
-2. torch.xpu.set_log_output_file_path(log_path) and torch.xpu.get_log_output_file_path(), these two functions are used for get and set the log output file path, once log output file path is set, logs will not be print on the console, will only output in the file.
-3. torch.xpu.set_log_rotate_file_size(file size) and torch.xpu.get_log_rotate_file_size(), these two functions are used for get and set the log rotate file size, only validate when output file path is set.
-4. torch.xpu.set_log_split_file_size(file size) and torch.xpu.get_log_split_file_size(), these two functions are used for get and set the log split file size, only validate when output file path is set.
-5. torch.xpu.set_log_component(log_component), and torch.xpu.get_log_component(), these two functions are used for get and set the log component, log component string are same with enviornment settings.
+- `torch.xpu.set_log_level(log_level)` and `torch.xpu.get_log_level()`: set and get the log level.
+- `torch.xpu.set_log_output_file_path(log_path)` and `torch.xpu.get_log_output_file_path()`: set and get the log output file path. Once the output file path is set, logs are recorded in the file only and are no longer printed to the console.
+- `torch.xpu.set_log_rotate_file_size(file_size)` and `torch.xpu.get_log_rotate_file_size()`: set and get the log rotate file size. Effective only when the output file path is set.
+- `torch.xpu.set_log_split_file_size(file_size)` and `torch.xpu.get_log_split_file_size()`: set and get the log split file size. Effective only when the output file path is set.
+- `torch.xpu.set_log_component(log_component)` and `torch.xpu.get_log_component()`: set and get the log component. The log component strings are the same as those defined in the environment settings.

-## Use IPEX log for simple trace
-For now, IPEX_SIMPLE_TRACE is depre deprecated, and pls use torch.xpu.set_log_level(0), it will show logs like previous IPEX_SIMPLE_TRACE.
+## Replace `IPEX_SIMPLE_TRACE`
+Use `torch.xpu.set_log_level(0)` to get the logs previously produced by `IPEX_SIMPLE_TRACE`.

-## Use IPEX log for verbose
-For now, IPEX_VERBOSE is deprecated, pls use torch.xpu.set_log_level(1), it will show logs like previous IPEX_VERBOSE.
\ No newline at end of file
+## Replace `IPEX_VERBOSE`
+Use `torch.xpu.set_log_level(1)` to get the logs previously produced by `IPEX_VERBOSE`.
diff --git a/docs/tutorials/features/profiler_kineto.md b/docs/tutorials/features/profiler_kineto.md
index 5e1755236..2358274ab 100644
--- a/docs/tutorials/features/profiler_kineto.md
+++ b/docs/tutorials/features/profiler_kineto.md
@@ -168,3 +168,13 @@ prof.export_chrome_trace("trace_file.json")
 You can examine the sequence of profiled operators, runtime functions and XPU kernels in these trace viewers. Here shows a trace result for ResNet50 run on XPU backend viewed by Perfetto viewer:
 
![profiler_kineto_result_perfetto_viewer](../../images/profiler_kineto/profiler_kineto_result_perfetto_viewer.png)
+
+## Known issues
+
+You may find that profiling information of XPU kernels and device memory operations cannot be collected because the tracers fail to be created when using the Kineto profiler based on oneTrace. 
If you meet such failures that any tracer or collector could not be successfully created, please try the following workaround. + +```bash +export ZE_ENABLE_TRACING_LAYER=1 +``` + +> Note that this environment variable should be set as global before running any user level applications. diff --git a/docs/tutorials/features/profiler_legacy.md b/docs/tutorials/features/profiler_legacy.md index 7d0ecbdd0..d60962619 100644 --- a/docs/tutorials/features/profiler_legacy.md +++ b/docs/tutorials/features/profiler_legacy.md @@ -1,93 +1,6 @@ -Legacy Profiler Tool (Prototype) +Legacy Profiler Tool (Deprecated) ================================ ## Introduction -The legacy profiler tool is an extension of PyTorch\* legacy profiler for profiling operators' overhead on XPU devices. With this tool, users can get the information in many fields of the run models or code scripts. User should build Intel® Extension for PyTorch\* with profiler support as default and enable this tool by a `with` statement before the code segment. - -## Use Case - -To use the legacy profiler tool, you need to build Intel® Extension for PyTorch\* from source or install it via prebuilt wheel. You also have various methods to disable this tool. - -### Build Tool - -The build option `BUILD_PROFILER` is switched on as default but you can switch it off via setting `BUILD_PROFILER=OFF` while building Intel® Extension for PyTorch\* from source. With `BUILD_PROFILER=OFF`, no profiler code will be compiled and all python scripts using profiler with XPU support will raise a runtime error to user. - -```bash -[BUILD_PROFILER=ON] python setup.py install # build from source with profiler tool -BUILD_PROFILER=OFF python setup.py install # build from source without profiler tool -``` - -### Use Tool - -In your model script, write `with` statement to enable the legacy profiler tool ahead of your code snippets, as shown in the following example: - -```python -# import all necessary libraries -import torch -import intel_extension_for_pytorch - -# these lines won't be profiled before enabling profiler tool -input_tensor = torch.randn(1024, dtype=torch.float32, device='xpu:0') - -# enable legacy profiler tool with a `with` statement -with torch.autograd.profiler_legacy.profile(use_xpu=True) as prof: - # do what you want to profile here after the `with` statement with proper indent - output_tensor_1 = torch.nonzero(input_tensor) - output_tensor_2 = torch.unique(input_tensor) - -# print the result table formatted by the legacy profiler tool as your wish -print(prof.key_averages().table()) -``` - -There are a number of useful parameters defined in `torch.autograd.profiler_legacy.profile()`. Many of them are aligned with usages defined in PyTorch\*'s official profiler, such as `record_shapes`, a very useful parameter to control whether to record the shape of input tensors for each operator. To enable legacy profiler on XPU devices, pass `use_xpu=True`. For the usage of more parameters, please refer to [PyTorch\*'s tutorial page](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). 
- -### Disable Tool in Model Script - -To disable the legacy profiler tool temporarily in your model script, pass `enabled=False` to `torch.autograd.profiler_legacy.profile()`: - -```python -with torch.autograd.profiler_legacy.profile(enabled=False, use_xpu=True) as prof: - # as `enabled` is set to false, the profiler won't work on these lines of code - output_tensor_1 = torch.nonzero(input_tensor) - output_tensor_2 = torch.unique(input_tensor) - -# This print will raise an error to user as the profiler was disabled -print(prof.key_averages().table()) -``` - -### Results - -Using the script shown above in **Use Tool** part, you'll see the result table printed out to the console as below: - -![Legacy_profiler_result_1](../../images/profiler_legacy/Legacy_profiler_result_1.png) - -In this result, you can find several fields like: - -- `Name`: the name of run operators -- `Self CPU %`, `Self CPU`: the time consumed by the operator itself at host excluded its children operator call. The column marked with percentage sign shows the propotion of time to total self cpu time. While an operator calls more than once in a run, the self cpu time may increase in this field. -- `CPU total %`, `CPU total`: the time consumed by the operator at host included its children operator call. The column marked with percentasge sign shows the propotion of time to total cpu time. While an operator calls more than once in a run, the cpu time may increase in this field. -- `CPU time avg`: the average time consumed by each once call of the operator at host. This average is calculated on the cpu total time. -- `Self XPU`, `Self XPU %`: similar to `Self CPU (%)` but shows the time consumption on XPU devices. -- `XPU total`: similar to `CPU total` but shows the time consumption on XPU devices. -- `XPU time avg`: similar to `CPU time avg` but shows average time sonsumption on XPU devices. This average is calculated on the XPU total time. -- `# of Calls`: number of call for each operators in a run. - -You can print result table in different styles, such as sort all called operators in reverse order via `print(prof.table(sort_by='id'))` like: - -![Legacy_profiler_result_2](../../images/profiler_legacy/Legacy_profiler_result_2.png) - -### Export to Chrome Trace - -You can export the result to a json file and then load it in the Chrome trace viewer (`chrome://tracing`) by add this line in your model script: - -```python -prof.export_chrome_trace("trace_file.json") -``` - -In Chrome trace viewer, you may find the result shows like: - -![Legacy_profiler_result_3](../../images/profiler_legacy/Legacy_profiler_result_3.png) - -For more example results, please refer to [PyTorch\*'s tutorial page](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). - +The legacy profiler tool will be deprecated from Intel® Extension for PyTorch* very soon. Please use [Kineto Supported Profiler Tool](./profiler_kineto.md) instead for profiling operators' executing time cost on Intel® GPU devices. 
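For readers migrating from the removed legacy examples above, here is a minimal sketch of a roughly equivalent run with the Kineto-based profiler. It mirrors the tensors from the old example and assumes the XPU activity is exposed as `torch.profiler.ProfilerActivity.XPU`, as described in the Kineto profiler tutorial.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the torch.xpu backend

input_tensor = torch.randn(1024, dtype=torch.float32, device="xpu:0")

# Kineto-based replacement for torch.autograd.profiler_legacy.profile(use_xpu=True)
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.XPU,  # assumed activity name for Intel GPU devices
    ],
    record_shapes=True,
) as prof:
    output_tensor_1 = torch.nonzero(input_tensor)
    output_tensor_2 = torch.unique(input_tensor)

print(prof.key_averages().table())
prof.export_chrome_trace("trace_file.json")  # view in chrome://tracing or Perfetto
```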
diff --git a/docs/tutorials/features/torch_compile_gpu.md b/docs/tutorials/features/torch_compile_gpu.md index 36bd5edb2..248bcaa4e 100644 --- a/docs/tutorials/features/torch_compile_gpu.md +++ b/docs/tutorials/features/torch_compile_gpu.md @@ -10,9 +10,9 @@ Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph c ## Required Dependencies **Verified version**: -- `torch` : > v2.1.0 +- `torch` : v2.1.0 - `intel_extension_for_pytorch` : > v2.1.10 -- `triton` : > [v2.1.0](https://github.com/intel/intel-xpu-backend-for-triton/releases/tag/v2.1.0) with Intel® XPU Backend for Triton* backend enabled. +- `triton` : [v2.1.0](https://github.com/intel/intel-xpu-backend-for-triton/releases/tag/v2.1.0) with Intel® XPU Backend for Triton* backend enabled. Follow [Intel® Extension for PyTorch\* Installation](https://intel.github.io/intel-extension-for-pytorch/xpu/2.1.30+xpu/tutorials/installation.html) to install `torch` and `intel_extension_for_pytorch` firstly. diff --git a/docs/tutorials/known_issues.md b/docs/tutorials/known_issues.md index d308a03a1..0e04456c2 100644 --- a/docs/tutorials/known_issues.md +++ b/docs/tutorials/known_issues.md @@ -1,12 +1,10 @@ Troubleshooting =============== -## GPU-specific Issues - -### General Usage +## General Usage - **Problem**: FP64 data type is unsupported on current platform. - - **Cause**: FP64 is not natively supported by the [Intel® Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/data-center-gpu/flex-series/overview.html) platform. + - **Cause**: FP64 is not natively supported by the [Intel® Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/data-center-gpu/flex-series/overview.html) and [Intel® Arc™ A-Series Graphics](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html) platforms. If you run any AI workload on that platform and receive this error message, it means a kernel requires FP64 instructions that are not supported and the execution is stopped. - **Problem**: Runtime error `invalid device pointer` if `import horovod.torch as hvd` before `import intel_extension_for_pytorch` - **Cause**: Intel® Optimization for Horovod\* uses utilities provided by Intel® Extension for PyTorch\*. The improper import order causes Intel® Extension for PyTorch\* to be unloaded before Intel® @@ -25,9 +23,9 @@ Troubleshooting - **Solution**: Pass `export GLIBCXX_USE_CXX11_ABI=1` and compile PyTorch\* with particular compiler which supports `_GLIBCXX_USE_CXX11_ABI=1`. We recommend using prebuilt wheels in [download server](https:// developer.intel.com/ipex-whl-stable-xpu) to avoid this issue. - **Problem**: Bad termination after AI model execution finishes when using Intel MPI. - - **Cause**: This is a random issue when the AI model (e.g. RN50 training) execution finishes in an Intel MPI environment. It is not user-friendly as the model execution ends ungracefully. + - **Cause**: This is a random issue when the AI model (e.g. RN50 training) execution finishes in an Intel MPI environment. It is not user-friendly as the model execution ends ungracefully. It has been fixed in PyTorch* 2.3 ([#116312](https://github.com/pytorch/pytorch/commit/f657b2b1f8f35aa6ee199c4690d38a2b460387ae)). - **Solution**: Add `dist.destroy_process_group()` during the cleanup stage in the model script, as described - in [Getting Started with Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). 
+ in [Getting Started with Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html), before Intel® Extension for PyTorch* supports PyTorch* 2.3. - **Problem**: `-997 runtime error` when running some AI models on Intel® Arc™ A-Series GPUs. - **Cause**: Some of the `-997 runtime error` are actually out-of-memory errors. As Intel® Arc™ A-Series GPUs have less device memory than Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series, running some AI models on them may trigger out-of-memory errors and cause them to report failure such as `-997 runtime error` most likely. This is expected. Memory usage optimization is a work in progress to allow Intel® Arc™ A-Series GPUs to support more AI models. @@ -38,22 +36,14 @@ Troubleshooting - **Problem**: Some workloads terminate with an error `CL_DEVICE_NOT_FOUND` after some time on WSL2. - **Cause**: This issue is due to the [TDR feature](https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys#tdrdelay) on Windows. - **Solution**: Try increasing TDRDelay in your Windows Registry to a large value, such as 20 (it is 2 seconds, by default), and reboot. -- **Problem**: Runtime error `Unable to find TSan function` might be raised when running some CPU AI workloads in certain scenarios. - - **Cause**: This issue is probably caused by the compatibility issue of OMP tool libraries. - - **Solution**: Please try the workaround: disable OMP tool libraries by `export OMP_TOOL="disabled"`, to unblock your workload. We are working on the final solution and will release it as soon as possible. -- **Problem**: The profiled data on GPU operators using legacy profiler is not accurate sometimes. - - **Cause**: Compiler in 2024.0 oneAPI basekit optimizes barrier implementation which brings negative impact on legacy profiler. - - **Solution**: Use Kineto profiler instead. Or use legacy profiler with `export UR_L0_IN_ORDER_BARRIER_BY_SIGNAL=0` to workaround this issue. - **Problem**: Random bad termination after AI model convergence test (>24 hours) finishes. - **Cause**: This is a random issue when some AI model convergence test execution finishes. It is not user-friendly as the model execution ends ungracefully. - **Solution**: Kill the process after the convergence test finished, or use checkpoints to divide the convergence test into several phases and execute separately. -- **Problem**: Random GPU hang issue when executing the first allreduce in LLM inference workloads on 1 Intel® Data Center GPU Max 1550 card. - - **Cause**: Race condition happens between oneDNN kernels and oneCCL Bindings for Pytorch\* allreduce primitive. - - **Solution**: Use `TORCH_LLM_ALLREDUCE=0` to workaround this issue. -- **Problem**: GPU hang issue when executing LLM inference workloads on multi Intel® Data Center GPU Max series cards over PCIe communication. - - **Cause**: oneCCL Bindings for Pytorch\* allreduce primitive does not support PCIe for cross-cards communication. - - **Solution**: Enable XeLink for cross-cards communication, or use `TORCH_LLM_ALLREDUCE=0` for the PCIe only environments. -### Library Dependencies +- **Problem**: Random instability issues such as page fault or atomic access violation when executing LLM inference workloads on Intel® Data Center GPU Max series cards. + - **Cause**: This issue is reported on LTS driver [803.29](https://dgpu-docs.intel.com/releases/LTS_803.29_20240131.html). The root cause is under investigation. 
+ - **Solution**: Use active rolling stable release driver [775.20](https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html) or latest driver version to workaround. + +## Library Dependencies - **Problem**: Cannot find oneMKL library when building Intel® Extension for PyTorch\* without oneMKL. @@ -97,17 +87,10 @@ Troubleshooting If you continue seeing similar issues for other shared object files, add the corresponding files under `${MKL_DPCPP_ROOT}/lib/intel64/` by `LD_PRELOAD`. Note that the suffix of the libraries may change (e.g. from .1 to .2), if more than one oneMKL library is installed on the system. -### Unit Test +## Unit Test - Unit test failures on Intel® Data Center GPU Flex Series 170 The following unit test fails on Intel® Data Center GPU Flex Series 170 but the same test case passes on Intel® Data Center GPU Max Series. The root cause of the failure is under investigation. - `test_weight_norm.py::TestNNMethod::test_weight_norm_differnt_type` - The following unit tests fail in Windows environment on Intel® Arc™ A770 Graphic card. The root cause of the failures is under investigation. - - `test_foreach.py::TestTorchMethod::test_foreach_cos` - - `test_foreach.py::TestTorchMethod::test_foreach_sin` - - `test_polar.py::TestTorchMethod::test_polar_float` - - `test_special_ops.py::TestTorchMethod::test_special_spherical_bessel_j0` - - `test_transducer_loss.py::TestNNMethod::test_vallina_transducer_loss` - diff --git a/docs/tutorials/llm.rst b/docs/tutorials/llm.rst index 82621ab9f..a13deac7a 100644 --- a/docs/tutorials/llm.rst +++ b/docs/tutorials/llm.rst @@ -123,9 +123,7 @@ Large Language Models (LLMs) have shown remarkable performance in various natura However, deploying them on devices with limited resources is challenging due to their high computational and memory requirements. -To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. We focus on weight-only quantization (WOQ), which only quantizes the weights statically. -WOQ is a better trade-off between efficiency and accuracy, as we will demonstrate that the main bottleneck of deploying LLMs is the memory bandwidth and WOQ usually preserves more accuracy. -Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality. +To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. Unlike `normal quantization `_, such as w8a8, that quantizes both weights and activations, we focus on Weight-Only Quantization (WOQ), which only quantizes the weights statically. WOQ is a better trade-off between efficiency and accuracy, as the main bottleneck of deploying LLMs is the memory bandwidth and WOQ usually preserves more accuracy. Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality. For more detailed information, check `WOQ INT4 `_. 
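To make the memory-bandwidth argument concrete, here is a rough back-of-envelope sketch comparing the per-token weight traffic of a 7B-parameter model in FP16 and INT4; the numbers are illustrative estimates only and ignore the small overhead of scales and zero points.

```python
# Back-of-envelope estimate: during auto-regressive decoding, essentially all weights
# are streamed from memory once per generated token, so weight size approximates the
# per-token memory traffic that bounds latency.
params = 7e9  # e.g. a 7B-parameter model such as Qwen-7B

fp16_gb = params * 2 / 1e9    # 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per weight (scales/zero points ignored)

print(f"FP16 weights: ~{fp16_gb:.1f} GB streamed per token")
print(f"INT4 weights: ~{int4_gb:.1f} GB streamed per token")
print(f"Reduction:    ~{fp16_gb / int4_gb:.0f}x less weight traffic")
```

Because the decode phase is dominated by this weight streaming, cutting the weight bytes roughly translates into a proportional latency gain, which is why WOQ is attractive even when activations stay in FP16.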
diff --git a/docs/tutorials/llm/int4_weight_only_quantization.md b/docs/tutorials/llm/int4_weight_only_quantization.md index 8658e8af3..dacd1b9fe 100644 --- a/docs/tutorials/llm/int4_weight_only_quantization.md +++ b/docs/tutorials/llm/int4_weight_only_quantization.md @@ -1,4 +1,4 @@ -Weight-Only Quantization (WOQ) +Weight-Only Quantization (Prototype) ===== ## Introduction diff --git a/docs/tutorials/releases.md b/docs/tutorials/releases.md index b2a9c0f02..bbc9383ee 100644 --- a/docs/tutorials/releases.md +++ b/docs/tutorials/releases.md @@ -1,6 +1,24 @@ Releases ============= +## 2.1.30+xpu + +Intel® Extension for PyTorch\* v2.1.30+xpu is an update release which supports Intel® GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics) based on PyTorch\* 2.1.0. + +### Highlights + +- Intel® oneDNN v3.4.1 integration +- Intel® oneAPI Base Toolkit 2024.1 compatibility +- Large Language Model (LLM) optimizations for FP16 inference on Intel® Data Center GPU Max Series (Beta): Intel® Extension for PyTorch* provides a lot of specific optimizations for LLM workloads in this release on Intel® Data Center GPU Max Series. In operator level, we provide highly efficient GEMM kernel to speed up Linear layer and customized fused operators to reduce HBM access/kernel launch overhead. To reduce memory footprint, we define a segment KV Cache policy to save device memory and improve the throughput. Such optimizations are added in this release to enhance existing optimized LLM FP16 models and more Chinese LLM models such as Baichuan2-13B, ChatGLM3-6B and Qwen-7B. + +- LLM optimizations for INT4 inference on Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics (Prototype): Intel® Extension for PyTorch* shows remarkable performance when executing LLM models on Intel® GPU. However, deploying such models on GPUs with limited resources is challenging due to their high computational and memory requirements. To achieve a better trade-off, a low-precision solution, e.g., weight-only-quantization for INT4 is enabled to allow Llama 2-7B, GPT-J-6B and Qwen-7B to be executed efficiently on Intel® Arc™ A-Series Graphics. The same optimization makes INT4 models achieve 1.5x speeded up in total latency performance compared with FP16 models with the same configuration and parameters on Intel® Data Center GPU Max Series. + +- Opt-in collective performance optimization with oneCCL Bindings for Pytorch*: This opt-in feature can be enabled by setting `TORCH_LLM_ALLREDUCE=1` to provide better scale-up performance by enabling optimized collectives such as `allreduce`, `allgather`, `reducescatter` algorithms in Intel® oneCCL. This feature requires XeLink enabled for cross-cards communication. + +### Known Issues + +Please refer to [Known Issues webpage](./known_issues.md). + ## 2.1.20+xpu Intel® Extension for PyTorch\* v2.1.20+xpu is a minor release which supports Intel® GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics) based on PyTorch\* 2.1.0. diff --git a/docs/tutorials/technical_details.rst b/docs/tutorials/technical_details.rst index 07c1474c9..87e3476a3 100644 --- a/docs/tutorials/technical_details.rst +++ b/docs/tutorials/technical_details.rst @@ -46,3 +46,27 @@ For more detailed information, check `Memory Management `_. + +.. 
toctree:: + :hidden: + :maxdepth: 1 + + technical_details/ipex_optimize diff --git a/docs/tutorials/technical_details/ipex_optimize.md b/docs/tutorials/technical_details/ipex_optimize.md new file mode 100644 index 000000000..1e98f3bad --- /dev/null +++ b/docs/tutorials/technical_details/ipex_optimize.md @@ -0,0 +1,47 @@ +`ipex.optimize` Frontend API +====================================== + +The `ipex.optimize` API is designed to optimize PyTorch\* modules (`nn.modules`) and specific optimizers within Python modules. Its optimization options for Intel® GPU device include: + +- Automatic Channels Last +- Fusing Convolutional Layers with Batch Normalization +- Fusing Linear Layers with Batch Normalization +- Replacing Dropout with Identity +- Splitting Master Weights +- Fusing Optimizer Update Step + +The original python modules will be replaced to optimized versions automatically during model execution, if `ipex.optimize` is called in the model running script. + +The following sections provide detailed descriptions for each optimization flag supported by **XPU** models on Intel® GPU. For CPU-specific flags, please refer to the [API Docs page](../api_doc.html#ipex.optimize). + +### Automatic Channels Last + +By default, `ipex.optimize` checks if current running GPU platform supports 2D Block Array Load or not. If it does, the `Conv*d` and `ConvTranspose*d` modules inside the model will be optimized for using channels last memory format. Use `ipex.enable_auto_channels_last` or `ipex.disable_auto_channels_last` before calling `ipex.optimize` to enable or disable this feature manually. + +### `conv_bn_folding` + +This flag is applicable for model inference. Intel® Extension for PyTorch\* tries to match all connected `nn.Conv(1/2/3)d` and `nn.BatchNorm(1/2/3)d` layers with matching dimensions in the model and fuses them to improve performance. If the fusion fails, the optimization process will be ended and the model will be executed automatically in normal path. + +### `linear_bn_folding` + +This flag is applicable for model inference. Intel® Extension for PyTorch\* tries to match all connected `nn.Linear` and `nn.BatchNorm(1/2/3)d` layers in the model and fuse them to improve performance. If the fusion fails, the optimization process will be ended and the model will be executed automatically in normal path. + +### `replace_dropout_with_identity` + +This flag is applicable for model inference. All instances of `torch.nn.Dropout` will be replaced with `torch.nn.Identity`. The `Identity` modules will be ignored during the static graph generation. This optimization could potentially create additional fusion opportunities for the generated graph. + +### `split_master_weight_for_bf16` + +This flag is applicable for model training. The optimization will be enabled once the following requirements are met: +- When calling `ipex.optimize`, the `dtype` flag must be set to `torch.bfloat16`. +- `fuse_update_step` must be enabled. + +The optimization process is as follows: + +- Wrap all parameters of this model with `ParameterWrapper`. +- Convert the parameters that meet the condition specified by `ipex.nn.utils._parameter_wrapper.can_cast_training`. This includes the original dtype `torch.float`, and module types defined in `ipex.nn.utils._parameter_wrapper.IPEX_WEIGHT_CONVERT_MODULE_XPU`. +- Convert the parameters wrapped by `ParameterWrapper` to the user-specified dtype. If **split master weight** is needed, the optimizer can only be SGD. The original parameters will be divided into top and bottom parts. 
The top part will be used for forward and backward computation. When updating weights, both the top and bottom parts will be updated simultaneously. + +### fuse_update_step + +This flag is used to specify whether to replace the original optimizer step with a fused step for better performance. The supported optimizers can be referenced from `IPEX_FUSED_OPTIMIZER_LIST_XPU` in `ipex.optim._optimizer_utils`. During the optimization, the original step is saved as `optimizer._original_step`, `optimizer.step` is replaced with a SYCL-written kernel, and the `optimizer.fused` parameter is set to `True`. diff --git a/examples/gpu/inference/python/llm/README.md b/examples/gpu/inference/python/llm/README.md index b84759c9d..20b18d784 100644 --- a/examples/gpu/inference/python/llm/README.md +++ b/examples/gpu/inference/python/llm/README.md @@ -7,9 +7,10 @@ Here you can find the inference benchmarking scripts for large language models ( - Cover model generation inference with low precision cases for different models with best performance and accuracy (fp16 AMP and weight only quantization) - ## Optimized Models +Currently, only support Transformers 4.31.0. Support for newer versions of Transformers and more models will be available in the future. + | MODEL FAMILY | Verified < MODEL ID > (Huggingface hub)| FP16 | Weight only quantization INT4 | Optimized on Intel® Data Center GPU Max Series (1550/1100) | Optimized on Intel® Arc™ A-Series Graphics (A770) | |---|:---:|:---:|:---:|:---:|:---:| |Llama 2| "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-13b-hf", "meta-llama/Llama-2-70b-hf" | ✅ | ✅|✅ | ✅| @@ -23,13 +24,11 @@ Here you can find the inference benchmarking scripts for large language models ( ## Supported Platforms -\* Intel® Data Center GPU Max Series (1550/1100) and Optimized on Intel® Arc™ A-Series Graphics (A770) : support all the models in the model list above
+\* Intel® Data Center GPU Max Series (1550/1100) and Intel® Arc™ A-Series Graphics (A770): support all the models in the model list above.
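As a quick orientation before the setup steps below, here is a minimal FP16 sketch for one of the verified models. It assumes an already configured XPU environment, Transformers 4.31.0, and the `ipex.optimize_transformers` frontend described in the LLM feature documentation; it is not a substitute for the benchmark scripts in this directory.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # one of the verified models listed above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("xpu")

# Apply the LLM-specific optimizations (signature assumed from the LLM feature documentation).
model = ipex.optimize_transformers(model, dtype=torch.float16, device="xpu")

inputs = tokenizer("What is weight-only quantization?", return_tensors="pt").to("xpu")
with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```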
## Environment Setup -Note: The instructions in this section will setup an environment with a latest source build of IPEX on `release/xpu/2.1.30` branch. - ### [Recommended] Docker-based environment setup with prebuilt wheels @@ -120,7 +119,7 @@ where
-## Run Models Generation +## Run Models | Benchmark mode | FP16 | Weight only quantization INT4 | |---|:---:|:---:| @@ -128,11 +127,11 @@ where
| Distributed (autotp) | ✅ | ❎ |


-Note: During usage, you may need to log in to your Hugging Face account to access model files. Refer to [HuggingFace Login](https://huggingface.co/docs/huggingface_hub/quick-start#login)
+Note: During the execution, you may need to log in to your Hugging Face account to access model files. Refer to [HuggingFace Login](https://huggingface.co/docs/huggingface_hub/quick-start#login).

### Run with Bash Script

-For all inference cases, can run LLM with the one-click bash script `run_benchmark.sh`:
+Run all inference cases with the one-click bash script `run_benchmark.sh`:
```
bash run_benchmark.sh
```
@@ -188,7 +187,7 @@ LLM_ACC_TEST=1 mpirun -np 2 --prepend-rank python -u run_generation_with_deepspe
```

-### Weight only quantization with low precision checkpoint (Prototype)
+### Weight Only Quantization with Low Precision Checkpoint (Prototype)

Using INT4 weights can further improve performance by reducing memory bandwidth. However, direct per-channel quantization of weights to INT4 may result in poor accuracy. Some algorithms can modify weights through calibration before quantizing weights to minimize the accuracy drop. You may generate modified weights and quantization info (scales, zero points) for Llama 2/GPT-J/Qwen models with a dataset for specified tasks by such algorithms. We recommend Intel® Extension for Transformers\* to quantize the LLM model.
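To make "quantization info (scales, zero points)" concrete, the snippet below is a purely illustrative sketch of asymmetric per-group INT4 quantization of a weight tensor. It is not the checkpoint format produced by the recommended quantization tools; it only shows numerically what the scales and zero points represent.

```python
import torch

def quantize_int4_per_group(weight: torch.Tensor, group_size: int = 128):
    """Illustrative asymmetric per-group INT4 quantization: returns codes, scales, zero points."""
    out_features, in_features = weight.shape
    grouped = weight.reshape(out_features, in_features // group_size, group_size)
    w_min = grouped.amin(dim=-1, keepdim=True)
    w_max = grouped.amax(dim=-1, keepdim=True)
    scales = (w_max - w_min).clamp(min=1e-8) / 15.0            # 16 representable 4-bit levels
    zero_points = torch.round(-w_min / scales)
    codes = torch.clamp(torch.round(grouped / scales + zero_points), 0, 15).to(torch.uint8)
    return codes, scales, zero_points

weight = torch.randn(4096, 4096)
codes, scales, zero_points = quantize_int4_per_group(weight)

# Dequantize to check the reconstruction error introduced by 4-bit weights.
dequant = ((codes.float() - zero_points) * scales).reshape(weight.shape)
print("max abs error:", (dequant - weight).abs().max().item())
```

Calibration-based algorithms such as the ones recommended above adjust the weights before this rounding step so that the reconstruction error matters less for the model's outputs.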