update doc (#34478)
* update doc

* Update docs/source/en/perf_train_cpu.md

Co-authored-by: Steven Liu <[email protected]>

* delete closing tip

---------

Co-authored-by: Steven Liu <[email protected]>
jiqing-feng and stevhliu authored Oct 31, 2024
1 parent df8640c commit 2801d7b
Showing 2 changed files with 17 additions and 31 deletions.
14 changes: 7 additions & 7 deletions docs/source/en/perf_train_cpu.md
This guide focuses on training large models efficiently on CPU.

## Mixed precision with IPEX
Mixed precision uses single (fp32) and half-precision (bf16/fp16) data types in a model to accelerate training or inference while still preserving much of the single-precision accuracy. Modern CPUs such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors natively support bf16, and 6th Gen Intel® Xeon® Scalable processors natively support both bf16 and fp16, so you should get more performance out of the box by enabling mixed precision training with bf16 or fp16.

To further maximize training performance, you can use Intel® Extension for PyTorch (IPEX), a library built on PyTorch that adds extra CPU instruction set architecture (ISA) support such as Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX512-VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) for an additional performance boost on Intel CPUs. However, CPUs with only AVX2 (e.g., AMD or older Intel CPUs) are not guaranteed to perform better under IPEX.

Auto Mixed Precision (AMP) for CPU backends has been enabled since PyTorch 1.10. AMP support for bf16/fp16 on CPUs and bf16/fp16 operator optimization are also supported in IPEX and partially upstreamed to the main PyTorch branch. You can get better performance and user experience with IPEX AMP.

See [Auto Mixed Precision](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/amp.html) for more detailed information.
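
To make this concrete, the snippet below is a minimal sketch of bf16 auto mixed precision on CPU with IPEX. The checkpoint name and input text are placeholders for illustration, not part of the original guide.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; any PyTorch model is handled the same way.
model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# ipex.optimize applies CPU-specific operator and memory-layout optimizations;
# dtype=torch.bfloat16 prepares eligible weights so bf16 kernels are used.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("IPEX mixed precision on CPU", return_tensors="pt")
# Autocast runs supported ops in bf16 and keeps precision-sensitive ops in fp32.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    outputs = model(**inputs)
```

For training, `ipex.optimize` can also take the optimizer, e.g. `model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)`.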

IPEX releases follow PyTorch. To install it via pip:

| PyTorch Version | IPEX version |
| :---------------: | :----------: |
| 2.5.0 | 2.5.0+cpu |
| 2.4.0 | 2.4.0+cpu |
| 2.3.0 | 2.3.0+cpu |
| 2.2.0 | 2.2.0+cpu |

Please run `pip list | grep torch` to get your `pytorch_version`, so you can pick the matching IPEX `version_name`.
```bash
pip install intel_extension_for_pytorch=={version_name} -f https://developer.intel.com/ipex-whl-stable-cpu
```
You can check the latest versions in [ipex-whl-stable-cpu](https://developer.intel.com/ipex-whl-stable-cpu).
See [IPEX installation](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/installation.html) for more installation approaches.

### Usage in Trainer
To enable auto mixed precision with IPEX in Trainer, users should add `use_ipex`, `bf16` or `fp16`, and `no_cuda` in the training command arguments.

Take the [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) example as a use case.
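
As a rough sketch of how these switches map onto `TrainingArguments` in code (the output directory and other values below are placeholders, not taken from the example script):

```python
from transformers import TrainingArguments

# Only use_ipex, bf16 (or fp16), and no_cuda are the point here; the rest are placeholders.
training_args = TrainingArguments(
    output_dir="/tmp/debug_squad/",
    use_ipex=True,  # wrap the model (and optimizer) with IPEX optimizations
    bf16=True,      # enable bf16 auto mixed precision; use fp16=True instead where fp16 is preferred
    no_cuda=True,   # force CPU training
)
```

In the example scripts, the same fields are exposed as the command-line flags `--use_ipex`, `--bf16` (or `--fp16`), and `--no_cuda`.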

34 changes: 10 additions & 24 deletions docs/source/en/perf_train_cpu_many.md
See [oneccl_bind_pt](https://github.com/intel/torch-ccl) for more detailed information.

Wheel files are available for the following Python versions:

| Extension Version | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Python 3.11 |
| :---------------: | :--------: | :--------: | :--------: | :---------: | :---------: |
| 2.5.0 | |||||
| 2.4.0 | |||||
| 2.3.0 | |||||
| 2.2.0 | |||||

Please run `pip list | grep torch` to get your `pytorch_version`.
```bash
pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu
```
where `{pytorch_version}` should be your PyTorch version, for instance 2.4.0.
See [oneccl_bind_pt installation](https://github.com/intel/torch-ccl) for more installation approaches.
Versions of oneCCL and PyTorch must match.

<Tip warning={true}>

The oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0); PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100.

</Tip>

## Intel® MPI library
Use this standards-based MPI implementation to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. This component is part of the Intel® oneAPI HPC Toolkit.

oneccl_bindings_for_pytorch is installed along with the MPI tool set, and the environment needs to be sourced before using it.

For Intel® oneCCL >= 1.12.0:
```bash
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
```

For Intel® oneCCL versions < 1.12.0:
```bash
torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))")
source $torch_ccl_path/env/setvars.sh
```
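
For illustration, the following is a minimal, hedged sketch of what the sourced environment is ultimately used for: bringing up a `torch.distributed` process group on the oneCCL (`ccl`) backend. The `MASTER_ADDR`/`MASTER_PORT` fallbacks and the `PMI_*` variables (exported by Intel MPI's `mpirun`) are assumptions for a single-node illustration.

```python
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  importing this registers the "ccl" backend

# Under mpirun, Intel MPI exports PMI_RANK/PMI_SIZE; the values below are only local fallbacks.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("PMI_RANK", 0))
world_size = int(os.environ.get("PMI_SIZE", 1))

dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized with oneCCL")
dist.destroy_process_group()
```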

#### Intel® Extension for PyTorch installation

Intel Extension for PyTorch (IPEX) provides performance optimizations for CPU training with both Float32 and BFloat16 (refer to the [single CPU section](./perf_train_cpu) to learn more).
The snippet below is an example of a Dockerfile that uses a base image that supports distributed CPU training and then
extracts a Transformers release to the `/workspace` directory, so that the example scripts are included in the image:
```dockerfile
FROM intel/intel-optimized-pytorch:2.4.0-pip-multinode

RUN apt-get update -y && \
apt-get install -y --no-install-recommends --fix-missing \
WORKDIR /workspace

# Download and extract the transformers code
ARG HF_TRANSFORMERS_VER="4.46.0"
RUN pip install --no-cache-dir \
transformers==${HF_TRANSFORMERS_VER} && \
mkdir transformers && \
```

This guide covered running distributed PyTorch training jobs using multiple CPUs on bare metal and on a Kubernetes
cluster. Both cases utilize Intel Extension for PyTorch and Intel oneCCL Bindings for PyTorch for optimal training
performance, and can be used as a template to run your own workload on multiple nodes.
