The 4th Generation Intel® Xeon® Scalable Processor (Sapphire Rapids) supports a new hardware feature: Intel® Advanced Matrix Extensions (AMX), which accelerates deep learning inference using the INT8/BF16 data types.
AMX accelerates INT8 models better than VNNI (AVX-512 Vector Neural Network Instructions, supported by older Xeon® processors): in theory, it delivers up to 8 times the performance of VNNI.
Intel® Neural Compressor helps quantize the FP32 model to INT8 while controlling the accuracy loss as expected.
This example shows the whole pipeline:

- Train a VGG19 image classification model by transfer learning, based on a pre-trained TensorFlow Hub model (a training sketch follows this list).
- Quantize the FP32 Keras model to an INT8 PB model with Intel® Neural Compressor.
- Test and compare the performance of the FP32 and INT8 models.
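For reference, here is a minimal sketch of the transfer-learning step. It loads the ibean dataset (published on TensorFlow Datasets as `beans`) and stacks a small classification head on a frozen VGG19 backbone. Note that the actual train_model.py uses a TensorFlow Hub pre-trained model, so the `tf.keras.applications` backbone, the head sizes, and the epoch count below are illustrative assumptions rather than the exact training recipe.

```python
# Sketch only: transfer learning for the image classifier (see assumptions above).
import tensorflow as tf
import tensorflow_datasets as tfds

IMG_SIZE = (224, 224)   # VGG19 default input resolution
NUM_CLASSES = 3         # the beans/ibean dataset has 3 classes

def preprocess(image, label):
    image = tf.image.resize(image, IMG_SIZE)
    image = tf.keras.applications.vgg19.preprocess_input(image)
    return image, label

train_ds = (tfds.load("beans", split="train", as_supervised=True)
            .map(preprocess).batch(32).prefetch(tf.data.AUTOTUNE))

# Frozen VGG19 backbone plus a small trainable classification head.
backbone = tf.keras.applications.VGG19(include_top=False, weights="imagenet",
                                       input_shape=IMG_SIZE + (3,), pooling="avg")
backbone.trainable = False

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
model.save("model_keras.fp32")   # FP32 Keras model used by the quantization step
```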
This example can be executed on any Intel® CPU that supports VNNI or AMX; the performance improvement is larger on Intel® CPUs with AMX.
To learn more about Intel® Neural Compressor, please refer to the official website for detailed info and news: https://github.com/intel/neural-compressor
We will learn how to accelerate AI inference with the following Intel AI technologies:

- Intel® Advanced Matrix Extensions (AMX)
- Intel® Deep Learning Boost
- Intel® Neural Compressor
- Intel® Optimization for TensorFlow*
SPR supports the AMX-INT8 and AMX-BF16 instructions, which accelerate inference of INT8 and BF16 layers.
Intel® Neural Compressor provides a special function for SPR: when quantization is executed on SPR, it automatically converts the FP32 layers that cannot be quantized to BF16, following the rules of the AI framework.
This accelerates the model on SPR as much as possible while controlling the accuracy loss as expected.
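To confirm whether the CPU you are running on actually exposes these instructions, you can check its flags. The snippet below is a minimal Linux-only sketch (it assumes /proc/cpuinfo is available); amx_bf16, amx_int8 and avx512_vnni are the flag names reported by recent Linux kernels.

```python
# Sketch: report which relevant ISA extensions the current (Linux) CPU advertises.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx512_vnni", "amx_int8", "amx_bf16"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```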
How to enable it?
- Install a release of Intel® Optimization for TensorFlow*/Intel® Extension for TensorFlow* that supports this feature.
Note: the current public release does not support it yet.
- Execute the quantization process by calling the Intel® Neural Compressor API on SPR (see the sketch below).
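The snippet below is a minimal sketch of that quantization step using the Intel® Neural Compressor 2.x post-training quantization API (PostTrainingQuantConfig and quantization.fit). The calibration dataloader, the placeholder eval_func, and the 1% accuracy criterion are illustrative assumptions, not the exact configuration used by the example scripts.

```python
# Sketch: post-training quantization of the FP32 Keras model with Intel Neural Compressor 2.x.
import tensorflow as tf
import tensorflow_datasets as tfds
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig, AccuracyCriterion
from neural_compressor.data import DataLoader

IMG_SIZE = (224, 224)

def calib_samples(limit=100):
    """Preprocessed (image, label) pairs from the beans/ibean dataset for calibration."""
    samples = []
    for image, label in tfds.load("beans", split="train", as_supervised=True).take(limit):
        image = tf.keras.applications.vgg19.preprocess_input(tf.image.resize(image, IMG_SIZE))
        samples.append((image.numpy(), label.numpy()))
    return samples

calib_dataloader = DataLoader(framework="tensorflow", dataset=calib_samples())

def eval_func(model):
    # Placeholder: the real example measures top-1 accuracy on the validation split.
    return 1.0

conf = PostTrainingQuantConfig(
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01))  # allow <=1% relative loss

q_model = quantization.fit(model="model_keras.fp32", conf=conf,
                           calib_dataloader=calib_dataloader, eval_func=eval_func)
q_model.save("model_pb.int8")   # INT8 frozen-graph (PB) model
```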
If the quantization is executed on a Xeon® that does not support AMX, we can force-enable this feature by setting environment variables:
import os
os.environ["FORCE_BF16"] = "1"
os.environ["MIX_PRECISION_TEST"] = "1"
How to disable it?
import os
os.environ["FORCE_BF16"] = "0"
os.environ["MIX_PRECISION_TEST"] = "0"
This example is used to highlight this feature.
Function | Code | Input | Output |
---|---|---|---|
Train and quantize a CNN model | train_model.py | dataset: ibean | model_keras.fp32, model_pb.int8 |
Test performance | profiling_inc.py | model_keras.fp32, model_pb.int8 | 32.json, 8.json |
Compare the performance | compare_perf.py | 32.json, 8.json | stdout/stderr log file, fp32_int8_absolute.png, fp32_int8_times.png |
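As a rough illustration of what the performance test measures, the sketch below times batched inference with the saved FP32 Keras model and writes the throughput/latency numbers to a JSON file; the batch size, step count, key names, and file layout are assumptions for illustration, not the exact format produced by profiling_inc.py.

```python
# Sketch: measure throughput/latency of the saved model and dump the numbers to JSON.
import json
import time

import numpy as np
import tensorflow as tf

BATCH_SIZE = 32
STEPS = 50

model = tf.keras.models.load_model("model_keras.fp32")
data = np.random.rand(BATCH_SIZE, 224, 224, 3).astype(np.float32)

model.predict(data, verbose=0)                 # warm-up run
start = time.time()
for _ in range(STEPS):
    model.predict(data, verbose=0)
elapsed = time.time() - start

result = {
    "throughput_fps": BATCH_SIZE * STEPS / elapsed,
    "latency_ms": elapsed / STEPS * 1000,
}
with open("32.json", "w") as f:                # the INT8 run would write 8.json
    json.dump(result, f, indent=2)
print(result)
```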
Executing run_sample.sh in a shell will call the above scripts to finish the demo. Alternatively, execute inc_quantize_vgg19.ipynb in Jupyter Notebook to finish the demo.
It's recommended to use 4th Generation Intel® Xeon® Scalable Processors (SPR) or newer, which include:

- AVX-512 instructions to speed up training & inference of AI models.
- Intel® Advanced Matrix Extensions (AMX) to accelerate AI/DL inference with INT8/BF16 models.

The example also runs on other Intel CPUs. If the CPU supports Intel® Deep Learning Boost, the performance increase is significant; without it, the INT8 model may only reach about 1.x times the FP32 performance.
If you have no hardware platform that supports Intel® Advanced Matrix Extensions (AMX) or Intel® Deep Learning Boost, you can register for Intel® DevCloud and try this example for free on a newer Xeon® with Intel® Deep Learning Boost. To learn more about working with Intel® DevCloud, please refer to Intel® DevCloud.
Set up your own running environment on a local server or in the cloud (including Intel® DevCloud):
Create the virtual environment env_inc (using pip):
pip_set_env.sh
Activate it by:
source env_inc/bin/activate
Or create the virtual environment env_inc (using conda):
conda_set_env.sh
Activate it by:
conda activate env_inc
Start Jupyter Notebook:
./run_jupyter.sh
Please open inc_quantize_vgg19.ipynb in Jupyter Notebook.
After setting the right kernel, follow the guide in the notebook to run this demo.
This article assumes you are familiar with the Intel® DevCloud environment. To learn more about working with Intel® DevCloud, please refer to Intel® DevCloud. Specifically, this article assumes:
- You have an Intel® DevCloud account.
- You are familiar with the usage of Intel® DevCloud, such as logging in via an SSH client.
- You are familiar with Python and with AI model training and inference based on TensorFlow*.
- SSH to Intel® DevCloud, or open a terminal from a Jupyter notebook.
- Create the virtual environment env_inc:
./devcloud_setup_env.sh
Activate it by:
conda activate env_inc
If you have no SPR server, you can try Intel® DevCloud, which provides an SPR server running environment.
Submit the job to a compute node with the property 'clx' or 'icx', which support Intel® Deep Learning Boost (avx512_vnni), or 'spr', which supports Intel® Advanced Matrix Extensions (AMX).
!qsub run_in_intel_devcloud.sh -d `pwd` -l nodes=1:spr:ppn=2
28029.v-qsvr-nda.aidevcloud
Note: please run the above command on the login node. Running it on a compute node will produce an error like the one below:
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating uXXXXX/uXXXXX from s001-n054.aidevcloud)
Check the job status:
qstat
After the job finishes (successfully or with an error), log files are generated, for example:
- run_in_intel_devcloud.sh.o28029
- run_in_intel_devcloud.sh.e28029
Check the result in the latest log file:
tail -23 `ls -lAtr run_in_intel_devcloud.sh.o* | tail -1 | awk '{print $9}'`
Or check the result in a specific log file, such as run_in_intel_devcloud.sh.o28029:
!tail -23 run_in_intel_devcloud.sh.o1842253
Model FP32 INT8
throughput(fps) 572.4982883964987 X030.70552731285
latency(ms) 2.8339174329018104 X.128233714979522
accuracy(%) 0.9799 X.9796
Save to fp32_int8_absolute.png
Model FP32 INT8
throughput_times 1 X.293824608282245
latency_times 1 X.7509864932092611
accuracy_times 1 X.9996938463108482
Save to fp32_int8_times.png
Please check the PNG files to see the performance!
This demo is finished successfully!
Thank you!
########################################################################
# End of output for job 1842253.v-qsvr-1.aidevcloud
# Date: Thu 27 Jan 2022 07:05:52 PM PST
########################################################################
...
We can see the performance and accuracy of the FP32 and INT8 models. The performance increase is significant when running on a Xeon® with VNNI.
The demo creates the figure files fp32_int8_absolute.png and fp32_int8_times.png to show performance bar charts. They can be used in a report.
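For reference, here is a minimal sketch of how such comparison charts could be produced from the two JSON files; the key names follow the hypothetical format used in the profiling sketch earlier, not necessarily the exact output of compare_perf.py.

```python
# Sketch: read the FP32/INT8 profiling results and plot absolute and relative bar charts.
import json

import matplotlib
matplotlib.use("Agg")            # render PNG files without a display
import matplotlib.pyplot as plt

with open("32.json") as f:
    fp32 = json.load(f)
with open("8.json") as f:
    int8 = json.load(f)

metrics = ["throughput_fps", "latency_ms"]

# Absolute values side by side.
fig, axes = plt.subplots(1, len(metrics), figsize=(8, 3))
for ax, metric in zip(axes, metrics):
    ax.bar(["FP32", "INT8"], [fp32[metric], int8[metric]])
    ax.set_title(metric)
fig.savefig("fp32_int8_absolute.png", bbox_inches="tight")

# Relative values with FP32 as the 1.0 baseline.
ratios = {m: int8[m] / fp32[m] for m in metrics}
plt.figure(figsize=(4, 3))
plt.bar(list(ratios), list(ratios.values()))
plt.axhline(1.0, color="gray", linestyle="--")
plt.savefig("fp32_int8_times.png", bbox_inches="tight")
```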
Copy the result files from DevCloud to your host:
scp devcloud:~/xxx/*.png ./