the calculation of the FullyConnected layer takes a lot of time #7273

Closed
zhaohb opened this issue Aug 27, 2021 · 57 comments
Assignees
Labels
bug Something isn't working category: build OpenVINO cmake script / infra PSE support_request

Comments

@zhaohb
Contributor

zhaohb commented Aug 27, 2021

System information (version)
  • OpenVINO => 2021.4
  • Compiler => gcc

I have a model that I benchmarked with benchmark_app, and I found that the FullyConnected layers take a lot of time, about 25% of the total inference time:

dense/BiasAdd                 EXECUTED       layerType: FullyConnected     realTime: 4481      cpu: 4481           execType: jit_gemm_FP32
...
dense_2/BiasAdd               EXECUTED       layerType: FullyConnected     realTime: 4490      cpu: 4490           execType: jit_gemm_FP32
...

In this link https://toscode.gitee.com/vinsonSpace/openvino/blob/master/build-instruction.md I saw that GEMM can be accelerated through OpenBLAS or MKL:

[image attachment]

I want to use MKL:

cmake ..     -DENABLE_CLDNN=OFF     -DENABLE_OPENCV=OFF     -DENABLE_VPU=OFF     -DENABLE_PYTHON=ON    -DNGRAPH_ONNX_IMPORT_ENABLE=ON -DNGRAPH_ONNX_FRONTEND_ENABLE=ON  -DNGRAPH_ONNX_EDITOR_ENABLE=ON  -DGEMM=MKL -DMKLROOT=/work/compile_ov_20214/mklml_lnx_2019.0.5.20190502  -DCMAKE_INSTALL_PREFIX=/work/compile_ov_20214/openvino_dist

but I am warned that this GEMM macro is not available:

CMake Warning:
  Manually-specified variables were not used by the project:

    GEMM
    DMKLROOT
......

Doesn't OpenVINO support this macro anymore? How can I accelerate the FullyConnected layer?

@zhaohb zhaohb added bug Something isn't working support_request labels Aug 27, 2021
@zhaohb
Contributor Author

zhaohb commented Aug 27, 2021

My bad, the GEMM macro could be set in version 2021.1, but now I'm using 2021.4.
So how do I choose the GEMM implementation? Or will it automatically choose the best-performing implementation, MKL or OpenBLAS?

@Iffa-Intel

Iffa-Intel commented Aug 30, 2021

Hi,
according to the documentation, the default build uses an internal JIT GEMM implementation.
So if you didn't specify -DGEMM=OPENBLAS or -DGEMM=MKL, etc., in the build, it would automatically use the internal JIT GEMM.

@Iffa-Intel Iffa-Intel added category: build OpenVINO cmake script / infra and removed bug Something isn't working labels Aug 30, 2021
@zhaohb
Contributor Author

zhaohb commented Aug 30, 2021

But in 2021.4 I cannot find this macro. Why was it removed?

@zhaohb
Contributor Author

zhaohb commented Aug 31, 2021

@Iffa-Meah can you help me?

@Iffa-Intel Iffa-Intel added the PSE label Sep 1, 2021
@jgespino
Contributor

jgespino commented Sep 1, 2021

Hi @zhaohb

I see GEMM was removed starting with the OpenVINO 2021.2 release; I would have to check with the development team. Could you provide your model? I want to reproduce the behavior and get the development team's input as well.

Regards,
Jesus

@jgespino jgespino self-assigned this Sep 1, 2021
@zhaohb
Contributor Author

zhaohb commented Sep 2, 2021

OK, I will share my model with you later, but I want to know why GEMM was removed. Is the performance similar between the different implementations?
My model: https://drive.google.com/drive/folders/10FfO_AgJtJMJx5bcSEd-p0S6oeDWI1k-?usp=sharing (you can download it there).

@zhaohb
Contributor Author

zhaohb commented Sep 2, 2021

@jgespino I have tried to add GEMM back in 2021.4 but failed, so I hope you can add GEMM and test my model to see whether the performance of these methods is the same. Thank you very much.

@zhaohb
Contributor Author

zhaohb commented Sep 6, 2021

@jgespino Is there any progress now?

@jgespino
Contributor

jgespino commented Sep 7, 2021

Hi @zhaohb

Not yet, I have to check with the development team. I see GEMM was removed by pull request #5642.

Regards,
Jesus

Ref. 65047

@jgespino jgespino added the bug Something isn't working label Sep 7, 2021
@zhaohb
Contributor Author

zhaohb commented Sep 13, 2021

@jgespino I have added back the code that PR #5642 deleted, but I still can't compile successfully.
So when will this feature be available officially?

@zhaohb
Contributor Author

zhaohb commented Sep 13, 2021

@jgespino How can I tell whether GEMM=MKL was compiled in successfully?
By default, without GEMM=MKL, benchmark_app.py -pc 1 shows:

dense_1/BiasAdd               EXECUTED       layerType: FullyConnected     realTime: 3055      cpu: 3055           execType: jit_gemm_FP32

But after I added GEMM=MKL and compiled successfully, benchmark_app.py -pc still shows:

dense_1/BiasAdd               EXECUTED       layerType: FullyConnected     realTime: 3055      cpu: 3035           execType: jit_gemm_FP32

Should execType change after using MKL? The execution time does not seem to have changed.
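For reference, the same per-layer information that -pc prints can also be read programmatically. A minimal sketch, assuming the 2021.x openvino.inference_engine Python API with placeholder model paths; the counter field names (layer_type, exec_type, real_time) follow that API:

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

# Run one synchronous inference with dummy data so the counters get populated.
input_name = next(iter(net.input_info))
input_shape = net.input_info[input_name].input_data.shape
exec_net.infer({input_name: np.zeros(input_shape, dtype=np.float32)})

# get_perf_counts() maps each executed layer to its status, layer_type,
# exec_type and timings, the same data benchmark_app prints with -pc.
for layer_name, counters in exec_net.requests[0].get_perf_counts().items():
    if counters["layer_type"] == "FullyConnected":
        print(layer_name, counters["exec_type"], counters["real_time"], "us")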

@zhaohb
Contributor Author

zhaohb commented Sep 26, 2021

Who can give me some advice?

@jgespino
Contributor

Hi @zhaohb

I appreciate your patience, I've reached out to the development team for additional assistance.
I will let you know what I find out.

Regards,
Jesus

@zhaohb
Contributor Author

zhaohb commented Sep 28, 2021

@jgespino Thank you very much. Let me know if you find anything!

@zhaohb
Contributor Author

zhaohb commented Oct 14, 2021

@jgespino How is the progress now? I really need a solution to this problem.

@zhaohb
Contributor Author

zhaohb commented Oct 29, 2021

Hi, who can help me?

@dmitry-gorokhov
Contributor

Hi @zhaohb.
As you correctly mentioned before, we used to have alternative implementations for the matrix multiplication routines: MKL and OpenBLAS. By default we use oneDNN for such operations. We performed a huge amount of performance checks, which showed that oneDNN provides the best performance for matrix multiplication operations (layerType: FullyConnected) in all cases. That is the justification for the decision to drop support for the MKL and OpenBLAS options. In other words, OpenVINO should provide the best MatMul performance with the default options.
BTW, which HW are you using for benchmarking?

@zhaohb
Contributor Author

zhaohb commented Nov 3, 2021

@dmitry-gorokhov thank you for your reply.
[image attachment]

This is part of my model. There are many combinations of HW, such as 1490x256, 1490x4 and 256x256, and the slowest one should be 1490x256.
If it is not possible to accelerate FC at the operator level, can we optimize FC in other ways?
I also tried increasing the number of CPU cores, but nothing changed.

@zhaohb
Contributor Author

zhaohb commented Nov 4, 2021

@dmitry-gorokhov This part of the model is a bit wide. How can we increase parallelism in this part? I think that should improve performance.

@dmitry-gorokhov
Contributor

@zhaohb By HW I actually meant hardware :). It is important to know which system you are using for benchmarking because it affects the possible ways to improve performance.

@dmitry-gorokhov
Contributor

@zhaohb Glad to hear that. I expect the PR to be merged within 2 weeks.
It also seems you can try different values for the -nstreams parameter. For example, using 12 streams instead of 8 might improve throughput while preserving 100 ms latency. On the other hand, there might be cases where a large number of streams hurts performance because the L3 cache gets exhausted.
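If it helps, the same experiment can be expressed through the API instead of benchmark_app. A minimal sketch, assuming the 2021.x Python API and placeholder model paths; the stream and request counts are illustrative values to sweep, and CPU_THROUGHPUT_STREAMS is the CPU config key that -nstreams maps to:

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")

# Ask the CPU plugin for 12 execution streams and create one infer request per
# stream; throughput improves when enough async requests keep all streams busy.
exec_net = ie.load_network(
    network=net,
    device_name="CPU",
    config={"CPU_THROUGHPUT_STREAMS": "12"},
    num_requests=12,
)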

@zhaohb
Contributor Author

zhaohb commented Nov 10, 2021

@dmitry-gorokhov OK, I will try to find the optimal number of nstreams.
But I have another question: should I compile OpenVINO from the reduce_node_extension branch of https://github.com/xuchen-intel/openvino.git, or wait for this branch to be merged into the master branch and then compile OpenVINO? Is there a difference between the two approaches?
Which one do you recommend?

Thank you very much.

@dmitry-gorokhov
Contributor

@zhaohb There shouldn't be much difference in terms of performance, so you can use the feature branch for benchmarking.

@zhaohb
Contributor Author

zhaohb commented Nov 10, 2021

@dmitry-gorokhov It's more than just benchmarking: I want to add this branch to the Model Server so that the Model Server gets the best inference performance.

@zhaohb
Contributor Author

zhaohb commented Nov 12, 2021

@dmitry-gorokhov I compiled the reduce_node_extension branch, but found that I could not generate the OpenCV library. This is my compile command:

cmake .. -DCMAKE_BUILD_TYPE=Release -DENABLE_CLDNN=OFF -DENABLE_OPENCV=OFF -DTHREADING=TBB -DENABLE_GNA=OFF -DENABLE_VPU=OFF -DENABLE_PYTHON=ON -DNGRAPH_ONNX_FRONTEND_ENABLE=ON -DENABLE_OPENCV=ON -DCMAKE_INSTALL_PREFIX=/work/6686_openvino/out_6686_opencv/

but the output shows:

-- OpenVINO version is 2022.1.0
-- CMAKE_BUILD_TYPE: Release
CMake Warning at cmake/developer_package/clang_format/clang_format.cmake:21 (message):
  Supported clang-format version is not found!
Call Stack (most recent call first):
  cmake/developer_package/IEDevScriptsConfig.cmake:294 (include)
  CMakeLists.txt:11 (find_package)


CMake Warning at cmake/developer_package/ncc_naming_style/ncc_naming_style.cmake:26 (message):
  Please, install libclang-[N]-dev package (required for ncc naming style
  check)
Call Stack (most recent call first):
  cmake/developer_package/IEDevScriptsConfig.cmake:295 (include)
  CMakeLists.txt:11 (find_package)


-- clang package is installed, but may have different version (5.0). Please use "/usr/bin/python3 -m pip install clang==9.0".
-- Inference Engine enabled features:
--
--     CI_BUILD_NUMBER: custom_reduce_node_extension_24b77d73c44f7058f4b0d05b59e079a7b80ab467
--     ENABLE_LTO = OFF
--     OS_FOLDER = OFF
--     USE_BUILD_TYPE_SUBFOLDER = ON
--     TREAT_WARNING_AS_ERROR = ON
--     ENABLE_INTEGRITYCHECK = OFF
--     ENABLE_SANITIZER = OFF
--     ENABLE_UB_SANITIZER = OFF
--     ENABLE_THREAD_SANITIZER = OFF
--     ENABLE_COVERAGE = OFF
--     ENABLE_SSE42 = ON
--     ENABLE_AVX2 = ON
--     ENABLE_AVX512F = ON
--     BUILD_SHARED_LIBS = ON
--     ENABLE_FASTER_BUILD = OFF
--     ENABLE_CPPLINT = ON
--     ENABLE_CPPLINT_REPORT = OFF
--     ENABLE_CLANG_FORMAT = OFF
--     ENABLE_NCC_STYLE = OFF
--     VERBOSE_BUILD = OFF
--     ENABLE_UNSAFE_LOCATIONS = OFF
--     ENABLE_FUZZING = OFF
--     ENABLE_MKL_DNN = ON
--     ENABLE_TESTS = OFF
--     ENABLE_STRICT_DEPENDENCIES = ON
--     ENABLE_CLDNN = OFF
--     ENABLE_PROFILING_ITT = OFF
--     ENABLE_PROFILING_FILTER = ALL
--     ENABLE_PROFILING_FIRST_INFERENCE = ON
--     SELECTIVE_BUILD = OFF
--     ENABLE_ERROR_HIGHLIGHT = OFF
--     ENABLE_PYTHON = ON
--     ENABLE_DOCS = OFF
--     ENABLE_GNA = OFF
--     ENABLE_CLDNN_TESTS = OFF
--     THREADING = TBB
--     ENABLE_VPU = OFF
--     ENABLE_MYRIAD = OFF
--     ENABLE_MYRIAD_NO_BOOT = OFF
--     ENABLE_GAPI_TESTS = OFF
--     GAPI_TEST_PERF = OFF
--     ENABLE_MYRIAD_MVNC_TESTS = OFF
--     ENABLE_DATA = OFF
--     ENABLE_BEH_TESTS = OFF
--     ENABLE_FUNCTIONAL_TESTS = OFF
--     ENABLE_SAMPLES = 0
--     ENABLE_OPENCV = ON
--     ENABLE_V7_SERIALIZE = OFF
--     ENABLE_TBB_RELEASE_ONLY = ON
--     ENABLE_SYSTEM_PUGIXML = OFF
--     ENABLE_DEBUG_CAPS = OFF
--     ENABLE_GPU_DEBUG_CAPS = OFF
--     ENABLE_CPU_DEBUG_CAPS = OFF
--     NGRAPH_ONNX_FRONTEND_ENABLE = ON
--     NGRAPH_PDPD_FRONTEND_ENABLE = ON
--     NGRAPH_IR_FRONTEND_ENABLE = ON
--     NGRAPH_USE_PROTOBUF_LITE = ON
--     NGRAPH_USE_SYSTEM_PROTOBUF = OFF
--     OPENVINO_DEBUG_ENABLE = OFF
--     ENABLE_REQUIREMENTS_INSTALL = ON
--
-- MODELS_PATH=
-- PROJECT ............................... OpenVINO
-- CMAKE_BINARY_DIR ...................... /work/6686_openvino/openvino/build
-- OpenVINO_SOURCE_DIR ................... /work/6686_openvino/openvino
-- CMAKE_GENERATOR ....................... Unix Makefiles
-- CMAKE_C_COMPILER_ID ................... GNU
-- CMAKE_BUILD_TYPE ...................... Release
-- The name pugixml::static is an ALIAS for pugixml-static. It will be exported to the InferenceEngineDeveloperPackage with the original name.
-- The name gflags is an ALIAS for gflags_nothreads_static. It will be exported to the InferenceEngineDeveloperPackage with the original name.
--
-- 3.9.2.0
-- Found PythonInterp: /usr/bin/python3 (found version "3.8.10")
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.8.so (found version "3.8.10")
Generated: /work/6686_openvino/openvino/build/thirdparty/onnx/onnx/onnx/onnx_ngraph_onnx-ml.proto
Generated: /work/6686_openvino/openvino/build/thirdparty/onnx/onnx/onnx/onnx-operators_ngraph_onnx-ml.proto
Generated: /work/6686_openvino/openvino/build/thirdparty/onnx/onnx/onnx/onnx-data_ngraph_onnx.proto
--
-- ******** Summary ********
--   CMake version             : 3.16.3
--   CMake command             : /usr/bin/cmake
--   System                    : Linux
--   C++ compiler              : /usr/bin/c++
--   C++ compiler version      : 9.3.0
--   CXX flags                 : -Wsuggest-override  -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-error=parentheses  -Wformat -Wformat-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong -s -fsigned-char -Werror -ffunction-sections -fdata-sections -fdiagnostics-show-option -Wundef -Wreturn-type -Wunused-variable -Wuninitialized -Winit-self -Wmaybe-uninitialized -Wno-suggest-override -Wnon-virtual-dtor
--   Build type                : Release
--   Compile definitions       : IE_BUILD_POSTFIX="";ENABLE_MKL_DNN=1
--   CMAKE_PREFIX_PATH         :
--   CMAKE_INSTALL_PREFIX      : /work/6686_openvino/out_6686_opencv
--   CMAKE_MODULE_PATH         :
--
--   ONNX version              : 1.9.0
--   ONNX NAMESPACE            : ngraph_onnx
--   ONNX_USE_LITE_PROTO       : ON
--   USE_PROTOBUF_SHARED_LIBS  : OFF
--   ONNX_DISABLE_EXCEPTIONS   : OFF
--   ONNX_WERROR               : OFF
--   ONNX_BUILD_TESTS          : OFF
--   ONNX_BUILD_BENCHMARKS     : OFF
--   ONNXIFI_DUMMY_BACKEND     : OFF
--   ONNXIFI_ENABLE_EXT        : OFF
--
--   Protobuf compiler         :
--   Protobuf includes         :
--   Protobuf libraries        :
--   BUILD_ONNX_PYTHON         : OFF
-- The name openvino::pp is an ALIAS for openvino_preprocessor. It will be exported to the InferenceEngineDeveloperPackage with the original name.
-- The name openvino::itt is an ALIAS for itt. It will be exported to the InferenceEngineDeveloperPackage with the original name.
-- The name openvino::conditional_compilation is an ALIAS for conditional_compilation. It will be exported to the InferenceEngineDeveloperPackage with the original name.
-- The name ngraph::builder is an ALIAS for ngraph_builders. It will be exported to the InferenceEngineDeveloperPackage with the original name.
-- The name ngraph::reference is an ALIAS for ngraph_reference. It will be exported to the InferenceEngineDeveloperPackage with the original name.
-- nGraph unit tests disabled
-- pybind11 v2.8.0 dev1
-- Python version=python3.8
-- TBB: /work/6686_openvino/openvino/inference-engine/temp/tbb
-- GPU support is disabled
-- Primitive cache is disabled
-- Static tbbbind_2_4 package was found
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.8.so (found suitable version "3.8.10", minimum required is "3")
-- Found Cython version 0.29.24
CMake Warning at inference-engine/samples/common/format_reader/CMakeLists.txt:21 (message):
  OPENCV is disabled or not found, format_reader will be built without OPENCV
  support


-- Register template_plugin to be built in build-modules/template_plugin
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.8.10", minimum required is "3")
-- Configuring done
-- Generating done
-- Build files have been written to: /work/6686_openvino/openvino/build

and the directory structure of the output has changed, like this:

install_dependencies  python  runtime  samples  setupvars.sh  tools

The previous directory structure looked like this:

bin              deployment_tools  inference_engine      licensing  python
data_processing  documentation     install_dependencies  opencv

The new directory structure is problematic when I recompile the Model Server. What should I do?
Thank you very much.

@zhaohb
Contributor Author

zhaohb commented Nov 12, 2021

Maybe I should use the Model Server develop branch.

@zhaohb
Contributor Author

zhaohb commented Nov 16, 2021

@dmitry-gorokhov It's my fault: the width of the model is not the bottleneck for OpenVINO. The root problem is FC; if you have a lot of FC layers, performance degrades a lot.

@zhaohb
Contributor Author

zhaohb commented Nov 16, 2021

@dmitry-gorokhov Which file contains the operator implementation of FC? I want to try to optimize it.
Thank you very much.

@jgespino
Contributor

@zhaohb Just following up on this discussion, is this something you are still working on?

@zhaohb
Contributor Author

zhaohb commented Jan 11, 2022

@jgespino Yes, I am trying, but I also need some help with how to optimize FC; I don't have a particularly good method right now.

@jgespino
Contributor

@dmitry-gorokhov @zhaohb Could you provide some guidance on a possible approach to optimizing FC?

@jgespino
Contributor

@zhaohb Apologies for the delay in our response. Could you please grant me access to the original model that was converted to IR format? Is it included in the link below?

https://drive.google.com/drive/folders/10FfO_AgJtJMJx5bcSEd-p0S6oeDWI1k-?usp=sharing

@zhaohb
Contributor Author

zhaohb commented Sep 1, 2022

Yes, of course. I've whitelisted the [email protected] mailbox so it can access the model file.

@jgespino
Contributor

jgespino commented Sep 1, 2022

@zhaohb Received the invite, thank you! I don't see the original blue_c_concat_end.onnx model; is that something you can share?

@zhaohb
Contributor Author

zhaohb commented Sep 2, 2022

@jgespino Yes, it can be shared; I've uploaded it.
By the way, are you going to optimize it? Thank you very much.

@jgespino
Contributor

jgespino commented Sep 2, 2022

@zhaohb Thanks! Yes, I want to test it on the latest OpenVINO release and see if the performance improved. I'll need to find a system with a processor similar to Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz.

We have a pre-release version of OpenVINO 2022.2 release on PyPI in case you want to try it from your side as well.
https://pypi.org/project/openvino-dev/2022.2.0.dev20220829/

Regards,
Jesus

@zhaohb
Contributor Author

zhaohb commented Sep 5, 2022

@jgespino What MatMul/GEMM optimizations does 2022.2 have compared to the previous version? I have tested 2022.1, but there was no improvement over the previous version.

@avitial avitial self-assigned this Oct 5, 2022
@zhaohb
Contributor Author

zhaohb commented Oct 14, 2022

@jgespino @avitial Have you made any progress?

@avitial
Contributor

avitial commented Oct 17, 2022

@zhaohb I don't have access to the same Xeon Skylake processor as you do, but testing on an Ice Lake Intel® Xeon® Platinum 8368 CPU I can see some improvement in the FullyConnected layers between OpenVINO versions (2021.4.1 vs 2022.2). This test used the model you shared with us.

In the 2022.2 release, 5 of the 19 FullyConnected layers run as brgemm_avx512_FP32 and 14 of 19 as jit_gemm_FP32, whereas in the 2021.4.1 release all 19 FC layers execute as jit_gemm_FP32.

The cumulative time, roughly, for all FullyConnected layers is 11.72 ms in 2021.4.1, whereas in 2022.2 it is 1.40e-5 ms.

I'm not sure this type of improvement is expected in your environment/configuration, but it might be worthwhile trying it out with 2022.2. Note that in the table below, jit_gemm_FP32;1.215 represents exec_type and exec_time in ms.

$ benchmark_app -m 2022.2/blue_c_concat_end.xml -d CPU -niter 10000 -api async -nstreams 8 -hint none

 

Layer             2021.4.1-3926-14e67d86634-releases/2021/4   2022.2.0-7713-af16ea1d79a-releases/2022/2
dense_1/BiasAdd   jit_gemm_FP32; 1.215                        brgemm_avx512_FP32; 0:00:00.000002
dense_2/BiasAdd   jit_gemm_FP32; 1.195                        brgemm_avx512_FP32; 0:00:00.000002

@avitial
Contributor

avitial commented Oct 27, 2022

Closing this, I hope previous responses were sufficient to help you proceed. Feel free to reopen and ask additional questions related to this topic.

@avitial avitial closed this as completed Oct 27, 2022
@akote123

akote123 commented May 22, 2024

Hi @avitial,
On Graviton3 (aarch64), does the FullyConnected op use OpenBLAS or libxsmm, and how can we check which library is being used?

@dmitry-gorokhov
Contributor

dmitry-gorokhov commented May 22, 2024

Hi @akote123.
OpenVINO uses the ACL library on all ARM platforms.
If you are running benchmark_app you can add -pc, which provides additional details about the executed operations. The output includes a primType field, which should help clarify which backend is used.
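The same details can also be pulled through the Python API rather than benchmark_app. A minimal sketch, assuming an OpenVINO 2022.1+ runtime and a placeholder model path; profiling is enabled via the PERF_COUNT config key, and the exec_type field then names the selected kernel/backend:

import numpy as np
from openvino.runtime import Core

core = Core()
compiled = core.compile_model("model.xml", "CPU", {"PERF_COUNT": "YES"})
request = compiled.create_infer_request()

# One inference with dummy data so per-node profiling data is collected.
port = compiled.input(0)
request.infer({port: np.zeros(port.shape, dtype=np.float32)})

for info in request.profiling_info:
    # node_type is e.g. FullyConnected/MatMul; exec_type names the chosen primitive.
    print(info.node_name, info.node_type, info.exec_type, info.real_time)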

@NishantPrabhuFujitsu
Contributor

NishantPrabhuFujitsu commented May 22, 2024

@dmitry-gorokhov Continuing on @akote123's question: is ACL used through oneDNN or independently? I noticed that nodes of type FullyConnected are not being executed through oneDNN (calls to those nodes don't show up in the ONEDNN_VERBOSE logs) but MatMul nodes are.

@dmitry-gorokhov
Contributor

"Is ACL used through oneDNN or independently?"

It depends on the operation. For Convolution, MatMul and FullyConnected we use oneDNN, which falls back on ACL internally. This actually gives us the ability to leverage SVE kernels as well: https://github.com/openvinotoolkit/oneDNN/blob/v3.3_for_ie_master/src/cpu/cpu_convolution_list.cpp#L118-L120

I cannot say for sure why FC is not visible in the ONEDNN_VERBOSE log.

@NishantPrabhuFujitsu
Contributor

NishantPrabhuFujitsu commented May 22, 2024

@dmitry-gorokhov I see... I should probably give a bigger picture of what I'm trying to do.

I've been trying to run this script for LLaMA-2 from openvino.genai and determine the backend path followed for aarch64. I found an issue on the same repo which demonstrated how to collect profiling information with the primitives used for each operation during inference. I transferred that script to samples/ in this repo (with necessary changes) and built it along with the other samples.

For aarch64, I observed that FullyConnected layers fell back to a reference implementation (the ref_any_f16 primitive) while MatMul layers used GEMM kernels from ACL. When I overrode this function, all FullyConnected layers remained as MatMul layers and were executed using gemm:acl as expected. However, their execution times were ~10x slower than OSS oneDNN (v3.3.3, benchmarked with benchDNN).

A bit more investigation into the ONEDNN_VERBOSE logs revealed that matmuls in benchDNN were using the blocked implementation (wei_f16:a:blocked:aCb16c::f0) while calls to oneDNN from OpenVINO used the plain implementation (wei_f16:a:blocked:acb::f0). Forcing benchDNN to use the plain implementation makes its execution as slow as OpenVINO's, leading me to believe that the slowdown on aarch64 is due to the blocked layout not being used for matmul.
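(For anyone reproducing this: the verbose logs above come from oneDNN's standard ONEDNN_VERBOSE environment variable. A minimal sketch, assuming a 2022.1+ OpenVINO Python runtime and a placeholder model path with static input shapes:)

import os
os.environ["ONEDNN_VERBOSE"] = "1"  # enable oneDNN primitive logging before inference runs

import numpy as np
from openvino.runtime import Core

core = Core()
compiled = core.compile_model("model.xml", "CPU")
request = compiled.create_infer_request()
port = compiled.input(0)
request.infer({port: np.zeros(port.shape, dtype=np.float32)})
# oneDNN then prints one "onednn_verbose,exec,cpu,..." line per executed primitive,
# including the weight memory descriptor (blocked vs. plain layout).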

My question: Do you know why gemm:acl matmul doesn't use the blocked implementation? Are there any flags/modifications to be done during build to enable it?

Please note:

  • I also ran this on an x86 (Sapphire Rapids) machine; matmul executions aren't blocked there either. However, the brgemm_avx512_bf16 kernel gets called and it provides good execution times even without blocking.
  • I've also raised this on openvino.genai (this and this issue) but I'm waiting for any resolution there. I decided to ask here since I'm building my script with openvino source now.

@dmitry-gorokhov
Contributor

dmitry-gorokhov commented May 22, 2024

@NishantPrabhuFujitsu
We basically have two different operations to describe matrix multiplication math. FullyConnected is used when the second input contains constant values (weights), while MatMul is used when the second input is dynamic. The blocked layout is applied to FC weights only; MatMul does not apply the blocked layout, to avoid a data reorder on each iteration (since the data is dynamic). So by disabling the ConvertMatMulToFullyConnected pass you prevent the runtime from using the dedicated FullyConnected operation.
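To make the split concrete, here is a minimal sketch (recent OpenVINO Python API; the model is a toy example built in code) of the two cases: the same MatMul op with constant weights typically becomes a FullyConnected node on CPU, while a MatMul with a dynamic second input stays a MatMul. The resulting node types can then be checked with benchmark_app -pc or the profiling sketch earlier in the thread.

import numpy as np
from openvino.runtime import Core, Model
from openvino.runtime import opset8 as ops

a = ops.parameter([1, 256], np.float32, name="act")
w = ops.constant(np.random.rand(256, 256).astype(np.float32))  # constant weights
b = ops.parameter([256, 256], np.float32, name="dyn")          # dynamic second input

fc_like = ops.matmul(a, w, transpose_a=False, transpose_b=False)  # constant weights -> usually FullyConnected on CPU
mm_like = ops.matmul(a, b, transpose_a=False, transpose_b=False)  # dynamic input -> stays MatMul on CPU

model = Model([fc_like, mm_like], [a, b], "fc_vs_matmul")
compiled = Core().compile_model(model, "CPU")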

The issue with FC (which falls back on the ref impl) was caused by a bug on the ACL side. The team already shared the related patches with us and @alvoron incorporated them into the OV runtime: openvinotoolkit/openvino.genai#438 (comment). So with that custom OV version I would expect the correct oneDNN/ACL impls to be chosen.

@NishantPrabhuFujitsu
Contributor

NishantPrabhuFujitsu commented May 23, 2024

@dmitry-gorokhov Thank you for providing clarity on how MatMul and FullyConnected nodes work. I tried out the patches shipped by @alvoron, and the issue I was facing has been resolved. Thanks again for your support.
