
xe: jit: gemm: implement qqw mul instructions #2375

Merged

Conversation

Simonsays095 (Contributor)

Fixes MFDNN-12978. The following layer (among others) is failing:

./benchdnn --mode=F --matmul --engine=gpu --allow-enum-tags-only=false --memory-kind=usm_device --cold-cache=all --dt=f16:f16:f16 --stag=abc --wtag=abc --dtag=abc 192x663x64:192x64x663
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5890fc6)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:OpenCL
onednn_verbose,v1,info,gpu,engine,opencl device count:2
onednn_verbose,v1,info,gpu,engine,0,name:Intel(R) Data Center GPU Max 1550,driver_version:24.39.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,1,name:Intel(R) Data Center GPU Max 1550,driver_version:24.39.31294,binary_kernels:enabled
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,error,gpu,jit::gemm,Unimplemented,src/gpu/intel/jit/gemm/gen_gemm_kernel.cpp:979
Error: Function 'create_primitive' at (/home/gta/simonewi/oneDNN/tests/benchdnn/dnnl_common.hpp:422) returned 'runtime_error'
Error: Function 'init_prim' at (/home/gta/simonewi/oneDNN/tests/benchdnn/dnnl_common.hpp:475) returned '1'
Error: Function 'createit' at (/home/gta/simonewi/oneDNN/tests/benchdnn/matmul/matmul.cpp:897) returned '1'
Error: Function 'create' at (/home/gta/simonewi/oneDNN/tests/benchdnn/utils/task.hpp:49) returned '1'
0:UNTESTED_FAILED __REPRO: --mode=F --matmul --engine=gpu --allow-enum-tags-only=false --memory-kind=usm_device --cold-cache=all --dt=f16:f16:f16 --stag=abc --wtag=abc --dtag=abc 192x663x64:192x64x663
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,gpu,,,--mode=F --matmul --engine=gpu --allow-enum-tags-only=false --memory-kind=usm_device --cold-cache=all --dt=f16:f16:f16 --stag=abc --wtag=abc --dtag=abc 192x663x64:192x64x663,10.8028,0.168213,0,0,0,0
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0
total perf: min(ms):0 avg(ms):0
total: 0.04s; fill: 0.00s (0%);

#2276 introduced 64-bit arithmetic for large-buffer support in the gemm kernels, but some cases (specifically those involving word operands) were left unimplemented, which resulted in runtime errors such as the one above. This PR implements those remaining cases via word -> dword conversion.
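
For reference, a minimal host-side sketch of the idea (illustrative C++ only; the helper names here are not the kernel's). The existing emulation builds the low 64 bits of a product from 32-bit partial products; a word operand is first widened (sign- or zero-extended, matching its signedness) so the same path applies.

#include <cassert>
#include <cstdint>

// Low 64 bits of a product assembled from 32-bit pieces, as a dword-based
// emulation would compute them. Two's complement makes the low 64 bits the
// same for signed and unsigned inputs.
static uint64_t mul64_lo(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint64_t lo  = (uint64_t)a_lo * b_lo;                         // full 32x32 product
    uint64_t mid = (uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo; // only its low 32 bits survive the shift
    return lo + (mid << 32);
}

// Word src1: widen first (the "word -> dword conversion"; in the kernel the
// widened value lives in a dword temporary), then reuse the generic path.
static uint64_t mul_q_by_w(uint64_t src0, int16_t src1, bool s1_signed) {
    uint64_t widened = s1_signed ? (uint64_t)(int64_t)src1
                                 : (uint64_t)(uint16_t)src1;
    return mul64_lo(src0, widened);
}

int main() {
    uint64_t a = 0x0000000300000001ULL;
    assert(mul_q_by_w(a, -7, true) == a * (uint64_t)(int64_t)(int16_t)-7);
    assert(mul_q_by_w(a, 663, false) == a * 663u);
    return 0;
}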

Simonsays095 requested a review from a team as a code owner on January 11, 2025 at 00:54.
The github-actions bot added the platform:gpu-intel label (codeowner: @oneapi-src/onednn-gpu-intel) on January 11, 2025.
return *reinterpret_cast<ngen::RegData *>(&src1);
g.mov(1, temp, src1);
return temp;
}
Contributor (reviewer):

I think we should add emulation specific to this case to the s1Q case below. It should be possible to accomplish this in 4 instructions (mul, mul, add, mov) rather than the proposed 7 (mov, mul, macl, mul, mach, add, mov), and it will also enable the qwq case (the arithmetic is sketched below).

Longer term, it may also make sense to refactor so that the dstQ && s0Q case contains the actual implementation and the dstQ && s1Q case contains the recursive implementation. I think this will provide some simplification, as we know S1 is not an immediate in that case, removing the above mov logic.
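
Roughly, the arithmetic behind the shorter sequence (a sketch of the unsigned case only, not the kernel's actual instruction selection; the signed case additionally needs the sign extension of src1 folded in). With a word src1 the high dword of src1 is zero, so only two partial products remain:

#include <cstdint>

// src0 * src1 (mod 2^64) when src1 fits in a word:
//   lo(src0) * src1  +  (hi(src0) * src1) << 32
// i.e. two multiplies plus an add into the upper half, plus a final mov of
// the result -- roughly the mul, mul, add, mov count suggested above.
static uint64_t mul_q_by_uw(uint64_t src0, uint16_t src1) {
    uint64_t p_lo = (uint64_t)(uint32_t)src0 * src1;          // lo(src0) * src1, up to 48 bits
    uint64_t p_hi = (uint64_t)(uint32_t)(src0 >> 32) * src1;  // hi(src0) * src1
    return p_lo + (p_hi << 32);  // the 64-bit add carries bits 32..47 of p_lo into the high half
}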

Simonsays095 (author):

We may be able to reduce it to 3 instructions, actually, using 48-bit precision accumulation in dw x w mul instructions. I created MFDNN-103006 to track this optimization.

This PR fixes the broken layers; I will make the suggested optimization changes in an upcoming PR, since there's a bit of exploration needed to find the best instructions to use.

auto temp = s1Signed ? state.temp[0].d() : state.temp[0].ud();
auto &src1Reg = [&]() -> ngen::RegData & {
if (std::is_base_of<ngen::RegData, S1>::value)
if (s1Immed || s1W) {
g.mov(1, temp, src1);
Contributor (reviewer):

There is an issue here, where only the first entry of src1 will be copied in cases where it is not broadcast.

Simonsays095 (author):

I changed the 1 to mod, so the intent of the emul call (broadcast or not) is preserved for this mov instruction. Since src0 is a qw, the exec size is limited to 2 GRFs of qw, or equivalently 1 GRF of dw, so this will never spill out of temp.
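
As a rough sanity check of that claim (assuming 64-byte GRFs, as on the Xe-HPC devices in the log above, and at most two GRFs per operand region; the numbers below are illustrative, not taken from the kernel source):

#include <cstddef>
#include <cstdint>

constexpr std::size_t grf_bytes = 64;                                      // Xe-HPC GRF size (assumption)
constexpr std::size_t max_qw_exec = 2 * grf_bytes / sizeof(std::uint64_t); // 16 channels for a 2-GRF qw operand
constexpr std::size_t dw_temp_bytes = max_qw_exec * sizeof(std::uint32_t); // dw temp footprint at that exec size
static_assert(dw_temp_bytes <= grf_bytes, "the dword temp fits in a single GRF");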

Simonsays095 (author):

make test perf-gpu
set primitive=matmul ip

Simonsays095 force-pushed the simonewi/fixup_emulation_qqw branch from 0e74281 to 6750c6c on January 17, 2025 at 16:45.
Simonsays095 (author):

make test
disable test_device_cpu
enable test_device_gpu

Simonsays095 (author):

make test perf-gpu
set primitive=matmul ip

Simonsays095 force-pushed the simonewi/fixup_emulation_qqw branch from 6750c6c to d5a00a4 on January 18, 2025 at 00:08.
Simonsays095 merged commit 301eb0d into oneapi-src:main on January 18, 2025 (3 of 4 checks passed).