
xe: jit: gemm: implement qqw mul instructions #2375

Merged

Conversation

Simonsays095 (Contributor)

Fixes MFDNN-12978. The following layer (among others) is failing:

./benchdnn --mode=F --matmul --engine=gpu --allow-enum-tags-only=false --memory-kind=usm_device --cold-cache=all --dt=f16:f16:f16 --stag=abc --wtag=abc --dtag=abc 192x663x64:192x64x663
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5890fc6)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:OpenCL
onednn_verbose,v1,info,gpu,engine,opencl device count:2
onednn_verbose,v1,info,gpu,engine,0,name:Intel(R) Data Center GPU Max 1550,driver_version:24.39.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,1,name:Intel(R) Data Center GPU Max 1550,driver_version:24.39.31294,binary_kernels:enabled
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,error,gpu,jit::gemm,Unimplemented,src/gpu/intel/jit/gemm/gen_gemm_kernel.cpp:979
Error: Function 'create_primitive' at (/home/gta/simonewi/oneDNN/tests/benchdnn/dnnl_common.hpp:422) returned 'runtime_error'
Error: Function 'init_prim' at (/home/gta/simonewi/oneDNN/tests/benchdnn/dnnl_common.hpp:475) returned '1'
Error: Function 'createit' at (/home/gta/simonewi/oneDNN/tests/benchdnn/matmul/matmul.cpp:897) returned '1'
Error: Function 'create' at (/home/gta/simonewi/oneDNN/tests/benchdnn/utils/task.hpp:49) returned '1'
0:UNTESTED_FAILED __REPRO: --mode=F --matmul --engine=gpu --allow-enum-tags-only=false --memory-kind=usm_device --cold-cache=all --dt=f16:f16:f16 --stag=abc --wtag=abc --dtag=abc 192x663x64:192x64x663
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,gpu,,,--mode=F --matmul --engine=gpu --allow-enum-tags-only=false --memory-kind=usm_device --cold-cache=all --dt=f16:f16:f16 --stag=abc --wtag=abc --dtag=abc 192x663x64:192x64x663,10.8028,0.168213,0,0,0,0
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0
total perf: min(ms):0 avg(ms):0
total: 0.04s; fill: 0.00s (0%);

#2276 introduced 64-bit arithmetic for large-buffer support in the gemm kernels, but some cases (specifically those involving word operands) were left unimplemented, which resulted in runtime errors such as the one above. This PR implements those remaining cases via word -> dword conversion.
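
For reference, a minimal host-side sketch of the idea (illustrative C++ only; the helper names here are not the kernel's). The existing emulation builds the low 64 bits of a product from 32-bit partial products; a word operand is first widened (sign- or zero-extended, matching its signedness) so the same path applies.

#include <cassert>
#include <cstdint>

// Low 64 bits of a product assembled from 32-bit pieces, as a dword-based
// emulation would compute them. Two's complement makes the low 64 bits the
// same for signed and unsigned inputs.
static uint64_t mul64_lo(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint64_t lo  = (uint64_t)a_lo * b_lo;                         // full 32x32 product
    uint64_t mid = (uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo; // only its low 32 bits survive the shift
    return lo + (mid << 32);
}

// Word src1: widen first (the "word -> dword conversion"; in the kernel the
// widened value lives in a dword temporary), then reuse the generic path.
static uint64_t mul_q_by_w(uint64_t src0, int16_t src1, bool s1_signed) {
    uint64_t widened = s1_signed ? (uint64_t)(int64_t)src1
                                 : (uint64_t)(uint16_t)src1;
    return mul64_lo(src0, widened);
}

int main() {
    uint64_t a = 0x0000000300000001ULL;
    assert(mul_q_by_w(a, -7, true) == a * (uint64_t)(int64_t)(int16_t)-7);
    assert(mul_q_by_w(a, 663, false) == a * 663u);
    return 0;
}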

Simonsays095 requested a review from a team as a code owner on January 11, 2025 at 00:54.
The github-actions bot added the platform:gpu-intel label (codeowner: @oneapi-src/onednn-gpu-intel) on January 11, 2025.
return *reinterpret_cast<ngen::RegData *>(&src1);
g.mov(1, temp, src1);
return temp;
}
Contributor (reviewer):

I think we should add emulation specific to this case to the s1Q case below. It should be possible to accomplish this in 4 instructions (mul, mul, add, mov) rather than the proposed 7 (mov, mul, macl, mul, mach, add, mov), and it will also enable the qwq case (the arithmetic is sketched below).

Longer term, it may also make sense to refactor so that the dstQ && s0Q case contains the actual implementation and the dstQ && s1Q case contains the recursive implementation. I think this will provide some simplification, as we know S1 is not an immediate in that case, removing the above mov logic.
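
Roughly, the arithmetic behind the shorter sequence (a sketch of the unsigned case only, not the kernel's actual instruction selection; the signed case additionally needs the sign extension of src1 folded in). With a word src1 the high dword of src1 is zero, so only two partial products remain:

#include <cstdint>

// src0 * src1 (mod 2^64) when src1 fits in a word:
//   lo(src0) * src1  +  (hi(src0) * src1) << 32
// i.e. two multiplies plus an add into the upper half, plus a final mov of
// the result -- roughly the mul, mul, add, mov count suggested above.
static uint64_t mul_q_by_uw(uint64_t src0, uint16_t src1) {
    uint64_t p_lo = (uint64_t)(uint32_t)src0 * src1;          // lo(src0) * src1, up to 48 bits
    uint64_t p_hi = (uint64_t)(uint32_t)(src0 >> 32) * src1;  // hi(src0) * src1
    return p_lo + (p_hi << 32);  // the 64-bit add carries bits 32..47 of p_lo into the high half
}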

Simonsays095 (author):

We may be able to reduce it to 3 instructions, actually, using 48-bit precision accumulation in dw x w mul instructions. I created MFDNN-103006 to track this optimization.

This PR fixes the broken layers; I will make the suggested optimization changes in an upcoming PR, since there's a bit of exploration needed to find the best instructions to use.

auto temp = s1Signed ? state.temp[0].d() : state.temp[0].ud();
auto &src1Reg = [&]() -> ngen::RegData & {
if (std::is_base_of<ngen::RegData, S1>::value)
if (s1Immed || s1W) {
g.mov(1, temp, src1);
Contributor (reviewer):

There is an issue here, where only the first entry of src1 will be copied in cases where it is not broadcast.

Simonsays095 (author):

I changed the 1 to mod, so the intent of the emul call (broadcast or not) is preserved for this mov instruction. Since src0 is a qw, the exec size is limited to 2 GRFs of qw, or equivalently 1 GRF of dw, so this will never spill out of temp.
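
As a rough sanity check of that claim (assuming 64-byte GRFs, as on the Xe-HPC devices in the log above, and at most two GRFs per operand region; the numbers below are illustrative, not taken from the kernel source):

#include <cstddef>
#include <cstdint>

constexpr std::size_t grf_bytes = 64;                                      // Xe-HPC GRF size (assumption)
constexpr std::size_t max_qw_exec = 2 * grf_bytes / sizeof(std::uint64_t); // 16 channels for a 2-GRF qw operand
constexpr std::size_t dw_temp_bytes = max_qw_exec * sizeof(std::uint32_t); // dw temp footprint at that exec size
static_assert(dw_temp_bytes <= grf_bytes, "the dword temp fits in a single GRF");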

Simonsays095 (author):

make test perf-gpu
set primitive=matmul ip

Simonsays095 force-pushed the simonewi/fixup_emulation_qqw branch from 0e74281 to 6750c6c on January 17, 2025 at 16:45.
Simonsays095 (author):

make test
disable test_device_cpu
enable test_device_gpu

Simonsays095 (author):

make test perf-gpu
set primitive=matmul ip

Simonsays095 force-pushed the simonewi/fixup_emulation_qqw branch from 6750c6c to d5a00a4 on January 18, 2025 at 00:08.
Simonsays095 merged commit 301eb0d into oneapi-src:main on January 18, 2025 (3 of 4 checks passed).