-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xe: jit: gemm: implement qqw mul instructions #2375
xe: jit: gemm: implement qqw mul instructions #2375
Conversation
return *reinterpret_cast<ngen::RegData *>(&src1); | ||
g.mov(1, temp, src1); | ||
return temp; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add emulation specific to this case to the s1Q
case below. It should be possible to accomplish this in 4 instructions (mul, mul, add, mov) rather than the proposed 7 (mov, mul, macl, mul, mach, add, mov) and it will also enable the qwq
case as well.
Longer term, it may also make sense to refactor so that dstQ && s0Q
case contain the actual implementations, and the dstQ && s1Q
contain the recursive implementation. I think this will provide some simplification as we know S1
is not an immediate in that case, removing the above mov
logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may be able to reduce it to 3 instructions, actually, using 48-bit precision accumulation in dw x w mul instructions. I created MFDNN-103006 to track this optimization.
This PR fixes the broken layers, and I will make the suggested optimization changes in an upcoming PR since there's a bit of exploring to be done finding the best instructions to use.
src/gpu/intel/jit/emulation.hpp
Outdated
auto temp = s1Signed ? state.temp[0].d() : state.temp[0].ud(); | ||
auto &src1Reg = [&]() -> ngen::RegData & { | ||
if (std::is_base_of<ngen::RegData, S1>::value) | ||
if (s1Immed || s1W) { | ||
g.mov(1, temp, src1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an issue here, where only the first entry of src1
will be copied in cases where it is not broadcast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed 1
to mod
. The intention when calling emul
(either broadcast or not) will be preserved for this mov
instruction. Since src0
is a qw, exec size is limited to 2 GRFs of qw or 1 GRF of dw, so this will never spill out of temp
.
6984c45
to
0e74281
Compare
make test perf-gpu |
0e74281
to
6750c6c
Compare
make test |
make test perf-gpu |
6750c6c
to
d5a00a4
Compare
Fixes MFDNN-12978. The following layer (among others) is failing:
#2276 Introduced 64-bit arithmetic for large buffer support in gemm kernels, but some cases (specifically related to word operands) were left unimplemented which resulted in runtime errors. This implements those remaining cases via word -> dword conversion.