Understanding this difference is really important.
torch.nn.Linear does the following: y = xW^t + b
x - input hidden states
W - weights
This operation typically accounts for ~90% of inference time.
It can be written as the following: y = x @ (trans(W)) + b
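A minimal PyTorch sketch of this (shapes are just illustrative), checking the written-out form against torch.nn.Linear:

```python
import torch

batch, d_in, d_out = 2, 4, 3
x = torch.randn(batch, d_in)           # input hidden states
linear = torch.nn.Linear(d_in, d_out)  # stores W with shape (d_out, d_in) and b with shape (d_out,)

W, b = linear.weight, linear.bias
y_module = linear(x)     # y = xW^t + b
y_manual = x @ W.T + b   # the same thing, written out by hand

assert torch.allclose(y_module, y_manual)
```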
GGML does things very differently, for good reason, with its function ggml_mul_mat(ctx, W, x): y^t = Wx^t
It can be written as the following: trans(y) = W @ (trans(x))
Why are these so different?
The crux of the issue is quantization.
During row-wise quantization (the most common scheme), we pack an $m \times n$ matrix into an $m \times n/p$ matrix, where $p$ is the pack factor (e.g. quantizing 4 f32 values to 8 bits each and packing them into a single u32 gives a pack factor of 4).
See below for an example.
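Here's a rough Python sketch of what that packing looks like: 8-bit row-wise quantization, 4 values per u32. The helper name and the naive per-row scale are my own illustration, not GGML's or ratchet's actual scheme.

```python
import numpy as np

def pack_rowwise_q8(w_f32: np.ndarray) -> np.ndarray:
    """Pack an (m, n) f32 matrix into (m, n/4) u32 words: 4 x 8-bit values per word."""
    m, n = w_f32.shape
    assert n % 4 == 0
    scale = np.abs(w_f32).max(axis=1, keepdims=True) / 127.0   # one scale per row
    q = np.clip(np.round(w_f32 / scale), -127, 127).astype(np.int8)
    # Reinterpret each group of 4 consecutive int8s in a row as a single u32.
    return q.reshape(m, n // 4, 4).view(np.uint32).reshape(m, n // 4)

W = np.random.randn(8, 16).astype(np.float32)
packed = pack_rowwise_q8(W)
print(W.shape, "->", packed.shape)   # (8, 16) -> (8, 4): pack factor of 4

# Reading a *column* of W now means extracting one byte from a different u32 in
# every row -- that's the swizzling a transposed matmul would force on you.
```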
Consecutive floats are now packed into a single value. This causes BIG problems if you want to do a transposed matrix multiply (i.e. you can't without swizzling).
Therefore, you must forbid transposing quantized weights.
How then can we get the same result as PyTorch for our matmul?
Well, you can rewrite the torch.nn.Linear behaviour as follows:
y = xW^t + b -> y^t = Wx^t + b^t.
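A quick numerical check of that identity (a sketch; note the bias is broadcast along the other axis in the transposed form):

```python
import torch

x = torch.randn(2, 4)   # hidden states, (batch, d_in)
W = torch.randn(3, 4)   # weights, (d_out, d_in)
b = torch.randn(3)      # bias, (d_out,)

y = x @ W.T + b               # PyTorch layout:  y   = x W^t + b
y_t = W @ x.T + b[:, None]    # GGML layout:     y^t = W x^t + b^t

assert torch.allclose(y, y_t.T)
```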
As you can see, this is what GGML has done.
Therefore they can happily dequantize and then matmul without having to transpose first!
Furthermore, if y is a vector, then y^t = y, so you get happy memory access patterns on both the weight and the hidden states!
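Concretely, for a single token (a tiny sketch):

```python
import torch

W = torch.randn(3, 4)   # weights, (d_out, d_in)
x = torch.randn(4)      # a single token's hidden state, (d_in,)

# W @ x already *is* y -- no transpose needed afterwards -- and each dot product
# walks a contiguous row of W plus the contiguous x.
assert torch.allclose(W @ x, x @ W.T)
```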
So, you'll see throughout ratchet that the weight is always the LHS of the matmul.