
Algorithm explanation

Basic Idea

Linear Layer:

$Y = W \cdot X$

Fine-tuning is to learn the update $W'$ in

$Y = W \cdot X + W' \cdot X$

and normally $shape(W') = shape(W)$


Conventional Methods

Ref

Linear:

$Y_{out \times batch} = W_{out \times in} \cdot X_{in \times batch}$

$\xrightarrow{} Y_{out \times batch} = W_{out \times in} \cdot X_{in \times batch} + Wa_{out \times dim} \cdot Wb_{dim \times in} \cdot X_{in \times batch}$
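A minimal sketch of this linear case in PyTorch (batch-first convention; the module and names like `LoRALinear` and `dim` are illustrative, not the library's actual API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a low-rank update Wa @ Wb with rank <= dim."""
    def __init__(self, in_features, out_features, dim=4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)            # W is frozen
        self.Wa = nn.Parameter(torch.zeros(out_features, dim))
        self.Wb = nn.Parameter(torch.randn(dim, in_features) * 0.01)

    def forward(self, x):
        # Y = W @ X + Wa @ Wb @ X  (Wa starts at zero, so the delta starts at zero)
        return self.base(x) + x @ self.Wb.T @ self.Wa.T
```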

LoRA for Convolution: consider the im2col view that turns convolution into a matmul first:

$X:[channel, width, height]$

$\xrightarrow{reorder}[c \times kw \times kh, outw \times outh]$

$Kernels: [out, c, kw, kh] \xrightarrow{reshape} [out, c \times kw \times kh]$

$Conv(X, Kernels) = Kernels \times X \xrightarrow{reshape} [out, outw, outh]$
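A quick numerical check of this im2col identity, assuming stride 1 and no padding (the tensor names are just for illustration):

```python
import torch
import torch.nn.functional as F

c, kw, kh, out = 3, 3, 3, 8
x = torch.randn(1, c, 16, 16)                  # [batch, channel, H, W]
kernels = torch.randn(out, c, kw, kh)

# reorder X into columns: [c*kw*kh, outw*outh]
cols = F.unfold(x, kernel_size=(kw, kh))       # [1, c*kw*kh, 14*14]

# reshape the kernels to [out, c*kw*kh], then a plain matmul
flat_k = kernels.reshape(out, -1)
y_matmul = (flat_k @ cols[0]).reshape(1, out, 14, 14)

y_conv = F.conv2d(x, kernels)                  # direct convolution
print(torch.allclose(y_matmul, y_conv, atol=1e-4))   # True (up to numerical tolerance)
```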

With that view, we can write down the conventional LoRA for a conv layer $Conv(in, out, ksize, padding, stride)$:

$\xrightarrow{}Conv(dim, out, 1)\circ Conv(in, dim, ksize, padding, stride)$

In this method, we can get that $W' = Wa \cdot Wb$ with $rank(W') \le dim$
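A sketch of this conventional conv LoRA, a frozen base conv plus a `Conv(in, dim, ksize)` → `Conv(dim, out, 1)` delta path (class and argument names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LoRAConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, ksize, dim=4, stride=1, padding=0):
        super().__init__()
        self.base = nn.Conv2d(in_ch, out_ch, ksize, stride, padding, bias=False)
        self.base.weight.requires_grad_(False)              # frozen W
        # down: does the spatial work, up: 1x1 projection back to out channels
        self.down = nn.Conv2d(in_ch, dim, ksize, stride, padding, bias=False)
        self.up = nn.Conv2d(dim, out_ch, 1, bias=False)
        nn.init.zeros_(self.up.weight)                       # start with a zero delta

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))
```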


Hadamard Product

Ref

Consider $W' = Wa \odot Wb$; we get $rank(W') \le rank(Wa) \times rank(Wb)$. We then apply the conventional low-rank method to $Wa$ and $Wb$ themselves, which means we can use 2x the dim to get a squared rank.
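A minimal sketch of building this $W'$ from two low-rank factor pairs (names such as `HadamardDelta` are illustrative, not the library's exact implementation):

```python
import torch
import torch.nn as nn

class HadamardDelta(nn.Module):
    """W' = (Wa1 @ Wa2) * (Wb1 @ Wb2), each factor pair has rank <= dim."""
    def __init__(self, out_f, in_f, dim=4):
        super().__init__()
        self.Wa1 = nn.Parameter(torch.randn(out_f, dim) * 0.01)
        self.Wa2 = nn.Parameter(torch.randn(dim, in_f) * 0.01)
        self.Wb1 = nn.Parameter(torch.randn(out_f, dim) * 0.01)
        self.Wb2 = nn.Parameter(torch.randn(dim, in_f) * 0.01)

    def delta_weight(self):
        # element-wise product of two rank-<=dim matrices: rank(W') <= dim**2
        return (self.Wa1 @ self.Wa2) * (self.Wb1 @ self.Wb2)

    def forward(self, x):
        # delta output only; it would be added to the frozen layer's output
        return x @ self.delta_weight().T
```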

Rank is not the same as information capacity, but they are related.

Based on the experimental results in the paper, although $rank(Wa) \times rank(Wb)$ is only an upper bound, in practice the produced $W'$ almost always reaches $rank(W') = rank(Wa) \times rank(Wb)$.
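A quick numerical illustration of that observation with random factors (not the paper's experiment; random matrices hit the generic case):

```python
import torch

out_f, in_f, dim = 64, 64, 4
Wa = torch.randn(out_f, dim) @ torch.randn(dim, in_f)    # rank <= 4
Wb = torch.randn(out_f, dim) @ torch.randn(dim, in_f)    # rank <= 4
W_prime = Wa * Wb                                        # Hadamard product

print(torch.linalg.matrix_rank(Wa).item())       # 4
print(torch.linalg.matrix_rank(Wb).item())       # 4
print(torch.linalg.matrix_rank(W_prime).item())  # typically 16 = 4 * 4
```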

Why custom backward

With $W' = (Wa_1 \cdot Wa_2) \odot (Wb_1 \cdot Wb_2)$, when you compute the backpropagation you need $\Delta{W'}$ and $Wa$ to calculate $\Delta{Wb}$, and likewise $Wb$ for $\Delta{Wa}$.

With PyTorch's autograd, this kind of operation will cache $Wa$ and $Wb$ for the backward pass, which means it caches 2x the size of the weight just for backward.

To avoid this, I implemented a custom backward that reconstructs $Wa$ and $Wb$ only when they are actually needed, which saves a lot of memory.
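A simplified sketch of that idea with `torch.autograd.Function`: save only the small factors in forward and rebuild the full-size products inside backward (this is a reduced illustration, not the repository's actual implementation):

```python
import torch

class HadamardWeight(torch.autograd.Function):
    """W' = (Wa1 @ Wa2) * (Wb1 @ Wb2), saving only the small factors."""

    @staticmethod
    def forward(ctx, Wa1, Wa2, Wb1, Wb2):
        # save the small factors instead of the two full-size products
        ctx.save_for_backward(Wa1, Wa2, Wb1, Wb2)
        return (Wa1 @ Wa2) * (Wb1 @ Wb2)

    @staticmethod
    def backward(ctx, grad_out):
        Wa1, Wa2, Wb1, Wb2 = ctx.saved_tensors
        # reconstruct the full products only now, when they are actually needed
        A = Wa1 @ Wa2
        B = Wb1 @ Wb2
        grad_A = grad_out * B          # dW'/dA is B (element-wise)
        grad_B = grad_out * A          # dW'/dB is A (element-wise)
        return grad_A @ Wa2.T, Wa1.T @ grad_A, grad_B @ Wb2.T, Wb1.T @ grad_B
```

Calling `HadamardWeight.apply(Wa1, Wa2, Wb1, Wb2)` then trades a little recomputation in backward for not keeping the two full-size products alive between forward and backward.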

CP-Decomposition

Ref

As mentioned before, the weight shape for a convolution layer is $[out, in, kw, kh]$, and we simply unfold it to $[out, in \times kw \times kh]$ for decomposition.

But there is actually a method to decompose a tensor of any shape, called CP decomposition.

Using CP decomposition in convolution will be something like:

$\tau: [dim, dim, kw, kh]$
$x_1: [dim, out]$
$x_2: [dim, in]$
$W' = \tau \times_1 x_1 \times_2 x_2$
$W': [out, in, kw, kh]$
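A sketch of reconstructing $W'$ from these factors with `torch.einsum`, contracting the two `dim` axes of $\tau$ against $x_1$ and $x_2$ (shapes follow the definitions above; variable names are illustrative):

```python
import torch

dim, out_ch, in_ch, kw, kh = 4, 16, 8, 3, 3
tau = torch.randn(dim, dim, kw, kh)   # core tensor
x1 = torch.randn(dim, out_ch)         # mode-1 factor
x2 = torch.randn(dim, in_ch)          # mode-2 factor

# W' = tau x_1 x1 x_2 x2  ->  [out, in, kw, kh]
W_prime = torch.einsum('ijkl,io,jn->onkl', tau, x1, x2)
print(W_prime.shape)                  # torch.Size([16, 8, 3, 3])
```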

Or write this as a stack of conv layers:

Conv(in, dim, (1, 1))

Conv(dim, dim, (kw, kh), stride, padding)

Conv(dim, out, (1, 1))

For the Hadamard product implementation, just use two different $W'$ of this form and multiply them together element-wise.


Kronecker Product

Definition

If $W_1$ is an $a \times b$ matrix and $W_2$ is a $c \times d$ matrix, then the Kronecker product of the two matrices,

$W' = W_1 \otimes W_2$, is an $ac \times bd$ matrix.

Viewed as matrices, $W_2$ acts as the weight and $W_1$ acts as a scale on each copy of $W_2$.
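A quick shape check with `torch.kron`:

```python
import torch

a, b, c, d = 2, 3, 4, 5
W1 = torch.randn(a, b)
W2 = torch.randn(c, d)

W_prime = torch.kron(W1, W2)   # every entry W1[i, j] scales a full copy of W2
print(W_prime.shape)           # torch.Size([8, 15])  ->  (a*c, b*d)
```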

About rank

We can decompose $W_2$ using LoRA with rank $r$:

$W_2 = Wa_2 \cdot Wb_2$ then $W' = W_1 \otimes (Wa_2 \cdot Wb_2)$

We can get $rank(W') \le rank(W_1) \times rank(Wa_2 \cdot Wb_2)$, with $rank(W_1) \le min(a, b)$ and $rank(Wa_2 \cdot Wb_2) \le r$

=> $rank(W') \le min(a, b) \times r$

Remember that $min(a, b) \times r$ is only an upper bound; $rank(W') = min(a, b) \times r$ is not guaranteed and needs to be checked experimentally.
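As a small numerical illustration with random factors (for the Kronecker product itself, $rank(W_1 \otimes W_2) = rank(W_1) \cdot rank(W_2)$ holds exactly; whether the learned factors actually reach full rank is the experimental question):

```python
import torch

a, b, c, d, r = 8, 8, 16, 16, 2
W1 = torch.randn(a, b)                              # generically full rank: min(a, b) = 8
Wa2 = torch.randn(c, r)
Wb2 = torch.randn(r, d)
W_prime = torch.kron(W1, Wa2 @ Wb2)

print(torch.linalg.matrix_rank(W1).item())          # 8
print(torch.linalg.matrix_rank(Wa2 @ Wb2).item())   # 2
print(torch.linalg.matrix_rank(W_prime).item())     # 16 = 8 * 2
```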

Number of parameters

We decompose the matrix as $W' = W_1 \otimes (Wa_2 \cdot Wb_2)$.

(# of parameters) $= (a \times b) + (c \times r + r \times d) = a \times b + r \times (c + d)$, where the full $W'$ is $m \times n$ with $m = ac$ and $n = bd$.

and suppose the best case $a = c = \sqrt{m}$, $b = d = \sqrt{n}$;

then, (# of parameters) = $\sqrt{mn} + r \times (\sqrt{m} + \sqrt{n})$

We can thus reduce the number of parameters down to the order of the square root of the full size at best.
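A concrete count with hypothetical sizes, just to show the order of magnitude:

```python
import math

m, n, r = 4096, 4096, 8
full = m * n                                                      # 16_777_216 params for a full W'
lora = r * (m + n)                                                # 65_536 for plain LoRA
kron = math.isqrt(m * n) + r * (math.isqrt(m) + math.isqrt(n))    # 4096 + 1024 = 5120

print(full, lora, kron)
```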


Sparse Bias

Todo...