Quantisation and Pruning Support #76
@karanchahal this sounds great. Let's add both, and we can use the official PyTorch version when it's ready! The first as a trainer option, the second after training as a method that can be called on the Module:
trainer.fit(model)
model.quantize(bits=8)
@karanchahal submit a PR and we can walk through the implementation! |
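A hedged sketch of what the two proposed entry points might look like from the user's side; `MyLightningModule`, the `quantize_bits` trainer flag, and the `model.quantize` method are hypothetical names for illustration only, since neither option exists in Lightning yet:

```python
import pytorch_lightning as pl

model = MyLightningModule()  # hypothetical user-defined LightningModule

# Option 1: quantisation-aware training driven by a (hypothetical) trainer flag
trainer = pl.Trainer(quantize_bits=8)
trainer.fit(model)

# Option 2: ordinary training followed by post-training quantisation on the module
trainer = pl.Trainer()
trainer.fit(model)
model.quantize(bits=8)  # hypothetical method
```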
@karanchahal can you please check the link you provided for the pruning notebook? I think it's the same link as the quantization notebook. |
Hello,
This conversation between me and Tim Dettmers might interest you; it covers the
challenges of attaining real-world speed-ups with sparse weights.
TimDettmers/sparse_learning#1
My apologies for the wrong link; I'll update it soon and let you know.
Best,
Karan
…On Tue, Aug 13, 2019, 20:09 Shivam Saboo ***@***.***> wrote:
Also, regarding the implementation of neural network pruning, I found that
masking the weights we need to prune is very simple to implement, but if we
still keep the weight tensors in the same datatype as before, we still have to
do the entire matrix multiplication. While multiplications with zeros take less
time, I believe this is really inefficient when you prune 90% of the weights
but still have to do the full matrix multiplication. Are you familiar with a
way to handle sparse weights more efficiently in PyTorch, or some other way to
re-structure the network based on the pruned weights (assuming unstructured
pruning)?
|
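To make the concern above concrete, here is a minimal sketch of the simple masking approach; the 90% magnitude threshold and the layer size are arbitrary illustrative choices, and the point is that the masked layer still performs a full dense matmul, so zeroing the weights gives no speed-up by itself.

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)

# build a mask that keeps only the ~10% largest-magnitude weights
k = int(0.9 * layer.weight.numel())
threshold = layer.weight.abs().flatten().kthvalue(k).values
mask = (layer.weight.abs() > threshold).float()

with torch.no_grad():
    layer.weight.mul_(mask)  # pruned weights are now exactly zero

x = torch.randn(32, 512)
y = layer(x)  # still a full dense (32, 512) x (512, 512) matmul -- no speed-up from the zeros
```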
Thanks for the reply! I wasn't aware of so many of the challenges of working with sparse tensors. |
Hey, sure, I was quite interested in this actually. Some great work has been
done on fast sparse kernels (link
<https://openreview.net/forum?id=rJPcZ3txx>, link
<https://arxiv.org/abs/1702.08597>, link <https://arxiv.org/abs/1802.10280>),
but it's certainly an area of active research.
I haven't read these papers yet, but I've heard they're a good place to start.
Shall we read them and then report back here with what we've learnt?
Best,
Karanbir Chahal
…On Wed, Aug 14, 2019, 14:19 Shivam Saboo ***@***.***> wrote:
I was really interested in implementing custom layers in PyTorch for
inference only, once we have all the boolean masks. Would you be interested
in collaborating on implementing such layers? Perhaps we can start
specifically with linear layers and then extend to other types of layers.
|
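As one possible starting point for that collaboration, here is a rough sketch of an inference-only linear layer backed by PyTorch's sparse COO tensors; the `SparseLinear` name, the random 90% mask, and the layer size are illustrative assumptions rather than an agreed design:

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    """Inference-only linear layer that stores a pruned weight as a sparse COO tensor."""

    def __init__(self, dense_linear, mask):
        super().__init__()
        pruned = (dense_linear.weight * mask).detach()
        self.weight = pruned.to_sparse()  # COO sparse tensor, shape (out_features, in_features)
        self.bias = dense_linear.bias.detach() if dense_linear.bias is not None else None

    def forward(self, x):
        # torch.sparse.mm expects the sparse operand first: (out, in) @ (in, batch)
        out = torch.sparse.mm(self.weight, x.t()).t()
        return out + self.bias if self.bias is not None else out

dense = nn.Linear(512, 512)
mask = (torch.rand_like(dense.weight) > 0.9).float()  # keep roughly 10% of the weights
sparse_layer = SparseLinear(dense, mask)
y = sparse_layer(torch.randn(32, 512))
```

Whether this actually beats the dense layer will depend heavily on the sparsity level and the backend, which is exactly the open question in this thread.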
Great! Will start reading these papers. |
I read through the ICLR 17 paper and implemented their algorithm in Python (link to colab). It is not the most efficient implementation, as I used Python loops, but the key takeaway is the speed increase as sparsity in the weights increases, whereas PyTorch's conv2d needs almost the same time at all sparsity levels (even with all-zero weights). I will perhaps try to implement the algorithm using PyTorch's C++ extension functionality (haven't worked with it before), but before that I need to figure out how to use CSR sparse matrices in PyTorch (currently I am using scipy). If you have any suggestions please let me know! |
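For anyone following along, here is a rough illustration of one way to mix scipy's CSR format with PyTorch tensors for a pruned convolution: lower the input with im2col (F.unfold) and multiply by the pruned kernel stored once as a CSR matrix. This is a simplified stand-in, not the direct sparse convolution algorithm from the paper, and the function name and shapes are arbitrary:

```python
import torch
import torch.nn.functional as F
from scipy import sparse

def sparse_conv2d(x, weight, stride=1, padding=0):
    # x: (N, C, H, W); weight: (O, C, kH, kW) with many zeros from pruning
    N, C, H, W = x.shape
    O, _, kH, kW = weight.shape
    out_h = (H + 2 * padding - kH) // stride + 1
    out_w = (W + 2 * padding - kW) // stride + 1
    # im2col: each column is one receptive field, shape (N, C*kH*kW, out_h*out_w)
    cols = F.unfold(x, (kH, kW), padding=padding, stride=stride)
    # store the pruned kernel once as a CSR matrix of shape (O, C*kH*kW)
    w_csr = sparse.csr_matrix(weight.detach().reshape(O, -1).numpy())
    outs = []
    for n in range(N):
        outs.append(torch.from_numpy(w_csr @ cols[n].numpy()))  # (O, out_h*out_w)
    return torch.stack(outs).reshape(N, O, out_h, out_w)

# toy usage: a 90%-sparse 3x3 kernel
x = torch.randn(2, 8, 16, 16)
w = torch.randn(16, 8, 3, 3)
w[torch.rand_like(w) < 0.9] = 0.0
y = sparse_conv2d(x, w, padding=1)  # (2, 16, 16, 16)
```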
This is pretty interesting! Great work!
I see you're using numba to run it on the GPU, if I'm not mistaken.
I wonder whether numba converts the Python loops into C/C++; if not, using C++
extensions might be a worthwhile exercise.
I was also wondering whether combining Cython with numba would be the easier
way to go for that?
The speed increase is definitely encouraging. I think tuning this
implementation could get us below 4 ms. By the way, what do the PyTorch people
use for conv2d, is it plain im2col or something fancy like a Winograd
algorithm? Mostly I feel they must have really optimised the loading and
unloading of data to and from the GPU. We'll have a tough time beating cuDNN's
super-optimised implementation, but it's definitely worth trying!
I've been travelling a lot this week and have been unable to read the papers
or code :/ I'll try to read up soon and study your implementation.
On another note, the good news is that I've almost got quant-aware training
working (inference in 4 bits!).
Apologies again for the late response :)
Best,
Karanbir Chahal
|
I used the numba jit decorator for the sparse convolution function, and it runs on the CPU (implemented using scipy sparse arrays). I expected it to convert the Python loops to C++, but when I use nopython=True to compile the entire function I get an error because numba cannot recognise the scipy sparse matrix format and treats it as a regular Python object. I also think I should first try to make the implementation work with Cython and numba before a C++ implementation. Regarding PyTorch's conv, I think it uses im2col, but I'm not sure. I also think that if we can somehow implement this paper's algorithm using torch's built-in functions and/or optimise the loops, we can get a faster layer. Will try out a few things this weekend and let you know if I get any improvements. |
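A common workaround for that nopython error, sketched below under the assumption that the weights live in a scipy CSR matrix: pass the matrix's raw data/indices/indptr numpy arrays into the jitted function instead of the sparse object itself, since numba's nopython mode handles plain numpy arrays. The function name and sizes are illustrative.

```python
import numpy as np
from numba import njit
from scipy import sparse

@njit
def csr_row_dot(data, indices, indptr, row, dense_vec):
    # dot product of one CSR row with a dense vector, written against the raw
    # CSR arrays so that nopython mode can compile the loop
    acc = 0.0
    for k in range(indptr[row], indptr[row + 1]):
        acc += data[k] * dense_vec[indices[k]]
    return acc

w = sparse.random(64, 64, density=0.1, format="csr")  # toy 90%-sparse weight
x = np.random.randn(64)
out = np.array([csr_row_dot(w.data, w.indices, w.indptr, r, x) for r in range(w.shape[0])])
```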
Ahh okay, well PyTorch has TorchScript, which we can try as well.
It uses a JIT too and applies its optimisations to PyTorch tensors, but I
don't know if it's possible to get it working with the scipy sparse format.
Can we use the sparse tensor format (COO) instead of the one scipy uses?
Thanks again for this great work!
Best,
Karan
|
The paper actually mentions using the CSR format, as row slicing is very fast. Not sure if the COO format would be as efficient, but we can try. Converting from COO to CSR should also be possible (though I'm not sure how) with a small computational overhead. |
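For reference, scipy makes the COO-to-CSR conversion a one-liner, so one possible path is to keep COO tensors on the PyTorch side and convert to CSR only where fast row slicing is needed. A tiny sketch with an arbitrary 90%-sparse matrix:

```python
import numpy as np
from scipy import sparse

# a toy pruned weight matrix with ~90% zeros
w = np.random.randn(128, 128)
w[np.random.rand(128, 128) < 0.9] = 0.0

w_coo = sparse.coo_matrix(w)   # same layout idea as torch's sparse COO tensors
w_csr = w_coo.tocsr()          # cheap conversion; enables fast row slicing
row = w_csr.getrow(3)          # the access pattern the paper relies on
```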
super excited about this feature! |
@karanchahal @williamFalcon I ported the pure Python code to Cython and got significant speedups.
Cython optimized:
For reference, PyTorch's conv2d took 1.9 ms on my machine (CPU). (Previous results were on Colab (CPU).)
Google Drive link to the .pyx and ipynb files:
Link to the compiled C file:
I am looking at more ways to optimize the Cython code now. |
@sidhanthholalkere @karanchahal spoke with @soumith about this. I think this is better added to core PyTorch. Check out this issue. Once it's merged and live there we can do whatever we need to do to support it. Closing to move this work to the PyTorch issue. |
Note that we have a notebook with a preview tutorial on eager mode post training quantization in core pytorch over in pytorch/pytorch#18318 ... please check it out and leave feedback. |
Is your feature request related to a problem? Please describe.
Nowadays, there is a need to take trained floating-point models and deploy them to edge devices. One popular approach is to quantise the weights and activations of a neural network to a lower bit width (e.g. 8 bits or even 4 bits). The benefits of this are twofold: the model becomes much smaller in memory, and inference with low-bit integer arithmetic is faster and more power-efficient.
People have tried other means to compress a model; one of them is pruning.
Pruning basically means that some of the weights of a neural network are set to zero; hence we seek to introduce sparsity in the network.
The benefit is that you potentially do not have to perform the useless multiplications with zeros, providing a potential computation saving. Research has shown that even after pruning ~80% of weights (fine-grained pruning), the network preserves its accuracy. This is a very surprising result. Coarse-grained pruning (setting all weights of a channel to zero) also works to an extent but results in significantly more accuracy loss. This is an active research area.
Describe the solution you'd like
Generally, quantisation works through the use of a scale value and a zero-point value, so each quantised tensor needs to carry the quantised values along with its scale and zero point. The scale and zero point are needed to convert between quantised and dequantised tensors.
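A minimal sketch of that idea, assuming the standard asymmetric (affine) per-tensor scheme with min/max calibration; the function names are illustrative and this is not necessarily the exact algorithm used in the notebook:

```python
import torch

def quantize(x, num_bits=8):
    # map the float range [x.min(), x.max()] onto the integer grid [0, 2**num_bits - 1]
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int((qmin - x.min() / scale).round().clamp(qmin, qmax))
    q = (x / scale + zero_point).round().clamp(qmin, qmax)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

x = torch.randn(4, 4)
q, s, zp = quantize(x)
x_hat = dequantize(q, s, zp)  # x_hat approximates x up to the quantisation error
```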
There are two ways to quantise a model: post-training quantisation and quantisation-aware training.
I have successfully implemented the post-training quantisation algorithms and was able to get a quantised MNIST model down to 8 bits with next to no accuracy loss. Going down to 4 bits resulted in the model diverging. I am currently working on quant-aware training. If you want to see how post-training quantisation works, please check out this Google Colab notebook.
Now, let's come to pruning:
Pruning is a very general thing; there could be a lot of ways to perform it. As far as I know, there is generally a "pruning schedule": the researcher decides when to prune and what percentage of weights to prune (i.e. the degree of sparsity of the layer). They could prune some layers and leave others as they are, and slowly increase the sparsity degree of the pruned layers over the course of training. There are also different types of pruning: a structured way to prune weights (e.g. removing full channels of a conv kernel or reducing a dimension of a fully connected layer by 1) or an unstructured way to prune (zeroing out individual weights anywhere in the tensor).
Lightning could potentially offer both structured and unstructured ways to prune, to help out researchers. If you would like to see pruning in action, I have tried pruning on an MNIST model using the algorithm from the Google paper "To Prune or not to Prune". It is unstructured pruning with 90% sparsity, and I was able to get roughly the same accuracy as the un-pruned model (a sketch of the schedule is below). This is the Google Colab link for it.
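A minimal sketch of the two ingredients described above, assuming magnitude-based unstructured pruning and the gradual cubic sparsity schedule from "To Prune or not to Prune" (Zhu & Gupta, 2017); the function names and schedule parameters are illustrative:

```python
import torch

def magnitude_mask(weight, sparsity):
    # unstructured pruning: zero out the smallest-magnitude fraction of weights
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

def gradual_sparsity(step, s_init, s_final, t0, n, dt):
    # cubic schedule: sparsity ramps from s_init to s_final over n pruning steps of length dt
    t = min(max(step, t0), t0 + n * dt)
    return s_final + (s_init - s_final) * (1 - (t - t0) / (n * dt)) ** 3

w = torch.randn(256, 256)
for step in range(0, 10000, 1000):
    s = gradual_sparsity(step, s_init=0.0, s_final=0.9, t0=0, n=10, dt=1000)
    w = w * magnitude_mask(w, s)  # in practice this would be the layer's weight during training
```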
Describe alternatives you've considered
Right now PyTorch doesn't have quantisation and pruning support; however, that is in the works. We could either wait for them to complete their work, or we could implement a small library ourselves.
The use case I was trying to target is that Lightning could become a playground where researchers can test out quantisation and pruning on their models, and potentially implement novel algorithms through its base support.
Additional context
If any of you want to learn more about quantisation, I have listed the resources I learnt from below. They were indeed invaluable.
Benoit Jacob et al.’s Quantisation Paper (Google)
Raghuraman’s Paper on Quantisation (Google, he’s now at Facebook)
Distiller Docs on Quantisation
Gemmlowp’s Quantisation Tutorial