
Quantisation and Pruning Support #76

Closed · karanchahal opened this issue Aug 8, 2019 · 16 comments
Labels: feature (Is an improvement or enhancement) · help wanted (Open to be worked on)

@karanchahal

Is your feature request related to a problem? Please describe.
Nowadays, there is a need to take floating point models that have been trained and deploy them to edge devices. One popular approach is to quantise the weights and activations of a neural network to a lower bit width (e.g. 8 bits or even 4 bits). The benefits of this are twofold:

  1. Some accelerators perform computation at lower bit widths much faster than fp16 or fp32 computation.
  2. The model takes less space, and the savings increase by a substantial factor every time we reduce a bit from the tensor data type.

People have tried other means to compress a model; one of them is pruning.
Pruning means setting some of the weights of a neural network to zero, i.e. introducing sparsity into the network.

The benefit is that you potentially do not have to perform the useless multiplications with zeros, which offers a potential computation saving. Research has shown that even after pruning ~80% of the weights (fine-grained pruning), the network preserves its accuracy, which is a very surprising result. Coarse-grained pruning (setting all weights of a channel to zero) also works to an extent, but results in significantly more accuracy loss. This is an active research area.

Describe the solution you'd like
Generally, quantisation works through a scale value and a zero point value, so each quantised tensor needs to carry the quantised data, its scale and its zero point. The scale and zero point are needed to convert between the quantised and dequantised representations; a small numeric sketch follows the list below.

There are two ways to quantise a model:

  1. Post-training quantisation: quantises an already-trained model, no retraining required (works well down to 8 bits).
  2. Quantisation-aware training: a way to train a model to be robust to quantisation (works well for aggressive quantisation schemes, down to 4 bits).
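To make the scale / zero-point arithmetic concrete, here is a minimal sketch of affine post-training quantisation of a single tensor (the function names and the simple min/max calibration are my own illustration, not the notebook's code or a proposed Lightning API):

```python
import torch

def quantise_tensor(x, num_bits=8):
    """Affine quantisation: map float values to integers via a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(qmin - x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantise_tensor(q, scale, zero_point):
    """Recover an approximate float tensor from the quantised one."""
    return scale * (q.float() - zero_point)

x = torch.randn(4, 4)
q, scale, zp = quantise_tensor(x, num_bits=8)
print((x - dequantise_tensor(q, scale, zp)).abs().max())  # small quantisation error
```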

I have successfully implemented the post-training quantisation algorithms and was able to get a quantised MNIST model down to 8 bits with next to no accuracy loss. Going down to 4 bits resulted in the model diverging. I am currently working on quantisation-aware training. If you want to see how post-training quantisation works, please check out this Google Colab notebook.

Now, let's come to pruning:

Pruning is a very general technique and there are many ways to perform it. As far as I know, there is generally a "pruning schedule": the researcher decides when to prune and what percentage of weights to prune (i.e. the degree of sparsity of the layer). They might prune some layers and leave others as is, and slowly increase the sparsity degree of the pruned layers over the course of training. There are also different types of pruning: structured pruning (e.g. removing full channels of a conv kernel, or reducing a dimension of a fully connected layer by one) and unstructured pruning (zeroing out individual weights).
Lightning could potentially offer both a structured and an unstructured way to prune to help out researchers. If you would like to see pruning in action, I have tried pruning an MNIST model using the algorithm from the Google paper "To Prune, or Not to Prune". It is unstructured pruning with 90% sparsity, and I was able to reach roughly the same accuracy as the un-pruned model. This is the Google Colab link for it; a rough sketch of the unstructured case follows below.
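For the unstructured case, a minimal sketch of what magnitude-based masking could look like (purely illustrative; `magnitude_prune_` is a hypothetical helper, not a Lightning or PyTorch API):

```python
import torch
import torch.nn as nn

def magnitude_prune_(module, sparsity=0.9):
    """Zero out the `sparsity` fraction of smallest-magnitude weights (in place)."""
    w = module.weight.data
    k = int(sparsity * w.numel())
    if k == 0:
        return torch.ones_like(w)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    w.mul_(mask)   # pruned weights are now exactly zero
    return mask    # keep the mask to re-apply after each optimiser step

layer = nn.Linear(784, 100)
mask = magnitude_prune_(layer, sparsity=0.9)
print(1.0 - mask.mean().item())  # achieved sparsity, ~0.9
```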

Describe alternatives you've considered
Right now PyTorch doesn't have quantisation and pruning support; however, that is in the works. We could either wait for them to complete their work or implement a small library ourselves.

The use case I am trying to target is that Lightning could become a playground where researchers can test out quantisation and pruning on their models, and potentially implement novel algorithms on top of its base support.

Additional context
If any of you want to learn more about quantisation, I have listed the resources I learnt from below. They were invaluable.

Benoit Jacob et al.'s Quantisation Paper (Google)
Raghuraman’s Paper on Quantisation (Google, he’s now at Facebook)
Distiller Docs on Quantisation
Gemmlowp’s Quantisation Tutorial

karanchahal added the labels feature (Is an improvement or enhancement) and help wanted (Open to be worked on) on Aug 8, 2019
karanchahal changed the title from "Post Training Quantisation and Quantisation Aware Training Support" to "Quantisation and Pruning Support" on Aug 8, 2019
@williamFalcon
Contributor

@karanchahal this sounds great. let's add both and we can use the official PyTorch version when it's ready!

The first one as a trainer option:
Trainer(quantize_bits=4)

The second as a method that can be called on the Module after training:

trainer.fit(model)

model.quantize(bits=8)

@karanchahal submit a PR and we can walk through the implementation!

@shivamsaboo17

shivamsaboo17 commented Aug 13, 2019

@karanchahal can you please check the link you provided for the pruning notebook? I think it's the same link as the quantisation notebook.
Also, regarding the implementation of neural network pruning, I found that masking the weights we need to prune is very simple to implement, but if we keep the weight tensors in the same dense datatype as before, we still have to do the entire matrix multiplication. While multiplications with zeros take less time, I believe this is really inefficient when you prune 90% of the weights but still have to do the full matrix multiplication. Are you familiar with a way to handle sparse weights more efficiently in PyTorch, or with some other way to re-structure the network based on the pruned weights (assuming unstructured pruning)?

@karanchahal
Author

karanchahal commented Aug 14, 2019 via email

@shivamsaboo17

shivamsaboo17 commented Aug 14, 2019

Thanks for the reply! I too was unaware of so many of the challenges of working with sparse tensors.
But I am really interested in implementing custom layers in PyTorch for inference only (writing just the forward pass, perhaps using the torch.sparse API) once we have all the boolean masks. Would you be interested in collaborating on implementing such layers? Perhaps we can start specifically with linear layers and then extend to other types of layers.
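For what it's worth, a rough inference-only sketch along those lines, assuming the dense weight already has its pruned entries zeroed (the `SparseLinear` class is hypothetical, and torch.sparse kernels may well not beat the dense matmul here):

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    """Inference-only linear layer whose weight is stored as a sparse COO tensor."""
    def __init__(self, dense_weight, bias=None):
        super().__init__()
        # dense_weight: (out_features, in_features) with pruned entries equal to zero
        self.weight = dense_weight.to_sparse()
        self.bias = bias

    def forward(self, x):
        # torch.sparse.mm(sparse, dense): (out, in) @ (in, batch) -> (out, batch)
        out = torch.sparse.mm(self.weight, x.t()).t()
        return out if self.bias is None else out + self.bias

dense = nn.Linear(784, 100)
dense.weight.data.mul_((dense.weight.data.abs() > 0.02).float())  # crude pruning
sparse_layer = SparseLinear(dense.weight.data, dense.bias.data)
print(sparse_layer(torch.randn(32, 784)).shape)  # torch.Size([32, 100])
```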

@karanchahal
Author

karanchahal commented Aug 14, 2019 via email

@shivamsaboo17

Great! Will start reading these papers.

@shivamsaboo17

I read through the ICLR '17 paper and implemented their algorithm in Python (link to colab). It is not the most efficient implementation, as I used Python loops, but the key takeaway is that the speed increases as the sparsity of the weights increases, whereas PyTorch's conv2d takes almost the same time at all sparsity levels (even with all-zero weights). I will perhaps try to implement the algorithm using PyTorch's C++ extension functionality (I haven't worked with it before), but first I need to figure out how to use CSR sparse matrices in PyTorch (currently I am using scipy).
If you have any suggestions, please let me know!
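Not the paper's direct sparse convolution, but a possible baseline sketch of the same idea with scipy's CSR: unfold the input im2col-style and let the CSR matmul skip the zero filter weights (the helper below is my own illustration, not the Colab code):

```python
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix

def sparse_conv2d(x, weight, stride=1, padding=0):
    """Rough sketch: im2col + CSR matmul so only nonzero filter weights do work.

    x:      (1, C_in, H, W) torch tensor
    weight: (C_out, C_in, kH, kW) torch tensor with pruned entries set to zero
    """
    c_out, c_in, kh, kw = weight.shape
    # Unfold the input into columns: (C_in*kH*kW, L), L = number of output positions
    cols = F.unfold(x, kernel_size=(kh, kw), stride=stride, padding=padding)[0].numpy()
    # Flatten the filters and store them in CSR form; zeros are skipped in the matmul
    w_csr = csr_matrix(weight.reshape(c_out, -1).numpy())
    out = w_csr @ cols                                    # (C_out, L)
    h_out = (x.shape[2] + 2 * padding - kh) // stride + 1
    w_out = (x.shape[3] + 2 * padding - kw) // stride + 1
    return torch.from_numpy(out).reshape(1, c_out, h_out, w_out)

x = torch.randn(1, 3, 64, 64)
w = torch.randn(256, 3, 3, 3)
w.mul_((w.abs() > 1.0).float())                 # ~68% of the weights set to zero
print(sparse_conv2d(x, w, padding=1).shape)     # torch.Size([1, 256, 64, 64])
```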

@karanchahal
Author

karanchahal commented Aug 16, 2019 via email

@shivamsaboo17

I used the numba jit decorator for the sparse convolution function and it runs on CPU (implemented using scipy sparse arrays). I expected it to compile the Python loops down to native code, but when I use nopython=True to compile the entire function I get an error, because numba cannot recognise the scipy sparse matrix format and treats it as a regular Python object.
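One possible workaround, just a guess based on the error described (untested against the notebook): pass the raw CSR buffers (data, indices, indptr) into the jitted function instead of the scipy object, since nopython mode handles plain NumPy arrays:

```python
import numpy as np
from numba import njit
import scipy.sparse as sp

@njit
def csr_row_dot(data, indices, indptr, row, dense_vec):
    # Dot product of one CSR row with a dense vector, touching only the nonzeros.
    acc = 0.0
    for k in range(indptr[row], indptr[row + 1]):
        acc += data[k] * dense_vec[indices[k]]
    return acc

w = sp.random(256, 27, density=0.1, format="csr")   # sparse flattened filters
x = np.random.rand(27)
print(csr_row_dot(w.data, w.indices, w.indptr, 0, x))
```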

I also think I should first try to make the implementation work with Cython and numba before a C++ implementation.

Regarding PyTorch's conv, I think it uses im2col, but I'm not sure. I also think that if we can somehow implement this paper's algorithm using torch's built-in functions and/or optimise the loops, we can get a faster layer.

Will try out a few things this weekend and let you know if I get any improvements.

@karanchahal
Author

karanchahal commented Aug 16, 2019 via email

@shivamsaboo17

The paper actually mentions using the CSR format, as row slicing is very fast. I'm not sure the COO format would be as efficient, but we can try. Converting from COO to CSR should be possible (though I'm not sure exactly how) with a small computational overhead.
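For the scipy path at least, the COO-to-CSR conversion is a one-liner with modest overhead (a small illustration, assuming we stay in scipy on the CPU):

```python
from scipy.sparse import coo_matrix

# .tocsr() sorts and compresses the row indices internally; afterwards row
# slicing (csr[i]) is cheap, which is what the paper's algorithm relies on.
coo = coo_matrix(([1.0, 2.0, 3.0], ([0, 2, 2], [1, 0, 2])), shape=(3, 3))
csr = coo.tocsr()
print(csr[2].toarray())  # [[2. 0. 3.]]
```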

@williamFalcon
Contributor

@shivamsaboo17 @karanchahal https://gitter.im/PyTorch-Lightning/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge

@williamFalcon
Contributor

super excited about this feature!

@shivamsaboo17

shivamsaboo17 commented Aug 21, 2019

@karanchahal @williamFalcon I ported the pure Python code to Cython and got significant speedups.
My experiments are on a 3x64x64 input tensor with filters of size 256x3x3x3.

Pure Python:
50% sparse --> 45 seconds
90% sparse --> 11 seconds
100% sparse --> 60 ms

Cython optimized:
50% sparse --> 13 ms
90% sparse --> 5 ms
100% sparse --> 661 microseconds

For reference: PyTorch's conv2d took 1.9 ms on my machine (CPU). (The previous results were on Colab's CPU.)

google drive link to .pyx and ipynb file:
https://drive.google.com/open?id=1gnrbFNWJBZbyPH6KKnCLmrPBNqOFtKUD
https://drive.google.com/open?id=1--_B89H4iSZuJuj9QKqBRrB5Tlr7DMnH

Link to compiled C file:
https://drive.google.com/open?id=1nCGKRmM4AGcmepEJCkWAl_SBZc2l-rrA

I am looking at more ways to optimize cython code now.

@williamFalcon
Contributor

@sidhanthholalkere @karanchahal spoke with @soumith about this. I think this is better added to core PyTorch. Check out this issue.

Once it's merged and live there we can do whatever we need to do to support it.

Closing to move this work to the PyTorch issue.

@gottbrath

gottbrath commented Sep 18, 2019

Note that we have a notebook with a preview tutorial on eager mode post training quantization in core pytorch over in pytorch/pytorch#18318 ... please check it out and leave feedback.
