
Quantisation and Pruning Support #76

Closed · karanchahal opened this issue Aug 8, 2019 · 16 comments
Labels: feature (Is an improvement or enhancement) · help wanted (Open to be worked on)

@karanchahal

Is your feature request related to a problem? Please describe.
Nowadays, there is a need to take floating point models that have been trained and deploy them to edge devices. One popular approach is to quantise the weights and activations of a neural network to a lower bit width (e.g. 8 bits or even 4 bits). The benefits of this are twofold:

  1. Some accelerators perform computation at lower bit widths much faster than fp16 or fp32 computation.
  2. The model takes less space, and the savings increase by a substantial factor every time we reduce a bit from the tensor data type.

People have tried other means to compress a model; one of them is pruning.
Pruning means setting some of the weights of a neural network to zero, i.e. introducing sparsity into the network.

The benefit is that you potentially do not have to perform the useless multiplications with zeros, which offers a potential computation saving. Research has shown that even after pruning ~80% of the weights (fine-grained pruning), the network preserves its accuracy, which is a very surprising result. Coarse-grained pruning (setting all weights of a channel to zero) also works to an extent, but results in significantly more accuracy loss. This is an active research area.

Describe the solution you'd like
Generally, quantisation works through a scale value and a zero point value, so each quantised tensor needs to carry the quantised data, its scale and its zero point. The scale and zero point are needed to convert between the quantised and dequantised representations; a small numeric sketch follows the list below.

There are two ways to quantise a model:

  1. Post-training quantisation: quantises an already-trained model, no retraining required (works well down to 8 bits).
  2. Quantisation-aware training: a way to train a model to be robust to quantisation (works well for aggressive quantisation schemes, down to 4 bits).
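To make the scale / zero-point arithmetic concrete, here is a minimal sketch of affine post-training quantisation of a single tensor (the function names and the simple min/max calibration are my own illustration, not the notebook's code or a proposed Lightning API):

```python
import torch

def quantise_tensor(x, num_bits=8):
    """Affine quantisation: map float values to integers via a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(qmin - x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantise_tensor(q, scale, zero_point):
    """Recover an approximate float tensor from the quantised one."""
    return scale * (q.float() - zero_point)

x = torch.randn(4, 4)
q, scale, zp = quantise_tensor(x, num_bits=8)
print((x - dequantise_tensor(q, scale, zp)).abs().max())  # small quantisation error
```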

I have successfully implemented the post-training quantisation algorithms and was able to get a quantised MNIST model down to 8 bits with next to no accuracy loss. Going down to 4 bits resulted in the model diverging. I am currently working on quantisation-aware training. If you want to see how post-training quantisation works, please check out this Google Colab notebook.

Now, let's come to pruning:

Pruning is a very general technique and there are many ways to perform it. As far as I know, there is generally a "pruning schedule": the researcher decides when to prune and what percentage of weights to prune (i.e. the degree of sparsity of the layer). They might prune some layers and leave others as is, and slowly increase the sparsity degree of the pruned layers over the course of training. There are also different types of pruning: structured pruning (e.g. removing full channels of a conv kernel, or reducing a dimension of a fully connected layer by one) and unstructured pruning (zeroing out individual weights).
Lightning could potentially offer both a structured and an unstructured way to prune to help out researchers. If you would like to see pruning in action, I have tried pruning an MNIST model using the algorithm from the Google paper "To Prune, or Not to Prune". It is unstructured pruning with 90% sparsity, and I was able to reach roughly the same accuracy as the un-pruned model. This is the Google Colab link for it; a rough sketch of the unstructured case follows below.
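For the unstructured case, a minimal sketch of what magnitude-based masking could look like (purely illustrative; `magnitude_prune_` is a hypothetical helper, not a Lightning or PyTorch API):

```python
import torch
import torch.nn as nn

def magnitude_prune_(module, sparsity=0.9):
    """Zero out the `sparsity` fraction of smallest-magnitude weights (in place)."""
    w = module.weight.data
    k = int(sparsity * w.numel())
    if k == 0:
        return torch.ones_like(w)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    w.mul_(mask)   # pruned weights are now exactly zero
    return mask    # keep the mask to re-apply after each optimiser step

layer = nn.Linear(784, 100)
mask = magnitude_prune_(layer, sparsity=0.9)
print(1.0 - mask.mean().item())  # achieved sparsity, ~0.9
```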

Describe alternatives you've considered
Right now PyTorch doesn't have quantisation and pruning support; however, that is in the works. We could either wait for them to complete their work or implement a small library ourselves.

The use case I am trying to target is that Lightning could become a playground where researchers can test out quantisation and pruning on their models, and potentially implement novel algorithms on top of its base support.

Additional context
If any of you want to learn more about quantisation, I have listed the resources I learnt from below. They were invaluable.

Benoit Jacob et al.'s Quantisation Paper (Google)
Raghuraman’s Paper on Quantisation (Google, he’s now at Facebook)
Distiller Docs on Quantisation
Gemmlowp’s Quantisation Tutorial

karanchahal added the labels feature (Is an improvement or enhancement) and help wanted (Open to be worked on) on Aug 8, 2019
karanchahal changed the title from "Post Training Quantisation and Quantisation Aware Training Support" to "Quantisation and Pruning Support" on Aug 8, 2019
@williamFalcon
Contributor

@karanchahal this sounds great. let's add both and we can use the official PyTorch version when it's ready!

The first one as a trainer option:
Trainer(quantize_bits=4)

The second as a method that can be called on the Module after training:

trainer.fit(model)

model.quantize(bits=8)

@karanchahal submit a PR and we can walk through the implementation!

@shivamsaboo17

shivamsaboo17 commented Aug 13, 2019

@karanchahal can you please check the link you provided for the pruning notebook? I think it's the same link as the quantisation notebook.
Also, regarding the implementation of neural network pruning, I found that masking the weights we need to prune is very simple to implement, but if we keep the weight tensors in the same dense datatype as before, we still have to do the entire matrix multiplication. While multiplications with zeros take less time, I believe this is really inefficient when you prune 90% of the weights but still have to do the full matrix multiplication. Are you familiar with a way to handle sparse weights more efficiently in PyTorch, or with some other way to re-structure the network based on the pruned weights (assuming unstructured pruning)?

@karanchahal
Author

karanchahal commented Aug 14, 2019 via email

@shivamsaboo17

shivamsaboo17 commented Aug 14, 2019

Thanks for the reply! I too was unaware of so many of the challenges of working with sparse tensors.
But I am really interested in implementing custom layers in PyTorch for inference only (writing just the forward pass, perhaps using the torch.sparse API) once we have all the boolean masks. Would you be interested in collaborating on implementing such layers? Perhaps we can start specifically with linear layers and then extend to other types of layers.
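For what it's worth, a rough inference-only sketch along those lines, assuming the dense weight already has its pruned entries zeroed (the `SparseLinear` class is hypothetical, and torch.sparse kernels may well not beat the dense matmul here):

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    """Inference-only linear layer whose weight is stored as a sparse COO tensor."""
    def __init__(self, dense_weight, bias=None):
        super().__init__()
        # dense_weight: (out_features, in_features) with pruned entries equal to zero
        self.weight = dense_weight.to_sparse()
        self.bias = bias

    def forward(self, x):
        # torch.sparse.mm(sparse, dense): (out, in) @ (in, batch) -> (out, batch)
        out = torch.sparse.mm(self.weight, x.t()).t()
        return out if self.bias is None else out + self.bias

dense = nn.Linear(784, 100)
dense.weight.data.mul_((dense.weight.data.abs() > 0.02).float())  # crude pruning
sparse_layer = SparseLinear(dense.weight.data, dense.bias.data)
print(sparse_layer(torch.randn(32, 784)).shape)  # torch.Size([32, 100])
```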

@karanchahal
Author

karanchahal commented Aug 14, 2019 via email

@shivamsaboo17

Great! Will start reading these papers.

@shivamsaboo17

I read through the ICLR '17 paper and implemented their algorithm in Python (link to colab). It is not the most efficient implementation, as I used Python loops, but the key takeaway is that the speed increases as the sparsity of the weights increases, whereas PyTorch's conv2d takes almost the same time at all sparsity levels (even with all-zero weights). I will perhaps try to implement the algorithm using PyTorch's C++ extension functionality (I haven't worked with it before), but first I need to figure out how to use CSR sparse matrices in PyTorch (currently I am using scipy).
If you have any suggestions, please let me know!
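Not the paper's direct sparse convolution, but a possible baseline sketch of the same idea with scipy's CSR: unfold the input im2col-style and let the CSR matmul skip the zero filter weights (the helper below is my own illustration, not the Colab code):

```python
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix

def sparse_conv2d(x, weight, stride=1, padding=0):
    """Rough sketch: im2col + CSR matmul so only nonzero filter weights do work.

    x:      (1, C_in, H, W) torch tensor
    weight: (C_out, C_in, kH, kW) torch tensor with pruned entries set to zero
    """
    c_out, c_in, kh, kw = weight.shape
    # Unfold the input into columns: (C_in*kH*kW, L), L = number of output positions
    cols = F.unfold(x, kernel_size=(kh, kw), stride=stride, padding=padding)[0].numpy()
    # Flatten the filters and store them in CSR form; zeros are skipped in the matmul
    w_csr = csr_matrix(weight.reshape(c_out, -1).numpy())
    out = w_csr @ cols                                    # (C_out, L)
    h_out = (x.shape[2] + 2 * padding - kh) // stride + 1
    w_out = (x.shape[3] + 2 * padding - kw) // stride + 1
    return torch.from_numpy(out).reshape(1, c_out, h_out, w_out)

x = torch.randn(1, 3, 64, 64)
w = torch.randn(256, 3, 3, 3)
w.mul_((w.abs() > 1.0).float())                 # ~68% of the weights set to zero
print(sparse_conv2d(x, w, padding=1).shape)     # torch.Size([1, 256, 64, 64])
```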

@karanchahal
Author

karanchahal commented Aug 16, 2019 via email

@shivamsaboo17

I used the numba jit decorator for the sparse convolution function and it runs on CPU (implemented using scipy sparse arrays). I expected it to compile the Python loops down to native code, but when I use nopython=True to compile the entire function I get an error, because numba cannot recognise the scipy sparse matrix format and treats it as a regular Python object.
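One possible workaround, just a guess based on the error described (untested against the notebook): pass the raw CSR buffers (data, indices, indptr) into the jitted function instead of the scipy object, since nopython mode handles plain NumPy arrays:

```python
import numpy as np
from numba import njit
import scipy.sparse as sp

@njit
def csr_row_dot(data, indices, indptr, row, dense_vec):
    # Dot product of one CSR row with a dense vector, touching only the nonzeros.
    acc = 0.0
    for k in range(indptr[row], indptr[row + 1]):
        acc += data[k] * dense_vec[indices[k]]
    return acc

w = sp.random(256, 27, density=0.1, format="csr")   # sparse flattened filters
x = np.random.rand(27)
print(csr_row_dot(w.data, w.indices, w.indptr, 0, x))
```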

I also think I should first try to make the implementation work with Cython and numba before a C++ implementation.

Regarding PyTorch's conv, I think it uses im2col, but I'm not sure. I also think that if we can somehow implement this paper's algorithm using torch's built-in functions and/or optimise the loops, we can get a faster layer.

Will try out a few things this weekend and let you know if I get any improvements.

@karanchahal
Author

karanchahal commented Aug 16, 2019 via email

@shivamsaboo17

The paper actually mentions using the CSR format, as row slicing is very fast. I'm not sure the COO format would be as efficient, but we can try. Converting from COO to CSR should be possible (though I'm not sure exactly how) with a small computational overhead.
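For the scipy path at least, the COO-to-CSR conversion is a one-liner with modest overhead (a small illustration, assuming we stay in scipy on the CPU):

```python
from scipy.sparse import coo_matrix

# .tocsr() sorts and compresses the row indices internally; afterwards row
# slicing (csr[i]) is cheap, which is what the paper's algorithm relies on.
coo = coo_matrix(([1.0, 2.0, 3.0], ([0, 2, 2], [1, 0, 2])), shape=(3, 3))
csr = coo.tocsr()
print(csr[2].toarray())  # [[2. 0. 3.]]
```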

@williamFalcon
Contributor

@shivamsaboo17 @karanchahal https://gitter.im/PyTorch-Lightning/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge

@williamFalcon
Contributor

super excited about this feature!

@shivamsaboo17

shivamsaboo17 commented Aug 21, 2019

@karanchahal @williamFalcon I ported the pure Python code to Cython and got significant speedups.
My experiments are on a 3x64x64 input tensor with filters of size 256x3x3x3.

Pure Python:
50% sparse --> 45 seconds
90% sparse --> 11 seconds
100% sparse --> 60 ms

Cython optimized:
50% sparse --> 13 ms
90% sparse --> 5 ms
100% sparse --> 661 microseconds

For reference: PyTorch's conv2d took 1.9 ms on my machine (CPU). (The previous results were on Colab's CPU.)

google drive link to .pyx and ipynb file:
https://drive.google.com/open?id=1gnrbFNWJBZbyPH6KKnCLmrPBNqOFtKUD
https://drive.google.com/open?id=1--_B89H4iSZuJuj9QKqBRrB5Tlr7DMnH

Link to compiled C file:
https://drive.google.com/open?id=1nCGKRmM4AGcmepEJCkWAl_SBZc2l-rrA

I am looking at more ways to optimize cython code now.

@williamFalcon
Contributor

@sidhanthholalkere @karanchahal spoke with @soumith about this. I think this is better added to core PyTorch. Check out this issue.

Once it's merged and live there we can do whatever we need to do to support it.

Closing to move this work to the PyTorch issue.

@gottbrath

gottbrath commented Sep 18, 2019

Note that we have a notebook with a preview tutorial on eager mode post training quantization in core pytorch over in pytorch/pytorch#18318 ... please check it out and leave feedback.
