This document presents step-by-step instructions for pruning Hugging Face models on translation tasks with Intel® Neural Compressor, using Flan-T5-small pruning as a worked example.
PyTorch 1.8 or higher is required, with the pytorch_fx backend.
```shell
pip install -r requirements.txt
```
The dataset is downloaded automatically from the Hugging Face datasets Hub; see the Hugging Face documentation on loading datasets for more details.
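For reference, a minimal sketch of loading the dataset used in this example with the `datasets` library (the `wmt16` dataset with its `ro-en` configuration):

```python
# A minimal sketch of loading the WMT16 English-Romanian data used here;
# the dataset and configuration names follow the Hugging Face Hub.
from datasets import load_dataset

raw_datasets = load_dataset("wmt16", "ro-en")
print(raw_datasets)              # train / validation / test splits
print(raw_datasets["train"][0])  # {'translation': {'en': ..., 'ro': ...}}
```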
The Flan-T5 model can be downloaded from the Hugging Face Hub. More details on running this PyTorch model are available in the Model Usage section of the model card.
```shell
git lfs install
git clone https://huggingface.co/google/flan-t5-small
cd flan-t5-small
git lfs pull
```
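As a quick sanity check that the checkpoint downloaded correctly, the model can be loaded with the standard `transformers` API. This is only a sketch: the local path assumes the `git clone` above (using `"google/flan-t5-small"` directly also works), and the translation prompt is purely illustrative.

```python
# Load the downloaded checkpoint and run a single illustrative translation.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("./flan-t5-small")

inputs = tokenizer("translate English to Romanian: The weather is nice today.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```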
An example of fine-tuning Flan-T5 is provided to produce a suitable dense baseline model for the pruning jobs, as sketched below.
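The following is a condensed sketch of such a fine-tuning run using the standard `transformers` `Seq2SeqTrainer`; the hyperparameters, sequence lengths, and output path are illustrative assumptions, not the exact baseline recipe used in this example.

```python
# A condensed fine-tuning sketch for Flan-T5-small on WMT16 en-ro.
# Hyperparameters and paths below are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
raw = load_dataset("wmt16", "ro-en")

def preprocess(examples):
    # T5-style task prefix; target sentences are tokenized as labels.
    inputs = ["translate English to Romanian: " + ex["en"]
              for ex in examples["translation"]]
    targets = [ex["ro"] for ex in examples["translation"]]
    batch = tokenizer(inputs, max_length=128, truncation=True)
    batch["labels"] = tokenizer(text_target=targets, max_length=128,
                                truncation=True)["input_ids"]
    return batch

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="./flan-t5-small-finetuned",
                                  learning_rate=5e-5, num_train_epochs=1,
                                  per_device_train_batch_size=8,
                                  predict_with_generate=True),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./flan-t5-small-finetuned")
```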
An example of pruning a Flan-T5-small model trained on the WMT16 English-Romanian task is provided. We are working on providing more pruning examples and sharing our sparse models on the Hugging Face Hub.
The snip-momentum pruning method is used by default, and the initial dense model is fine-tuned before pruning.
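The sketch below shows how such a pruning run can be driven with Intel® Neural Compressor's training API (`WeightPruningConfig` plus `prepare_compression`). The pruning schedule (`start_step`/`end_step`), epoch count, batch size, and paths are illustrative assumptions, and `tokenized` refers to the preprocessed dataset from the fine-tuning sketch above.

```python
# A minimal sketch of snip-momentum pruning with Intel Neural Compressor.
# Schedule boundaries, epoch count, and paths are illustrative assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq)
from neural_compressor import WeightPruningConfig
from neural_compressor.training import prepare_compression

# Start from the fine-tuned dense baseline produced above.
tokenizer = AutoTokenizer.from_pretrained("./flan-t5-small-finetuned")
model = AutoModelForSeq2SeqLM.from_pretrained("./flan-t5-small-finetuned")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# `tokenized` is the preprocessed WMT16 en-ro dataset from the previous sketch.
train_dataloader = DataLoader(
    tokenized["train"], batch_size=8, shuffle=True,
    collate_fn=DataCollatorForSeq2Seq(tokenizer, model=model))

prune_config = WeightPruningConfig(
    pruning_type="snip_momentum",  # default pruning algorithm
    pattern="4x1",                 # structured 4x1 sparsity pattern
    target_sparsity=0.8,           # 80% of weights zeroed by end_step
    start_step=1000,               # schedule boundaries are illustrative
    end_step=10000,
)
compression_manager = prepare_compression(model, prune_config)
compression_manager.callbacks.on_train_begin()

model.train()
for epoch in range(3):             # epoch count is illustrative
    for step, batch in enumerate(train_dataloader):
        compression_manager.callbacks.on_step_begin(step)
        loss = model(**batch).loss
        loss.backward()
        compression_manager.callbacks.on_before_optimizer_step()
        optimizer.step()
        compression_manager.callbacks.on_after_optimizer_step()
        optimizer.zero_grad()
        compression_manager.callbacks.on_step_end()

compression_manager.callbacks.on_train_end()
model.save_pretrained("./flan-t5-small-pruned")
```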
| Model | Dataset | Target sparsity | Sparsity pattern | Dense BLEU | Sparse BLEU | Relative drop |
|---|---|---|---|---|---|---|
| Flan-T5-small | wmt16 en-ro | 0.8 | 4x1 | 25.63 | 24.35 | -4.95% |