update XNLI examples and README
airaria committed Jan 26, 2022
1 parent 455cd58 commit 041c61e
Showing 10 changed files with 250 additions and 3 deletions.
8 changes: 6 additions & 2 deletions README.md
@@ -139,7 +139,7 @@ We demonstrate the basic usage below.
To perform vocabulary pruning, users should provide a text file or a list of strings. The tokens that do not appear in the texts are removed from the model and the tokenizer.
See the examples at [examples/vocabulary_pruning](examples/vocabulary_pruning).
See the examples at [examples/vocabulary_pruning](examples/vocabulary_pruning) and [examples/vocabulary_pruning_xnli](examples/vocabulary_pruning_xnli).
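To make the idea concrete, here is a minimal, library-free sketch of vocabulary pruning on toy data. It is not TextPruner's implementation: the helper `prune_vocabulary`, the whitespace tokenizer, and the `keep` argument for special tokens are all assumptions made for illustration.

```python
def prune_vocabulary(vocab, embeddings, texts, tokenize, keep=()):
    """Keep only the tokens that occur in `texts` (plus `keep`) and remap ids."""
    used = set(keep)
    for text in texts:
        used.update(tokenize(text))
    # Walk the vocabulary in original id order so the new ids are deterministic.
    kept_tokens = [tok for tok in sorted(vocab, key=vocab.get) if tok in used]
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept_tokens)}
    # Copy the embedding row of every surviving token to its new position.
    new_embeddings = [embeddings[vocab[tok]] for tok in kept_tokens]
    return new_vocab, new_embeddings


# Toy data (hypothetical): five tokens, one 2-d embedding row per token.
vocab = {"<unk>": 0, "hello": 1, "world": 2, "bonjour": 3, "monde": 4}
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
texts = ["hello world", "hello"]

new_vocab, new_embeddings = prune_vocabulary(
    vocab, embeddings, texts, tokenize=str.split, keep=["<unk>"])
print(new_vocab)
print(len(new_embeddings))
```

In this toy run only `<unk>`, `hello`, and `world` survive, so the embedding table shrinks from five rows to three. A real pruner also has to keep the tokenizer files consistent with the new ids.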
#### Use TextPruner as a package
@@ -180,15 +180,19 @@ textpruner-cli \
### Transformer Pruning
* To perform transformer pruning on a dataset, a `dataloader` of the dataset should be provided. The `dataloader` should return both the inputs and the labels.
* TextPruner needs the loss return by the model to calculate neuron importance scores. TextPruner will try to guess which element in the model output is the loss. If none of the following is true:
* TextPruner needs the loss returned by the model to calculate neuron importance scores. TextPruner will try to guess which element in the model output is the loss. If none of the following is true:
  * the model returns a single element, which is the loss;
  * the model output is a list or a tuple whose first element is the loss;
  * the loss can be accessed by `output['loss']` or `output.loss`, where `output` is the model output

  users should provide an `adaptor` function (which takes the output of the model and returns the loss) to the `TransformerPruner`.
* If running in *self-supervised* mode, TextPruner needs the logits returned by the model to calculate importance scores. In this case, the `adaptor` should return the logits. Check the `use_logits` option in `TransformerPruningConfig` for details.
See the examples at [examples/transformer_pruning](examples/transformer_pruning).
For self-supervised pruning, see the examples at [examples/transformer_pruning_xnli](examples/transformer_pruning_xnli).
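The loss-guessing rules listed above can be sketched in plain Python. This is only an illustration of the described behaviour, not TextPruner's actual code: `guess_loss`, the `Output` class, and `logits_adaptor` are hypothetical names, and plain floats stand in for loss tensors.

```python
from collections.abc import Mapping


def guess_loss(output):
    """Illustrative loss lookup following the rules described in the README."""
    if isinstance(output, Mapping) and "loss" in output:
        return output["loss"]          # output['loss']
    if hasattr(output, "loss"):
        return output.loss             # output.loss
    if isinstance(output, (list, tuple)) and output:
        return output[0]               # the loss is the first element
    if isinstance(output, (int, float)):
        return output                  # the model returned only the loss
    raise ValueError("Cannot locate the loss; provide an adaptor instead.")


class Output:
    """Hypothetical model output exposing `loss` and `logits` attributes."""

    def __init__(self, loss, logits):
        self.loss = loss
        self.logits = logits


def logits_adaptor(model_output):
    # In self-supervised mode the adaptor returns the logits, not the loss.
    return model_output.logits


print(guess_loss({"loss": 0.5}))
print(guess_loss((0.3, "other outputs")))
print(guess_loss(Output(loss=0.2, logits=[1.0, 2.0])))
```

When none of the rules match (for example, a model that returns the loss under a different key), an explicit `adaptor` is the way to tell the pruner where to look.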
#### Use TextPruner as a package
```python
5 changes: 4 additions & 1 deletion README_zh.md
@@ -135,7 +135,7 @@ See [Configurations](#configurations) for a description of the configurations.

To perform vocabulary pruning, users should provide a text file or a list of strings. TextPruner removes the tokens that do not appear in the text file or the list from the model and the tokenizer.

See the examples at [examples/vocabulary_pruning](examples/vocabulary_pruning)
See the examples at [examples/vocabulary_pruning](examples/vocabulary_pruning) and [examples/vocabulary_pruning_xnli](examples/vocabulary_pruning_xnli).

#### Use in a script

@@ -179,9 +179,12 @@ textpruner-cli \
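* the loss can be accessed by `output['loss']` or `output.loss`, where `output` is the model output

users should provide an `adaptor` function (which takes the output of the model and returns the loss) to the `TransformerPruner`.
* When running in self-supervised pruning mode, TextPruner needs the logits returned by the model. In this case, the `adaptor` function should return the logits. See the `use_logits` option in `TransformerPruningConfig` for details.

See the examples at [examples/transformer_pruning](examples/transformer_pruning)

For self-supervised pruning, see the examples at [examples/transformer_pruning_xnli](examples/transformer_pruning_xnli)

#### Use in a script

Prune a 12-layer pre-trained model to the target size in 4 iterations, with a target of 8 attention heads per layer and a target feed-forward dimension of 2048: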
6 changes: 6 additions & 0 deletions examples/datasets/xnli/README.md
@@ -0,0 +1,6 @@
Download [XNLI](https://github.com/facebookresearch/XNLI) and put `multinli.train.en.tsv`, `multinli.train.zh.tsv`, `xnli.dev.tsv`, and `xnli.test.tsv` here.

Concatenate the train files:
```bash
cat multinli.train.en.tsv multinli.train.zh.tsv > multinli.train.en_zh.tsv
```
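On systems without `cat`, the same concatenation can be done in Python. The sketch below assumes the two training files sit in the current directory; the helper name `concatenate` is made up for this example.

```python
import shutil
from pathlib import Path


def concatenate(sources, destination):
    """Append each source file to `destination` in order, like `cat a b > c`."""
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as src:
                # Stream the bytes so large files are not loaded into memory.
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    data_dir = Path(".")  # assumed location of the downloaded XNLI files
    sources = [data_dir / "multinli.train.en.tsv",
               data_dir / "multinli.train.zh.tsv"]
    if all(path.exists() for path in sources):
        concatenate(sources, data_dir / "multinli.train.en_zh.tsv")
```

Like `cat`, this copies the files byte for byte, so any header row in the second file ends up in the middle of the combined file.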
1 change: 1 addition & 0 deletions examples/models/xlmr_xnli/README.md
@@ -0,0 +1 @@
Put `pytorch_model.bin`, `config.json` and `sentencepiece.bpe.model` here.
25 changes: 25 additions & 0 deletions examples/transformer_pruning_xnli/README.md
@@ -0,0 +1,25 @@
# Pruning the Classification model

These scripts perform transformer pruning **in a self-supervised way** on the classification model (`XLMRobertaForSequenceClassification`) and evaluate the performance.

Download the fine-tuned model or train your own model on the XNLI dataset, and save the files to `../models/xlmr_xnli`.

Download link:
* [Hugging Face Models](https://huggingface.co/ziqingyang/XLMRobertaBaseForXNLI-en/tree/main)

See the README in `../datasets/xnli` for how to construct the dataset.

* Pruning with the Python script:
```bash
MODEL_PATH=../models/xlmr_xnli
python transformer_pruning_selfsupervised.py $MODEL_PATH
```

* Evaluate the model:

Set `$PRUNED_MODEL_PATH` to the directory where the pruned model is stored.

```bash
cp $MODEL_PATH/sentencepiece.bpe.model $PRUNED_MODEL_PATH
python measure_performance.py $PRUNED_MODEL_PATH
```
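As a rough illustration of what "self-supervised" can mean here, the sketch below treats the model's own softmax output as soft labels and computes the cross-entropy against them, which equals the entropy of the prediction. This reading is an assumption (suggested by the `#entropy(logits)` comment in the pruning script), not a description of TextPruner's internals; `softmax` and `soft_label_loss` are names chosen for the example.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(logits):
    """Cross-entropy between the model's own prediction and itself."""
    probs = softmax(logits)
    # In a real framework the soft labels would be detached constants.
    soft_labels = probs
    return -sum(t * math.log(p) for t, p in zip(soft_labels, probs) if t > 0)


uniform = soft_label_loss([0.0, 0.0, 0.0])   # maximally uncertain prediction
peaked = soft_label_loss([10.0, 0.0, 0.0])   # confident prediction
print(uniform, peaked)
```

No gold labels are involved: a loss of this kind depends only on the logits, which is why the `adaptor` has to return them.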
29 changes: 29 additions & 0 deletions examples/transformer_pruning_xnli/measure_performance.py
@@ -0,0 +1,29 @@
import logging
logging.basicConfig(level = logging.INFO,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

from transformers import XLMRobertaForSequenceClassification,XLMRobertaTokenizer
import sys, os

sys.path.insert(0, os.path.abspath('..'))

from classification_utils.my_dataset import MultilingualNLIDataset
from classification_utils.predict_function import predict

model_path = sys.argv[1]
taskname = 'xnli'
data_dir = '../datasets/xnli'
split = 'test'
max_seq_length=128
eval_langs = ['en','zh']
batch_size=32
device = 'cuda'

# Load the pruned model and its tokenizer
model = XLMRobertaForSequenceClassification.from_pretrained(model_path).to(device)
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
eval_dataset = MultilingualNLIDataset(
    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
    max_seq_length=max_seq_length, langs=eval_langs, tokenizer=tokenizer)
eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
predict(model, eval_datasets, eval_langs, device, batch_size)
67 changes: 67 additions & 0 deletions examples/transformer_pruning_xnli/transformer_pruning_selfsupervised.py
@@ -0,0 +1,67 @@
import logging
logging.basicConfig(level = logging.INFO,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

from transformers import XLMRobertaForSequenceClassification,XLMRobertaTokenizer
from textpruner import summary, TransformerPruner, TransformerPruningConfig
import sys, os

sys.path.insert(0, os.path.abspath('..'))

from classification_utils.dataloader_script_xnli import dataloader, eval_langs, batch_size,MultilingualNLIDataset
from classification_utils.predict_function import predict

model_path = sys.argv[1]
model = XLMRobertaForSequenceClassification.from_pretrained(model_path)
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)

print("Before pruning:")
print(summary(model))

def adaptor(model_outputs):
    # Self-supervised pruning (use_logits=True) needs the logits, not the loss
    logits = model_outputs.logits
    return logits #entropy(logits)


transformer_pruning_config = TransformerPruningConfig(
    target_ffn_size=1536, target_num_of_heads=6,
    pruning_method='iterative', n_iters=8, use_logits=True,
    head_even_masking=False, ffn_even_masking=False)
pruner = TransformerPruner(model, transformer_pruning_config=transformer_pruning_config)
pruner.prune(dataloader=dataloader, save_model=False, adaptor=adaptor)

# save the tokenizer to the same place
#tokenizer.save_pretrained(pruner.save_dir)

print("After pruning:")
print(summary(model))

# Print the FFN and attention-key shapes of each of the 12 encoder layers
for i in range(12):
    print((model.base_model.encoder.layer[i].intermediate.dense.weight.shape,
           model.base_model.encoder.layer[i].intermediate.dense.bias.shape,
           model.base_model.encoder.layer[i].attention.self.key.weight.shape))


print("Measure performance")
taskname = 'xnli'
data_dir = '../datasets/xnli'
split = 'dev'
max_seq_length=128
eval_langs = ['en','zh']
batch_size=32
device= model.device
eval_dataset = MultilingualNLIDataset(
    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
    max_seq_length=max_seq_length, langs=eval_langs, tokenizer=tokenizer)
eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
print("dev")
predict(model, eval_datasets, eval_langs, device, batch_size)

split="test"
print("test")
eval_dataset = MultilingualNLIDataset(
    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
    max_seq_length=max_seq_length, langs=eval_langs, tokenizer=tokenizer)
eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
predict(model, eval_datasets, eval_langs, device, batch_size)

print(transformer_pruning_config)
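For intuition about `pruning_method='iterative'` with `n_iters=8`, the following sketch computes intermediate sizes under the assumption that the model shrinks linearly from its original size to the target. Both the linear schedule and the helper `linear_schedule` are assumptions for illustration; TextPruner's real schedule may differ. The starting values (FFN size 3072, 12 heads per layer) are those of an XLM-RoBERTa base model.

```python
def linear_schedule(original, target, n_iters):
    """Sizes after each iteration, shrinking linearly from original to target."""
    return [round(original - (original - target) * (i + 1) / n_iters)
            for i in range(n_iters)]


# Values taken from the config above; the starting sizes assume XLM-R base.
ffn_sizes = linear_schedule(original=3072, target=1536, n_iters=8)
head_counts = linear_schedule(original=12, target=6, n_iters=8)
print(ffn_sizes)
print(head_counts)
```

Under this assumption the FFN size drops by 192 per iteration and reaches 1536 on the last one. Smaller steps give the importance scores a chance to be recomputed on the partially pruned model between iterations.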
27 changes: 27 additions & 0 deletions examples/vocabulary_pruning_xnli/README.md
@@ -0,0 +1,27 @@
# Pruning the Classification model

These scripts perform vocabulary pruning on the classification model (`XLMRobertaForSequenceClassification`) and evaluate the performance.

We use the English and Chinese training sets as the vocabulary file.

Download the fine-tuned model or train your own model on the XNLI dataset, and save the files to `../models/xlmr_xnli`.

Download link:
* [Hugging Face Models](https://huggingface.co/ziqingyang/XLMRobertaBaseForXNLI-en/tree/main)

See the README in `../datasets/xnli` for how to construct the dataset.

* Pruning with the Python script:
```bash
VOCABULARY_FILE=../datasets/xnli/multinli.train.en_zh.tsv
MODEL_PATH=../models/xlmr_xnli
python vocabulary_pruning.py $MODEL_PATH $VOCABULARY_FILE
```

* Evaluate the model:

Set `$PRUNED_MODEL_PATH` to the directory where the pruned model is stored.

```bash
python measure_performance.py $PRUNED_MODEL_PATH
```
29 changes: 29 additions & 0 deletions examples/vocabulary_pruning_xnli/measure_performance.py
@@ -0,0 +1,29 @@
import logging
logging.basicConfig(level = logging.INFO,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

from transformers import XLMRobertaForSequenceClassification,XLMRobertaTokenizer
import sys, os

sys.path.insert(0, os.path.abspath('..'))

from classification_utils.my_dataset import MultilingualNLIDataset
from classification_utils.predict_function import predict

model_path = sys.argv[1]
taskname = 'xnli'
data_dir = '../datasets/xnli'
split = 'test'
max_seq_length=128
eval_langs = ['en']
batch_size=32
device = 'cuda'

# Load the pruned model and its tokenizer
model = XLMRobertaForSequenceClassification.from_pretrained(model_path).to(device)
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
eval_dataset = MultilingualNLIDataset(
    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
    max_seq_length=max_seq_length, langs=eval_langs, tokenizer=tokenizer)
eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
predict(model, eval_datasets, eval_langs, device, batch_size)
56 changes: 56 additions & 0 deletions examples/vocabulary_pruning_xnli/vocabulary_pruning.py
@@ -0,0 +1,56 @@
import logging
logging.basicConfig(level = logging.INFO,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
from transformers import XLMRobertaForSequenceClassification,XLMRobertaTokenizer
from textpruner import summary, VocabularyPruner
from textpruner.commands.utils import read_file_line_by_line
import sys, os

sys.path.insert(0, os.path.abspath('..'))
from classification_utils.my_dataset import MultilingualNLIDataset
from classification_utils.predict_function import predict

# Initialize your model and load data
model_path = sys.argv[1]
vocabulary = sys.argv[2]
model = XLMRobertaForSequenceClassification.from_pretrained(model_path)
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
texts, _ = read_file_line_by_line(vocabulary)

print("Before pruning:")
print(summary(model))

pruner = VocabularyPruner(model, tokenizer)
pruner.prune(dataiter=texts, save_model=True)

print("After pruning:")
print(summary(model))


print("Measure performance")

taskname = 'xnli'
data_dir = '../datasets/xnli'
split = 'dev'
max_seq_length=128
eval_langs = ['zh','en']
batch_size=32
device= model.device

# Re-initialize the tokenizer from the directory where the pruned model was saved
tokenizer = XLMRobertaTokenizer.from_pretrained(pruner.save_dir)
eval_dataset = MultilingualNLIDataset(
    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
    max_seq_length=max_seq_length, langs=eval_langs, tokenizer=tokenizer)
eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
print("dev")
predict(model, eval_datasets, eval_langs, device, batch_size)

split="test"
print("test")
eval_dataset = MultilingualNLIDataset(
    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
    max_seq_length=max_seq_length, langs=eval_langs, tokenizer=tokenizer)
eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]

predict(model, eval_datasets, eval_langs, device, batch_size)
