update XNLI examples and README

airaria · Jan 26, 2022 · 041c61e
1 parent 455cd58
commit 041c61e
Showing 10 changed files with 250 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -139,7 +139,7 @@ We demonstrate the basic usage below.
 
 To perform vocabulary pruning, users should provide a text file or a list of strings. The tokens that do not appear in the texts are removed from the model and the tokenizer.
 
-See the examples at [examples/vocabulary_pruning](examples/vocabulary_pruning).
+See the examples at [examples/vocabulary_pruning](examples/vocabulary_pruning) and [examples/vocabulary_pruning_xnli](examples/vocabulary_pruning_xnli).
 
 #### Use TextPruner as a package
 
@@ -180,15 +180,19 @@ textpruner-cli  \
 ### Transformer Pruning
 
 * To perform transformer pruning on a dataset, a `dataloader` of the dataset should be provided. The `dataloader` should return both the inputs and the labels. 
-* TextPruner needs the loss return by the model to calculate neuron importance scores. TextPruner will try to guess which element in the model output is the loss. If none of the following is true:
+* TextPruner needs the loss returned by the model to calculate neuron importance scores. TextPruner will try to guess which element in the model output is the loss. If none of the following is true:
   * the model returns a single element, which is the loss;
   * the model output is a list or a tuple whose first element is the loss;
   * the loss can be accessed by `output['loss']` or `output.loss`, where `output` is the model output
 
   users should provide an `adaptor` function (which takes the output of the model and returns the loss) to the `TransformerPruner`.
 
+  * When running in *self-supervised* mode, TextPruner needs the logits returned by the model to calculate importance scores. In this case, the `adaptor` should return the logits. See the `use_logits` option in `TransformerPruningConfig` for details.
+
 See the examples at [examples/transformer_pruning](examples/transformer_pruning).
 
+For self-supervised pruning, see the examples at [examples/transformer_pruning_xnli](examples/transformer_pruning_xnli).
+
 #### Use TextPruner as a package
 
 ```python

diff --git a/README_zh.md b/README_zh.md
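@@ -135,7 +135,7 @@ For a description of the configurations, see [Configurations](#configurations).
 
 To perform vocabulary pruning, users should provide a text file or a list of strings. TextPruner removes from the model and the tokenizer the tokens that do not appear in the text file or the list.
 
-See the examples at [examples/vocabulary_pruning](examples/vocabulary_pruning)
+See the examples at [examples/vocabulary_pruning](examples/vocabulary_pruning) and [examples/vocabulary_pruning_xnli](examples/vocabulary_pruning_xnli).
 
 #### Use in scripts
 
@@ -179,9 +179,12 @@ textpruner-cli  \
   * the loss can be accessed by `output['loss']` or `output.loss`, where `output` is the model output
 
   then users should provide an `adaptor` function (which takes the model output as input and returns the loss) to the `TransformerPruner`.
+* When running in self-supervised pruning mode, TextPruner needs the logits returned by the model. In this case the `adaptor` function should return the logits. See the `use_logits` option in `TransformerPruningConfig` for details.
 
 See the examples at [examples/transformer_pruning](examples/transformer_pruning)
 
+For self-supervised pruning, see the examples at [examples/transformer_pruning_xnli](examples/transformer_pruning_xnli)
+
 #### Use in scripts
 
 Prune a 12-layer pre-trained model to a target of 8 attention heads per layer and a feed-forward dimension of 2048, reaching the target size in 4 iterations: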

diff --git a/examples/datasets/xnli/README.md b/examples/datasets/xnli/README.md
@@ -0,0 +1,6 @@
+Download [XNLI](https://github.com/facebookresearch/XNLI) and put `multinli.train.en.tsv`, `multinli.train.zh.tsv`, `xnli.dev.tsv` and `xnli.test.tsv` here.
+
+Concatenate the train files:
+```bash
+cat multinli.train.en.tsv multinli.train.zh.tsv > multinli.train.en_zh.tsv
+```
diff --git a/examples/models/xlmr_xnli/README.md b/examples/models/xlmr_xnli/README.md
@@ -0,0 +1 @@
+Put `pytorch_model.bin`, `config.json` and `sentencepiece.bpe.model` here.
diff --git a/examples/transformer_pruning_xnli/README.md b/examples/transformer_pruning_xnli/README.md
@@ -0,0 +1,25 @@
+# Pruning the Classification model
+
+These scripts perform transformer pruning **in a self-supervised way** on the classification model (`XLMRobertaForSequenceClassification`) and evaluate the performance.
+
+Download the fine-tuned model or train your own model on the XNLI dataset, and save the files to `../models/xlmr_xnli`.
+
+Download link:
+* [Hugging Face Models](https://huggingface.co/ziqingyang/XLMRobertaBaseForXNLI-en/tree/main)
+
+See the README in `../datasets/xnli` for how to construct the dataset.
+
+* Pruning with the Python script:
+```bash
+MODEL_PATH=../models/xlmr_xnli
+python transformer_pruning_selfsupervised.py $MODEL_PATH
+```
+
+* Evaluate the model:
+
+Set `$PRUNED_MODEL_PATH` to the directory where the pruned model is stored.
+
+```bash
+cp $MODEL_PATH/sentencepiece.bpe.model $PRUNED_MODEL_PATH
+python measure_performance.py $PRUNED_MODEL_PATH
+```
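The adaptor contract described in the README change (the loss by default, the logits in self-supervised mode) can be sketched in plain Python. This is a minimal illustration of the three documented rules, not TextPruner's actual implementation: `extract_training_signal` and `logits_adaptor` are hypothetical names, and plain floats stand in for tensors.

```python
from types import SimpleNamespace


def logits_adaptor(output):
    """Hypothetical adaptor for self-supervised mode: return the logits."""
    return output.logits


def extract_training_signal(output, adaptor=None):
    """Return what the pruner should score with, following the README rules.

    An explicit adaptor always wins. Otherwise the loss is guessed from
    the model output; if no rule applies, the caller must pass an adaptor.
    """
    if adaptor is not None:
        return adaptor(output)
    # Rule: the loss is reachable as output['loss'] ...
    if isinstance(output, dict) and "loss" in output:
        return output["loss"]
    # ... or as output.loss.
    if getattr(output, "loss", None) is not None:
        return output.loss
    # Rule: the output is a list or tuple whose first element is the loss.
    if isinstance(output, (list, tuple)) and output:
        return output[0]
    # Rule: the output is a single element, which is the loss
    # (assumption: a bare number stands in for a scalar tensor here).
    if isinstance(output, (int, float)):
        return output
    raise ValueError("Cannot guess the loss; please provide an adaptor.")


# Stand-in for a model output object with both fields.
demo_output = SimpleNamespace(loss=0.5, logits=[0.1, 0.9])
print(extract_training_signal(demo_output))
print(extract_training_signal(demo_output, adaptor=logits_adaptor))
```

In the self-supervised script of this commit, the adaptor plays the role of `logits_adaptor` and is passed to `pruner.prune(...)` together with `use_logits=True` in the configuration.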
diff --git a/examples/transformer_pruning_xnli/measure_performance.py b/examples/transformer_pruning_xnli/measure_performance.py
@@ -0,0 +1,29 @@
+import logging
+logging.basicConfig(level = logging.INFO,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+logger = logging.getLogger(__name__)
+
+from transformers import XLMRobertaForSequenceClassification,XLMRobertaTokenizer
+import sys, os
+
+sys.path.insert(0, os.path.abspath('..'))
+
+from classification_utils.my_dataset import MultilingualNLIDataset
+from classification_utils.predict_function import predict
+
+model_path = sys.argv[1]
+taskname = 'xnli'
+data_dir = '../datasets/xnli'
+split = 'test'
+max_seq_length=128
+eval_langs = ['en','zh']
+batch_size=32
+device = 'cuda'
+
+# Load the pruned model and its tokenizer
+model = XLMRobertaForSequenceClassification.from_pretrained(model_path).to(device)
+tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
+eval_dataset = MultilingualNLIDataset(
+    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
+    max_seq_length=max_seq_length, langs=eval_langs,  tokenizer=tokenizer)
+eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
+predict(model, eval_datasets, eval_langs, device, batch_size)
diff --git a/examples/transformer_pruning_xnli/transformer_pruning_selfsupervised.py b/examples/transformer_pruning_xnli/transformer_pruning_selfsupervised.py
@@ -0,0 +1,67 @@
+import logging
+logging.basicConfig(level = logging.INFO,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+logger = logging.getLogger(__name__)
+
+from transformers import XLMRobertaForSequenceClassification,XLMRobertaTokenizer
+from textpruner import summary, TransformerPruner, TransformerPruningConfig
+import sys, os
+
+sys.path.insert(0, os.path.abspath('..'))
+
+from classification_utils.dataloader_script_xnli import dataloader, eval_langs, batch_size,MultilingualNLIDataset
+from classification_utils.predict_function import predict
+
+model_path = sys.argv[1]
+model = XLMRobertaForSequenceClassification.from_pretrained(model_path)
+tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
+
+print("Before pruning:")
+print(summary(model))
+
+def adaptor(model_outputs):
+    logits = model_outputs.logits
+    return logits  # entropy(logits)
+
+
+transformer_pruning_config = TransformerPruningConfig(
+    target_ffn_size=1536, target_num_of_heads=6,
+    pruning_method='iterative', n_iters=8, use_logits=True, head_even_masking=False, ffn_even_masking=False)
+pruner = TransformerPruner(model, transformer_pruning_config=transformer_pruning_config)
+pruner.prune(dataloader=dataloader, save_model=False, adaptor=adaptor)
+
+# save the tokenizer to the same place
+#tokenizer.save_pretrained(pruner.save_dir)
+
+print("After pruning:")
+print(summary(model))
+
+for i in range(12):
+    print ((model.base_model.encoder.layer[i].intermediate.dense.weight.shape,
+            model.base_model.encoder.layer[i].intermediate.dense.bias.shape,
+            model.base_model.encoder.layer[i].attention.self.key.weight.shape))
+
+
+print("Measure performance")
+taskname = 'xnli'
+data_dir = '../datasets/xnli'
+split = 'dev'
+max_seq_length=128
+eval_langs = ['en','zh']
+batch_size=32
+device= model.device
+eval_dataset = MultilingualNLIDataset(
+    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
+    max_seq_length=max_seq_length, langs=eval_langs,  tokenizer=tokenizer)
+eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
+print("dev")
+predict(model, eval_datasets, eval_langs, device, batch_size)
+
+split="test"
+print("test")
+eval_dataset = MultilingualNLIDataset(
+    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
+    max_seq_length=max_seq_length, langs=eval_langs,  tokenizer=tokenizer)
+eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
+predict(model, eval_datasets, eval_langs, device, batch_size)
+
+print(transformer_pruning_config)
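The shapes printed per layer in the script above can be read back into an FFN size and a head count. A minimal sketch, assuming the usual `(out_features, in_features)` layout of linear-layer weights and a head size of 64 (768 hidden units over 12 heads in XLM-R base); `layer_sizes` and `average_sizes` are hypothetical helpers, not part of TextPruner.

```python
def layer_sizes(ffn_weight_shape, key_weight_shape, head_size=64):
    """Derive (ffn_size, num_heads) from the two weight shapes of one layer."""
    ffn_size, hidden_size = ffn_weight_shape
    all_head_size, key_hidden_size = key_weight_shape
    if key_hidden_size != hidden_size:
        raise ValueError("FFN and attention weights disagree on the hidden size")
    num_heads, remainder = divmod(all_head_size, head_size)
    if remainder:
        raise ValueError("key projection size is not a multiple of head_size")
    return ffn_size, num_heads


def average_sizes(layer_shapes, head_size=64):
    """Average FFN size and head count over (ffn_shape, key_shape) pairs."""
    sizes = [layer_sizes(ffn, key, head_size) for ffn, key in layer_shapes]
    count = len(sizes)
    return (sum(s[0] for s in sizes) / count,
            sum(s[1] for s in sizes) / count)


# An unpruned XLM-R base layer versus a layer pruned to the script's targets.
print(layer_sizes((3072, 768), (768, 768)))
print(layer_sizes((1536, 768), (384, 768)))
```

Because the script sets `head_even_masking=False` and `ffn_even_masking=False`, individual layers may end up with different sizes; under the assumption that the targets are per-layer averages, `average_sizes` is the more meaningful check against `target_ffn_size=1536` and `target_num_of_heads=6`.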
diff --git a/examples/vocabulary_pruning_xnli/README.md b/examples/vocabulary_pruning_xnli/README.md
@@ -0,0 +1,27 @@
+# Pruning the Classification model
+
+These scripts perform vocabulary pruning on the classification model (`XLMRobertaForSequenceClassification`) and evaluate the performance.
+
+We use the English and Chinese training sets as the vocabulary file.
+
+Download the fine-tuned model or train your own model on the XNLI dataset, and save the files to `../models/xlmr_xnli`.
+
+Download link:
+* [Hugging Face Models](https://huggingface.co/ziqingyang/XLMRobertaBaseForXNLI-en/tree/main)
+
+See the README in `../datasets/xnli` for how to construct the dataset.
+
+* Pruning with the Python script:
+```bash
+VOCABULARY_FILE=../datasets/xnli/multinli.train.en_zh.tsv
+MODEL_PATH=../models/xlmr_xnli
+python vocabulary_pruning.py $MODEL_PATH $VOCABULARY_FILE
+```
+
+* Evaluate the model:
+
+Set `$PRUNED_MODEL_PATH` to the directory where the pruned model is stored.
+
+```bash
+python measure_performance.py $PRUNED_MODEL_PATH
+```
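The idea behind vocabulary pruning (tokens that never appear in the provided texts are dropped) can be illustrated with a toy vocabulary. This is a conceptual sketch only: `prune_vocabulary` is a hypothetical function, whitespace splitting stands in for the real SentencePiece tokenizer, and TextPruner additionally shrinks the model's embedding matrix, which is omitted here.

```python
def prune_vocabulary(vocab, texts, tokenize, special_tokens=()):
    """Keep only tokens seen in `texts`, plus the special tokens.

    vocab maps token -> old id. Returns (new_vocab, old_to_new), where
    new ids are contiguous and preserve the original id order.
    """
    seen = set(special_tokens)
    for text in texts:
        seen.update(tokenize(text))
    # Walk the vocabulary in old-id order so surviving tokens keep their relative order.
    kept = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1]) if tok in seen]
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    old_to_new = {vocab[tok]: new_id for tok, new_id in new_vocab.items()}
    return new_vocab, old_to_new


toy_vocab = {"<s>": 0, "</s>": 1, "hello": 2, "world": 3, "bonjour": 4}
new_vocab, old_to_new = prune_vocabulary(
    toy_vocab, ["world"], str.split, special_tokens=("<s>", "</s>"))
print(new_vocab)
print(old_to_new)
```

The `old_to_new` mapping is what would be used to select the surviving rows of an embedding matrix; in this example the English and Chinese training texts play the role of `texts`.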
diff --git a/examples/vocabulary_pruning_xnli/measure_performance.py b/examples/vocabulary_pruning_xnli/measure_performance.py
@@ -0,0 +1,29 @@
+import logging
+logging.basicConfig(level = logging.INFO,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+logger = logging.getLogger(__name__)
+
+from transformers import XLMRobertaForSequenceClassification,XLMRobertaTokenizer
+import sys, os
+
+sys.path.insert(0, os.path.abspath('..'))
+
+from classification_utils.my_dataset import MultilingualNLIDataset
+from classification_utils.predict_function import predict
+
+model_path = sys.argv[1]
+taskname = 'xnli'
+data_dir = '../datasets/xnli'
+split = 'test'
+max_seq_length=128
+eval_langs = ['en']
+batch_size=32
+device = 'cuda'
+
+# Load the pruned model and its tokenizer
+model = XLMRobertaForSequenceClassification.from_pretrained(model_path).to(device)
+tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
+eval_dataset = MultilingualNLIDataset(
+    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
+    max_seq_length=max_seq_length, langs=eval_langs,  tokenizer=tokenizer)
+eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
+predict(model, eval_datasets, eval_langs, device, batch_size)
diff --git a/examples/vocabulary_pruning_xnli/vocabulary_pruning.py b/examples/vocabulary_pruning_xnli/vocabulary_pruning.py
@@ -0,0 +1,56 @@
+import logging
+logging.basicConfig(level = logging.INFO,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+logger = logging.getLogger(__name__)
+from transformers import XLMRobertaForSequenceClassification,XLMRobertaTokenizer
+from textpruner import summary, VocabularyPruner
+from textpruner.commands.utils import read_file_line_by_line
+import sys, os
+
+sys.path.insert(0, os.path.abspath('..'))
+from classification_utils.my_dataset import MultilingualNLIDataset
+from classification_utils.predict_function import predict
+
+# Initialize your model and load data
+model_path = sys.argv[1]
+vocabulary = sys.argv[2]
+model = XLMRobertaForSequenceClassification.from_pretrained(model_path)
+tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
+texts, _ = read_file_line_by_line(vocabulary)
+
+print("Before pruning:")
+print(summary(model))
+
+pruner = VocabularyPruner(model, tokenizer)
+pruner.prune(dataiter=texts, save_model=True)
+
+print("After pruning:")
+print(summary(model))
+
+
+print("Measure performance")
+
+taskname = 'xnli'
+data_dir = '../datasets/xnli'
+split = 'dev'
+max_seq_length=128
+eval_langs = ['zh','en']
+batch_size=32
+device= model.device
+
+# Re-initialize the tokenizer from the pruned model directory
+tokenizer = XLMRobertaTokenizer.from_pretrained(pruner.save_dir)
+eval_dataset = MultilingualNLIDataset(
+    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
+    max_seq_length=max_seq_length, langs=eval_langs,  tokenizer=tokenizer)
+eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
+print("dev")
+predict(model, eval_datasets, eval_langs, device, batch_size)
+
+split="test"
+print("test")
+eval_dataset = MultilingualNLIDataset(
+    task=taskname, data_dir=data_dir, split=split, prefix='xlmr',
+    max_seq_length=max_seq_length, langs=eval_langs,  tokenizer=tokenizer)
+eval_datasets = [eval_dataset.lang_datasets[lang] for lang in eval_langs]
+
+predict(model, eval_datasets, eval_langs, device, batch_size)