It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners (Schick and Schütze, 2020)
We show that performance similar to GPT-3 can be obtained with language models whose parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain some form of task description, combined with gradient-based optimization.
Many tasks can be reformulated as cloze questions, for example by appending a phrase like "the correct answer is ...", which allows pretrained LMs to solve them with no or only very few labeled examples.
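To make this concrete, here is a minimal sketch of a pattern-verbalizer pair for a sentiment-style task. The pattern text and label words below are illustrative choices of mine, not the paper's exact patterns:

```python
# Illustrative pattern-verbalizer pair (PVP). The pattern turns an input into a
# cloze question containing a short task description and a masked slot; the
# verbalizer maps each label to a single word the masked LM should predict there.
# The pattern wording and label words are made-up examples, not the paper's.

def pattern(review: str) -> str:
    return f"{review} Question: Is this review positive or negative? The correct answer is [MASK]."

VERBALIZER = {
    "positive": "great",     # label -> word expected at the [MASK] position
    "negative": "terrible",
}

print(pattern("The plot was predictable, but the acting saved it."))
```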
Radford et al. (2019) observe that LMs are unsupervised multitask learners. Because finding reformulations of tasks as cloze questions that LMs understand well is difficult, we proposed PET (Pattern-Exploiting Training) in a previous paper, a method that uses knowledge distillation to easily combine several reformulations. Our modified version of PET uses masked language models to assign probabilities to sequences of text.
More on the PET algorithm can be found in Section 3 of the paper.
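As a rough sketch of the distillation step (under my own simplifying assumptions: the weighting scheme, the temperature value, and the function name are illustrative), each pattern-specific model scores the unlabeled examples, the per-pattern distributions are averaged, and the resulting soft labels are used to train a single final classifier:

```python
import numpy as np

def soft_labels(per_pattern_logits, weights=None, temperature=2.0):
    """Combine the predictions of pattern-specific models on unlabeled data.

    per_pattern_logits: one (num_examples, num_labels) array per pattern.
    Returns soft labels of shape (num_examples, num_labels) that a single
    final classifier is then trained on.
    """
    weights = weights if weights is not None else [1.0] * len(per_pattern_logits)
    combined = np.zeros_like(per_pattern_logits[0], dtype=float)
    for w, logits in zip(weights, per_pattern_logits):
        z = logits / temperature                      # temperature-scaled softmax
        z = z - z.max(axis=1, keepdims=True)          # numerical stability
        p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        combined += w * p
    return combined / sum(weights)

# Toy usage: two patterns, three unlabeled examples, two labels.
rng = np.random.default_rng(0)
print(soft_labels([rng.normal(size=(3, 2)), rng.normal(size=(3, 2))]))
```

Training the final classifier on these soft labels is what lets one model absorb the knowledge of all pattern-specific models, so no single badly chosen pattern dominates.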
We train and evaluate PET on the following tasks:
- BoolQ
- CB and RTE
- COPA
- WiC
- WSC
- MultiRC
- ReCoRD
We investigate the importance of several factors for good few-shot performance: the choice of patterns and verbalizers, the usage of both unlabeled and labeled data, and properties of the underlying language model.
We modified PET so that it can also be used for tasks that require predicting multiple tokens. We identified several factors responsible for the strong performance of PET combined with pretrained ALBERT: the possibility of concurrently using multiple patterns for transforming examples into cloze questions, the ability to compensate for patterns that are difficult to understand, the use of labeled data to perform parameter updates, and the underlying LM itself.
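The multi-token extension can be illustrated, in simplified form, by a greedy mask-filling loop: insert one mask token per answer token and repeatedly commit the prediction the model is most confident about. This is a hypothetical sketch of the idea, not the paper's exact scoring procedure; the checkpoint name and helper function are my own choices:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative only: a smaller ALBERT checkpoint stands in for the model used in the paper.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModelForMaskedLM.from_pretrained("albert-base-v2")

def fill_masks_greedy(text_with_masks: str) -> str:
    """Fill all [MASK] slots, most confident position first (simplified sketch)."""
    ids = tokenizer(text_with_masks, return_tensors="pt")["input_ids"][0]
    mask_id = tokenizer.mask_token_id
    while (ids == mask_id).any():
        with torch.no_grad():
            logits = model(input_ids=ids.unsqueeze(0)).logits[0]
        mask_positions = (ids == mask_id).nonzero(as_tuple=True)[0]
        probs = logits[mask_positions].softmax(dim=-1)
        confidence, best_token = probs.max(dim=-1)    # best token per mask slot
        pick = confidence.argmax()                    # most confident mask slot
        ids[mask_positions[pick]] = best_token[pick]  # commit that prediction
    return tokenizer.decode(ids, skip_special_tokens=True)

masked = f"A quick brown fox {tokenizer.mask_token} {tokenizer.mask_token} the lazy dog."
print(fill_masks_greedy(masked))
```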