In this paper we show that commonsense inference is still difficult for even state-of-the-art models by presenting HellaSwag, a new challenge dataset. Humans achieve 95% accuracy, while state-of-the-art models struggle at below 48%. We use Adversarial Filtering (AF), as in Zellers et al. (2018), wherein a series of discriminators iteratively selects an adversarial set of machine-generated wrong answers. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
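To make the filtering loop concrete, here is a minimal sketch in Python. The `train_discriminator` callable and the per-context candidate pools are assumptions standing in for the paper's trained generators and BERT-style discriminators, and the dictionary format is a simplification rather than the actual pipeline.

```python
import random
from typing import Callable, Dict, List

# A discriminator maps a (context, ending) pair to a "looks human-written" score.
Discriminator = Callable[[str, str], float]


def adversarial_filtering(
    contexts: List[str],
    gold_endings: Dict[str, str],
    candidate_pool: Dict[str, List[str]],
    train_discriminator: Callable[[Dict[str, str], Dict[str, List[str]]], Discriminator],
    n_rounds: int = 20,
    n_distractors: int = 3,
) -> Dict[str, List[str]]:
    """Iteratively replace distractors the current discriminator rejects
    with candidates it mistakes for human-written text."""
    # Start from random machine-generated endings for every context.
    distractors = {c: random.sample(candidate_pool[c], n_distractors) for c in contexts}
    for _ in range(n_rounds):
        # Retrain a discriminator on the current version of the dataset.
        score = train_discriminator(gold_endings, distractors)
        for c in contexts:
            for i, ending in enumerate(distractors[c]):
                # A distractor the discriminator scores below the gold ending is
                # "easy"; swap in the unused candidate it finds most human-like.
                if score(c, ending) < score(c, gold_endings[c]):
                    fresh = [e for e in candidate_pool[c] if e not in distractors[c]]
                    if fresh:
                        distractors[c][i] = max(fresh, key=lambda e: score(c, e))
    return distractors
```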
When the SWAG dataset was first introduced (Zellers et al., 2018), this new task of commonsense natural language inference seemed trivial for humans (88%) yet challenging for then-state-of-the-art models such as ELMo. But BERT soon reached 86%, nearly human-level performance, prompting announcements that BERT could really finish your sentence.
In this paper we argue that the underlying task remains unsolved, and that deep models do not demonstrate robust commonsense reasoning ability on their own. Instead, they operate more like rapid surface learners for a particular dataset. Their strong performance on SWAG depends on the finetuning process, during which they learn to pick up on dataset-specific distributional biases. When the distribution of language shifts even slightly, performance drops drastically.
HellaSwag has 70k problems. To make the dataset robust to deep pretrained models, we use a trifecta of state-of-the-art generators, state-of-the-art discriminators (BERT), and high-quality source text. We expand on SWAG's original video-captioning domain by adding WikiHow articles. Our investigation reveals a Goldilocks zone, roughly three sentences of context and two generated sentences, where generations are largely nonsensical to humans, yet state-of-the-art discriminators cannot reliably tell them apart from the true endings.
While the best-known ELMo NLI model, ESIM+ELMo, requires the entire training set to reach 59%, BERT outperforms it after seeing only 64 examples. Still, BERT needs upwards of 16k examples to approach human performance, at which point it plateaus.
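One way to measure such a sample-efficiency curve is to finetune on increasingly large random subsets of the training data. The sketch below assumes hypothetical `finetune` and `evaluate` callables supplied by the caller; it is not the paper's exact protocol.

```python
import random
from typing import Callable, List, Sequence, Tuple


def learning_curve(
    train_set: Sequence,
    dev_set: Sequence,
    finetune: Callable[[List], object],             # returns a finetuned model
    evaluate: Callable[[object, Sequence], float],  # returns dev accuracy
    sizes: Sequence[int] = (64, 256, 1024, 4096, 16384),
) -> List[Tuple[int, float]]:
    """Finetune on increasingly large random subsets and record dev accuracy."""
    results = []
    for n in sizes:
        subset = random.sample(list(train_set), min(n, len(train_set)))
        model = finetune(subset)
        results.append((n, evaluate(model, dev_set)))
    return results
```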
We compare BERT's performance when trained and evaluated on variants of SWAG (a construction sketch follows the list):
- Context: BERT's performance slips by only 11.9% when the context is omitted (Ending Only), which suggests that a bias exists in the endings themselves.
- Structure: We consider a new scenario, Shuffled, where the shared context is provided but the words in each ending choice are randomly permuted. BERT adapts to this easily, which suggests that it is largely performing lexical reasoning over each (context, answer) pair.
- Both: When the context is removed and the words in each ending are shuffled (Shuffled + Ending Only), performance drops to 60.4%, which is low but still higher than ELMo's. Systems likely learn primarily to detect distributional stylistic patterns during finetuning.
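The sketch referenced above shows one way such ablation variants could be constructed from a SWAG-style example. The field names (`ctx`, `endings`, `label`) and the sample example are illustrative assumptions, not the dataset's actual schema.

```python
import random
from typing import Dict, List


def make_variant(example: Dict, ending_only: bool = False, shuffled: bool = False) -> Dict:
    """Build an ablated copy of a multiple-choice example.

    ending_only -- drop the shared context ("Ending Only")
    shuffled    -- randomly permute the words inside each ending ("Shuffled")
    """
    ctx = "" if ending_only else example["ctx"]
    endings: List[str] = []
    for ending in example["endings"]:
        words = ending.split()
        if shuffled:
            random.shuffle(words)
        endings.append(" ".join(words))
    return {"ctx": ctx, "endings": endings, "label": example["label"]}


# The hardest ablation removes the context *and* shuffles every ending.
example = {
    "ctx": "A woman is seen sitting at a piano.",
    "endings": ["She begins to play a song.", "She eats the piano.",
                "The piano stands up and leaves.", "She swims across the keys."],
    "label": 0,
}
shuffled_ending_only = make_variant(example, ending_only=True, shuffled=True)
```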
Zellers et al. (2018) used a two-layer LSTM for generation, with shallow stylistic adversarial filters. This setup was robust against ELMo models, but did the shallow LM in particular produce distributional artifacts that BERT picks up on? To investigate, we perform AF using BERT-Large as the discriminator. The results show that the generations used in SWAG are so different from the human-written endings that AF never drops accuracy to chance; instead, it converges at roughly 75%. GPT's generations, on the other hand, are good enough that BERT's accuracy drops below 30%.
This suggests that high-quality generators and discriminators are crucial to AF's success, but it does not imply that commonsense NLI is solved.
We extract 80k context and follow-up paragraphs from WikiHow. Each context, and each follow-up, has at most three sentences. Given more context, it becomes easier to classify an ending as machine- or human-written. With two-sentence generations, we find ourselves in a Goldilocks zone wherein generations are challenging for deep models yet, as we shall soon see, easy for humans. In the end, we keep the 25k best ActivityNet contexts and the 45k best WikiHow contexts.
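A minimal sketch of this kind of context extraction, assuming paragraphs have already been scraped from WikiHow and using NLTK's sentence tokenizer; how the "best" contexts are ultimately selected is not shown here.

```python
from typing import Optional, Tuple

from nltk.tokenize import sent_tokenize  # pip install nltk; nltk.download("punkt")


def split_paragraph(
    paragraph: str,
    max_ctx_sents: int = 3,
    follow_up_sents: int = 2,
) -> Optional[Tuple[str, str]]:
    """Split a WikiHow paragraph into a short context and a follow-up,
    keeping both within the length limits described above."""
    sents = sent_tokenize(paragraph)
    if len(sents) < max_ctx_sents + follow_up_sents:
        return None  # too short to yield both a context and a follow-up
    context = " ".join(sents[:max_ctx_sents])
    follow_up = " ".join(sents[max_ctx_sents:max_ctx_sents + follow_up_sents])
    return context, follow_up
```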
We evaluate the difficulty of HellaSwag using a variety of strong baselines, with and without massive pretraining. Each model, given a context and a candidate ending, returns a logit for that ending (a scoring sketch follows the list):
- OpenAI GPT
- BERT-Base
- ESIM+ELMo
- LSTM sentence encoder
- FastText
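As an illustration of this scoring setup, the snippet below uses Hugging Face's `BertForMultipleChoice` to produce one logit per (context, ending) pair. The example context and endings are made up, and the preprocessing shown is not the paper's exact configuration.

```python
import torch
from transformers import BertForMultipleChoice, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")
model.eval()  # the multiple-choice head is random until finetuned on the task

context = "A man is standing on a ladder, cleaning the gutters of his house."
endings = [
    "He climbs down and dumps the leaves into a bin.",
    "He starts to swim across the roof.",
    "The ladder wanders off to buy groceries.",
    "He paints the gutters with orange juice.",
]

# Encode each (context, ending) pair; the model expects inputs of shape
# (batch_size, num_choices, seq_len) and returns one logit per choice.
enc = tokenizer([context] * len(endings), endings,
                return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)
prediction = logits.argmax(dim=-1).item()
```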
Human performance is over 95%, while every model scores below 50%. Despite BERT-Large having been used as the adversarial filter, it still performs strongest, at 47.3% overall. By making the dataset adversarial for BERT, we seem to have made it adversarial for every other model as well.
The best models are those trained on the same dataset they are evaluated on: training on SWAG and evaluating on HellaSwag lowers performance by 12%, and the reverse lowers it by 15%.
By bringing in the best-known generator, GPT, and the best-known discriminator, BERT-Large, we produced a dataset that is adversarial not just to BERT, but to every model we have access to.