Paper, Tags: #nlp, #datasets
BERT's peak performance on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. In this paper we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze these cues and demonstrate that a range of models all exploit them.
Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work.
Argument comprehension is the task of determining argumentative structure in natural language: which text segments represent claims, and which constitute reasons that support or attack those claims. Even for humans, it is difficult to determine whether two text segments stand in an argumentative relation.
One approach to the problem focuses on warrants: pieces of world knowledge that license inferences.
(1) It is raining, therefore (2) you should take an umbrella.
The warrant (3) 'it is bad to get wet' could license this inference. Knowing (3) facilitates drawing the inferential connection between (1) and (2). However, warrants are often left implicit, so machine learners must not only reason with warrants, but first discover them.
- Claim: Google is not a harmful monopoly
- Reason: People can choose not to use Google
- Warrant: Other search engines don't redirect to Google
- Alternative: All other search engines redirect to Google
The resulting inference patterns:
- Reason + and since + Warrant -> Claim
- Reason + but since + Alternative -> !Claim
The inference from R and A to !C holds by design.
The Argument Reasoning Comprehension Task (ARCT) defers the problem of discovering warrants and focuses on inference. An argument is provided, comprising a claim C and reason R. The task is to pick the correct warrant W over a distractor, an alternative warrant A. The alternative is written such that R & A -> !C.
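To make the setup concrete, here is a minimal sketch of how a BERT sequence-pair classifier might score the two candidate warrants. The input formatting (claim and reason paired with each warrant) and the use of the HuggingFace Transformers API are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch: score each candidate warrant against (claim + reason) with a
# BERT sequence-pair classifier and pick the higher-scoring one.
# Assumes HuggingFace Transformers; the paper's exact model may differ.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
model.eval()

claim = "Google is not a harmful monopoly"
reason = "People can choose not to use Google"
warrants = [
    "Other search engines don't redirect to Google",  # correct warrant W
    "All other search engines redirect to Google",    # alternative (distractor) A
]

scores = []
with torch.no_grad():
    for w in warrants:
        enc = tokenizer(claim + " " + reason, w, return_tensors="pt")
        scores.append(model(**enc).logits.item())

# The model's prediction is the warrant with the higher score.
predicted = warrants[scores.index(max(scores))]
print(predicted)
```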
Even with warrants supplied, learners still need further world knowledge. In the example above, it is necessary to know how consumer choice and web redirects relate to the concept of a monopoly, and that Google is a search engine.
All but one of the participating systems in the original shared task failed to exceed 60% accuracy (on binary classification).
It is therefore surprising that BERT achieves 77% test set accuracy, only three points below the average (untrained) human baseline. It does not seem reasonable to expect a model to perform this well without access to the required world knowledge.
What has BERT learned about argument comprehension?
BERT exploits the presence of cue words in the warrant, especially 'not'. We demonstrate that BERT's surprising performance can be entirely accounted for by exploitation of these spurious statistical cues.
Since R & A -> !C, we can add a copy of each data point with the claim negated and the label inverted. This mirrors the distribution of statistical cues in the warrants over both labels, eliminating the signal. On this adversarial dataset all models perform at random, with BERT achieving a maximum test set accuracy of 53%.
The adversarial dataset therefore provides a more robust evaluation of argument comprehension and should be adopted as the standard in future work on this dataset.
The major source of spurious statistical cues in ARCT comes from uneven distributions of linguistic artifacts over the warrants, and therefore over the labels.
By productivity and coverage (formulas in the paper), the strongest unigram cue was 'not', followed by 'is', 'do', and 'are'. Bigrams containing 'not', such as 'will not' and 'cannot', were also found to be highly productive.
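A sketch of how these cue statistics can be computed, based on my reading of the paper's definitions: applicability counts data points where the cue appears in exactly one of the two candidate warrants, productivity is the fraction of applicable points where it appears in the correct warrant, and coverage is applicability over the dataset size. The field names (warrant0, warrant1, label) are assumptions for illustration.

```python
# Sketch: productivity and coverage of a lexical cue over an ARCT-style dataset.
# Data points are dicts with hypothetical keys: warrant0, warrant1, label (0 or 1).
def cue_statistics(data, cue):
    applicable = 0          # cue occurs in exactly one of the two warrants
    predicts_correct = 0    # ...and that warrant is the labeled (correct) one
    for point in data:
        warrants = [point["warrant0"], point["warrant1"]]
        present = [cue in w.lower().split() for w in warrants]
        if sum(present) == 1:
            applicable += 1
            if present[point["label"]]:
                predicts_correct += 1
    productivity = predicts_correct / applicable if applicable else 0.0
    coverage = applicable / len(data) if data else 0.0
    return productivity, coverage

# Toy usage:
toy = [
    {"warrant0": "it is not useful", "warrant1": "it is useful", "label": 0},
    {"warrant0": "prices will rise", "warrant1": "prices will not rise", "label": 0},
]
print(cue_statistics(toy, "not"))
```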
Given R & A -> !C, we can negate the claim and invert the label for each data point. The adversarial examples are then combined with the original data, eliminating the problem by mirroring the distributions of cues around both labels.
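A minimal sketch of this augmentation, under the stated construction R & A -> !C: negating the claim and inverting the label produces a valid data point in which the alternative warrant becomes the correct answer. The negation helper below is a naive assumption for illustration; the paper's negated claims were produced more carefully.

```python
# Sketch: build the adversarial dataset by adding, for each original point,
# a copy with the claim negated and the label inverted.
def negate_claim(claim):
    # Naive negation (assumption): drop "not" if present, otherwise prefix the claim.
    words = claim.split()
    if "not" in words:
        words.remove("not")
        return " ".join(words)
    return "It is not true that " + claim[0].lower() + claim[1:]

def make_adversarial(dataset):
    augmented = list(dataset)
    for point in dataset:
        augmented.append({
            "claim": negate_claim(point["claim"]),
            "reason": point["reason"],
            "warrant0": point["warrant0"],
            "warrant1": point["warrant1"],
            "label": 1 - point["label"],  # the alternative now supports the negated claim
        })
    return augmented
```

Because every warrant now appears with both labels, any cue in the warrants is evenly distributed over the labels and carries no signal.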
- First experimental setup: models trained and validated on the original data were evaluated on the adversarial set. All results were worse than random due to overfitting the cues in the original data set.
- Second experimental setup: models trained from scratch on adversarial training and val sets, then evaluated on the adversarial test set. BERT's peak performance is reduced to 53%.
The adversarial dataset has successfully eliminated the clues as expected, providing a more robust evaluation of machine argument comprehension.
With little to no understanding of the reality underlying these arguments, good performance should not be feasible.