We introduce reactive supervision, a novel data collection method that uses the dynamics of online conversations to overcome the limitations of existing techniques. We release a large dataset of tweets with sarcasm perspective labels and new contextual features.
Previous work describes three methods for collecting sarcasm data:
- Distant supervision: automatically collecting 'in-the-wild' sarcastic tweets by leveraging author-generated labels such as the #sarcasm hashtag. Datasets are large and cheap to generate, but noisy and biased, and this approach collects intended sarcasm only.
- Manual annotation: tweets annotated by human judges. Inter-annotator reliability is low, due to the subjective nature of sarcasm and the lack of context, and this approach collects perceived sarcasm only.
- Manual collection: humans gather and report sarcastic texts, either their own or others'. This is slow and expensive, and the resulting datasets are small.
Our method automatically collects high-volume, 'in-the-wild' data for both intended and perceived sarcasm. It exploits the frequent use of a cue tweet: a reply that highlights sarcasm in a prior tweet in the same thread. We first search for cue tweets, then carefully examine each cue tweet to identify the corresponding sarcastic tweet. We call this reactive supervision because the users themselves, by reacting to sarcasm, act as annotators.
We obtain the personal subject pronoun used in the cue and map it to a grammatical person class (1st, 2nd, or 3rd person). This tells us whether the sarcastic author is the author of the cue itself, its addressee, or another party. For each person class we then apply a heuristic to identify the sarcastic tweet.
If a cue author refers to themselves, as in 'I was just being sarcastic', and has written two or more tweets prior to the cue, the sarcastic tweet cannot be unambiguously pinpointed and the entire thread is discarded. We formalize this by creating a regex for the case; 2nd- and 3rd-person cues produce corresponding rules and patterns, as in the sketch below.
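As a minimal sketch of the 1st-person rule, assume each thread is encoded root-to-cue as a string of author symbols, with `C` for the cue author and `O` for anyone else (the encoding and pattern are illustrative, not the exact published rules):

```python
import re

# 'C' = cue author, 'O' = any other author, read root -> cue.
# Valid 1st-person threads: the cue author appears exactly once before the cue.
FIRST_PERSON = re.compile(r"^O*CO*C$")

def match_first_person(author_seq: str) -> int | None:
    """Return the index of the sarcastic tweet, or None to discard the thread."""
    if FIRST_PERSON.fullmatch(author_seq):
        return author_seq.index("C")    # the cue author's single earlier tweet
    return None                         # zero or 2+ earlier tweets: ambiguous

print(match_first_person("OCOC"))  # -> 1 (second tweet is the sarcastic one)
print(match_first_person("CCOC"))  # -> None (two earlier tweets by the cue author)
```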
We also obtain the oblivious tweet: a reply in the middle of the thread whose author failed to recognize the sarcasm (the eliciting tweet, the one the sarcastic tweet replies to, is captured as well). Because participants in the thread are familiar with the conversation's context, we overcome some quality issues of using external annotators, who are often unfamiliar with that context due to cultural and social gaps.
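For illustration, here is a hypothetical thread with the roles assigned under a 3rd-person cue; every tweet is invented:

```python
# A hypothetical thread (root first). The cue replies to the oblivious tweet,
# which replies to the sarcastic tweet, which replies to the eliciting tweet.
thread = [
    {"author": "alice", "role": "eliciting", "text": "Forecast says sun all week."},
    {"author": "bob",   "role": "sarcastic", "text": "Great, I love carrying an umbrella."},
    {"author": "carol", "role": "oblivious", "text": "Why would you bring an umbrella?"},
    {"author": "dave",  "role": "cue",       "text": "@carol he was just being sarcastic"},
]
```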
Our algorithm has four steps (a condensed sketch follows the list):
- Fetch: calls the Twitter Search API to collect cue tweets
- Classify: a rule-based, precision-oriented classifier that labels cues as 1st-, 2nd-, or 3rd-person according to the pronoun used. If the cue cannot be accurately classified (no pronoun is found, multiple pronouns are present, or negation words appear), it is labeled as unknown and discarded.
- Traverse: calls the Twitter Lookup API to retrieve the full thread, starting from the cue tweet and repeatedly fetching the parent tweet until reaching the root.
- Match: matches the thread's author sequence against the regex for the cue's person class. Unmatched sequences are discarded; otherwise, the sarcastic tweet is identified and saved along with the cue tweet.
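Putting the steps together, here is a condensed sketch of the pipeline. The `search_cues`, `get_parent`, and `match` callables are hypothetical stand-ins for the Twitter Search API, the Lookup API, and the per-class regex matching; the pronoun lexicon and negation list are illustrative:

```python
CUE_QUERY = '"being sarcastic"'   # search phrase used during collection

# Minimal pronoun-to-class lexicon and negation list (illustrative only).
PRONOUNS = {"i": "1st", "we": "1st", "you": "2nd",
            "he": "3rd", "she": "3rd", "they": "3rd"}
NEGATIONS = {"not", "never"}

def classify_cue(text: str) -> str:
    """Rule-based, precision-oriented: return a person class or 'unknown'."""
    tokens = text.lower().split()
    if NEGATIONS & set(tokens):                  # e.g. "I was not being sarcastic"
        return "unknown"
    classes = {PRONOUNS[t] for t in tokens if t in PRONOUNS}
    return classes.pop() if len(classes) == 1 else "unknown"

def traverse(cue: dict, get_parent) -> list:
    """Walk parent links from the cue tweet up to the root (root-first order)."""
    thread = [cue]
    while thread[0].get("parent_id"):
        thread.insert(0, get_parent(thread[0]["parent_id"]))
    return thread

def run(search_cues, get_parent, match):
    """Fetch -> Classify -> Traverse -> Match."""
    for cue in search_cues(CUE_QUERY):           # Fetch (Search API)
        person = classify_cue(cue["text"])       # Classify
        if person == "unknown":
            continue                             # discard unclassifiable cues
        thread = traverse(cue, get_parent)       # Traverse (Lookup API)
        sarcastic = match(thread, person)        # Match (regex over author sequence)
        if sarcastic is not None:
            yield person, thread, sarcastic
```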
The pipeline ran for 48 days in Oct-Nov 2019, collecting 65k cue tweets containing the phrase "being sarcastic" along with their corresponding threads. 77% of the cues were classified as unknown, leaving 15k English sarcastic tweets, an average of 312 per day. In addition, 10.6k oblivious and 9.1k eliciting tweets were automatically captured. We added 15k negative instances by sampling random English tweets from the same period.
Most captured cues are 1st person (10.3k), followed by 2nd person (3k) and 3rd person (1.7k). We found that most intended sarcasm tweets are replies, while the majority of perceived sarcasm tweets are root tweets.
BERT is the best-performing model, with 70.3% accuracy. We compared SPIRS' classification results to those on the Ptacek et al. dataset, where accuracy reaches 86.6%. We posit that sarcasm is confounded with locale in Ptacek: its sarcastic tweets come from worldwide users while its non-sarcastic tweets come from users near Prague, so classifiers learn features correlated with locale rather than sarcasm. Indeed, replacing our negative samples with Ptacek's boosted our accuracy by 19.1%.
This experiment uses conversation context by adding the eliciting and oblivious tweets to the model's input. As far as we know, this is the first sarcasm-related task to use oblivious texts. We use BiLSTMs; accuracy for the full-context model was 74.7%.
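A minimal sketch of one way to realize the full-context model, assuming each tweet arrives as a padded sequence of word embeddings; the shared encoder, layer sizes, and fusion by concatenation are assumptions, not the exact published configuration:

```python
import torch
import torch.nn as nn

class ContextBiLSTM(nn.Module):
    """Encode eliciting, sarcastic, and oblivious tweets with a shared BiLSTM
    and classify the concatenated representations (assumed architecture)."""
    def __init__(self, emb_dim=100, hidden=128, n_contexts=3):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Linear(2 * hidden * n_contexts, 2)

    def encode(self, x):                      # x: (batch, seq_len, emb_dim)
        _, (h, _) = self.encoder(x)           # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=-1)

    def forward(self, eliciting, sarcastic, oblivious):
        parts = [self.encode(t) for t in (eliciting, sarcastic, oblivious)]
        return self.classifier(torch.cat(parts, dim=-1))  # logits: sarcastic or not
```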
Using the new labels in our dataset, we propose a new task: classifying a sarcastic tweet's perspective as intended or perceived. BERT again performs best, with an accuracy of 68.2%.
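The perspective labels follow directly from the cue's person class; a one-function sketch of the mapping implied by the collection method (1st-person cues confirm the author's own, i.e. intended, sarcasm; 2nd- and 3rd-person cues report sarcasm perceived by others):

```python
def perspective(person_class: str) -> str:
    # 1st person: the author confirms their own sarcasm -> intended.
    # 2nd/3rd person: another user flags the sarcasm    -> perceived.
    return "intended" if person_class == "1st" else "perceived"
```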
We hypothesize that longer, more informative texts make sarcasm easier to perceive. We also believe the classifier learns discourse-related features (original tweet vs. reply tweet), which may account for some of its errors. Further analysis of sarcasm perspective and its interplay with sarcasm pragmatics is a promising avenue for future research.