Paper, Tags: #sarcasm-detection, #datasets
Social media platforms such as Twitter increasingly let users post multi-modal messages that combine text, images, and videos. This paper focuses on multi-modal sarcasm detection for tweets consisting of text and images. We treat text features, image features, and image attributes as three modalities and propose a multi-modal hierarchical fusion model: it first extracts image features and attribute features, then leverages the attribute features and a biLSTM to extract text features. We also create a multi-modal sarcasm detection dataset based on Twitter.
The attribute features are used to initialize the biLSTM, which in turn extracts the text features. The three feature sets are transformed into reconstructed representation vectors, which is more effective than simple concatenation.
Little has been revealed so far on how to effectively combine textual and visual information to boost performance on Twitter sarcasm detection. Schifanella et al. (2016) simply concatenate manually designed or deep-learning-based features of text and images to make predictions with two modalities.
The image attribute modality has been shown to boost model performance by adding high-level concepts of the image content.
For the image modality, we use a pre-trained and fine-tuned ResNet model to obtain 14x14 regional vectors of the tweet image, defined as the raw image vectors, and average them to get the image guidance vector.
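A minimal PyTorch sketch of this step, assuming a torchvision ResNet-50 as a stand-in for the paper's pre-trained and fine-tuned backbone, and a 448×448 input so the final feature map is 14×14:

```python
import torch
import torchvision

# ResNet-50 backbone without the average-pool and fc head, so the output is
# the last convolutional feature map.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

image = torch.randn(1, 3, 448, 448)                    # 448x448 input -> 14x14 map
with torch.no_grad():
    fmap = backbone(image)                             # (1, 2048, 14, 14)

raw_image_vectors = fmap.flatten(2).transpose(1, 2)    # (1, 196, 2048) regional vectors
image_guidance = raw_image_vectors.mean(dim=1)         # (1, 2048) image guidance vector
```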
For the image attribute modality, we use another pre-trained and fine-tuned ResNet model to predict five attributes for each image. We use a biLSTM to obtain the text vectors.
Feature vectors of the three modalities are reconstructed using the raw vectors and guidance vectors. The refined vectors are combined into one vector with a weighted average instead of simple concatenation. The fused vector is fed into a two-layer fully-connected network to obtain the classification result.
We use ResNet-50 V2 to obtain representations of tweet images.
We treat attributes as an extra modality bridging the tweet text and image, by directly using the word embedding of 5 predicted attributes of each tweet image as the raw attribute vectors. (more explanations in paper).
We use a biLSTM to obtain the representation of the tweet text.
We apply the non-linearly transformed attribute guidance vector as the initial state of the biLSTM (early fusion).
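A minimal PyTorch sketch of the attribute modality and this early fusion, with hypothetical vocabulary and dimension sizes; the mean pooling and the specific non-linear transform are illustrative choices, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 20000, 200, 256
embed = nn.Embedding(vocab_size, emb_dim)
to_init = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.Tanh())   # non-linear transform
bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

attr_ids = torch.randint(0, vocab_size, (1, 5))     # 5 predicted attribute words
text_ids = torch.randint(0, vocab_size, (1, 30))    # tweet tokens

raw_attr_vectors = embed(attr_ids)                  # (1, 5, emb_dim) raw attribute vectors
attr_guidance = raw_attr_vectors.mean(dim=1)        # (1, emb_dim) attribute guidance vector

# Early fusion: the transformed attribute guidance vector initializes both
# directions of the biLSTM that encodes the tweet text.
h0 = to_init(attr_guidance).unsqueeze(0).repeat(2, 1, 1)   # (2 directions, 1, hidden_dim)
c0 = torch.zeros_like(h0)
raw_text_vectors, _ = bilstm(embed(text_ids), (h0, c0))    # (1, 30, 2*hidden_dim)
text_guidance = raw_text_vectors.mean(dim=1)               # (1, 2*hidden_dim)
```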
We obtain the fused representation as a weighted average of the feature vectors (representation fusion). To leverage as much information as possible and to model the relationships among the modalities more accurately, we exploit information from all three modalities.
(details in paper).
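A minimal PyTorch sketch of representation fusion and classification under assumed layer sizes; the learned modality weights and the reconstruction step are simplified relative to the paper and only illustrate the shape of the computation:

```python
import torch
import torch.nn as nn

class RepresentationFusion(nn.Module):
    def __init__(self, dims=(2048, 512, 200), fused_dim=512):
        super().__init__()
        # One projection and one scoring layer per modality: image, text, attribute.
        self.proj = nn.ModuleList(nn.Linear(d, fused_dim) for d in dims)
        self.score = nn.ModuleList(nn.Linear(fused_dim, 1) for _ in dims)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, 1),            # 2-layer FC head: sarcastic vs. not
        )

    def forward(self, modality_vectors):
        refined = [torch.tanh(p(v)) for p, v in zip(self.proj, modality_vectors)]
        scores = torch.cat([s(r) for s, r in zip(self.score, refined)], dim=1)
        weights = torch.softmax(scores, dim=1)  # (batch, 3) modality weights
        fused = sum(w.unsqueeze(-1) * r
                    for w, r in zip(weights.unbind(dim=1), refined))
        return self.classifier(fused)           # sarcasm logit

image_g, text_g, attr_g = torch.randn(1, 2048), torch.randn(1, 512), torch.randn(1, 200)
logit = RepresentationFusion()((image_g, text_g, attr_g))
```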
There is no publicly available dataset for evaluating the multi-modal sarcasm detection task, so we build our own. We collect English tweets containing a picture and a special hashtag (#sarcasm) as positive examples, and tweets with images but without such hashtags as negative examples. We discard tweets containing keywords such as sarcasm, sarcastic, and irony, as well as tweets containing URLs. For preprocessing, we replace mentions with a placeholder symbol, separate words, emoticons, and hashtags, separate the hashtag sign from the hashtag word, and lowercase the text.
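A minimal sketch of the filtering and preprocessing described above; the regular expressions and the `<user>` placeholder are illustrative choices rather than the paper's exact rules:

```python
import re

SARCASM_WORDS = ("sarcasm", "sarcastic", "irony", "ironic")

def keep_tweet(text: str) -> bool:
    """Drop tweets that leak the label or contain URLs."""
    lowered = text.lower()
    if any(w in lowered for w in SARCASM_WORDS):
        return False
    if re.search(r"https?://\S+", lowered):
        return False
    return True

def preprocess(text: str) -> str:
    """Replace mentions, split the hashtag sign from its word, lowercase."""
    text = re.sub(r"@\w+", "<user>", text)       # mentions -> placeholder symbol
    text = re.sub(r"#(\w+)", r"# \1", text)      # separate '#' from the hashtag word
    return text.lower()

print(preprocess("@alice I just LOVE mondays #NotReally"))
# -> '<user> i just love mondays # notreally'
```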
We compare several models:
- Random: randomly predicts whether a tweet is sarcastic
- Text (biLSTM): biLSTM over the tweet text only
- Text (CNN): CNN over the tweet text only
- Image: image modality only
- Attr.: attribute modality only
- Concat.: concatenation of the features of the modalities
Models based only on the image or attribute modality do not perform well, while both text-only models perform much better. Removing early fusion decreases performance, and removing representation fusion also decreases performance. With the help of the image and attribute modalities, the proposed model correctly detects sarcasm in the paper's example tweets.
Evaluation results demonstrate the effectiveness of the proposed model and the usefulness of the three modalities. In future work, we will incorporate other modalities, such as audio, into the sarcasm detection task and also investigate how to make use of common-sense knowledge in our model.