Paper, Tags: #sarcasm-detection
CASCADE is a hybrid approach that combines content- and context-driven modeling for sarcasm detection in online social media discussions. For the latter, CASCADE extracts contextual information from the discourse of a discussion thread. And because the nature and form of sarcastic expression vary from person to person, it also uses user embeddings that encode the stylometric and personality features of users. When used with a CNN, this yields a significant boost in classification performance on a large Reddit corpus.
Sarcasm is not only expressed via lexical cues such as punctuation, interjections, etc., but also implicitly, relying on context.
CASCADE uses the users' historical posts to model their writing style (stylometry) and personality indicators, which are then fused into comprehensive user embeddings using a multi-view fusion approach, Canonical Correlation Analysis (CCA). It also extracts contextual information from the discourse of comments in the discussion forums via document modeling. After this contextual modeling phase, CASCADE performs content modeling using a CNN to extract the comment's syntactic and semantic features. This CNN representation is then concatenated with the relevant user embedding and discourse features to get the final representation used for classification.
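A high-level sketch of this pipeline in Python, with hypothetical helper names (content_cnn, user_embeddings, discourse_features) standing in for the components detailed below:

```python
# High-level sketch of the CASCADE pipeline. The helpers are hypothetical
# placeholders for the components described in the following sections.
import numpy as np

def cascade_representation(comment, user_id, forum_id,
                           content_cnn, user_embeddings, discourse_features):
    c = content_cnn(comment)              # content modeling (CNN)
    u = user_embeddings[user_id]          # stylometric + personality, CCA-fused
    t = discourse_features[forum_id]      # forum-level discourse vector
    return np.concatenate([c, u, t])      # unified vector for classification
```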
Previous work can be classified into 2 categories:
- Content-based: treats sarcasm detection as a standard classification task, finding lexical and pragmatic indicators: prosodic and spectral cues in spoken dialogue, positive predicates, interjections, emoticons, positive sentiment words in negative situations, etc.
- Context-based: shows the importance of speaker and/or topical information associated with a text for gathering such context. Mainly users' historical posts have been used, along with sentiment and noun phrases used within a forum to capture the context typical of that forum, etc.
For the content-based information from the comment, a CNN is used. It extracts location-invariant local patterns, capturing both syntactic and semantic information.
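A minimal sketch of such a content CNN (a Kim-style text CNN), assuming PyTorch; the filter sizes and dimensions here are illustrative, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentCNN(nn.Module):
    """Maps a comment's token ids to a fixed-size content vector cij."""
    def __init__(self, vocab_size, emb_dim=100, n_filters=128, filter_sizes=(3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One 1-D convolution per filter size; each captures local n-gram patterns.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in filter_sizes]
        )

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        # Max-over-time pooling makes the features location-invariant.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                 # (batch, n_filters * len(filter_sizes))
```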
For contextual modeling, CASCADE first learns user embeddings and discourse features for all users and discussion forums. For a given comment, it then retrieves the user embedding ui of the author and the discourse feature vector tj of the forum. All three vectors cij, ui, and tj are concatenated and used for classification.
To generate user embeddings, we model users' stylometric and personality features and then fuse them using CCA to create a single representation.
For the stylometric features, we use ParagraphVector, an unsupervised representation learning method, to generate a fixed-size vector for each user (one document per user, built from their historical comments) by performing the auxiliary task of predicting the words within the document.
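A sketch using gensim's Doc2Vec implementation of ParagraphVector; the vector size and training parameters are illustrative:

```python
# Learn a stylometric embedding per user with gensim's Doc2Vec
# (an implementation of ParagraphVector).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def stylometric_embeddings(user_comments):      # {user_id: [comment, ...]}
    # One document per user: all their historical comments appended.
    docs = [TaggedDocument(words=" ".join(comments).split(), tags=[user_id])
            for user_id, comments in user_comments.items()]
    model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=20)
    return {user_id: model.dv[user_id] for user_id in user_comments}
```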
For the personality features, for every user ui we iterate over all the vi comments written by them, providing each comment as input to a CNN pre-trained on a multi-label personality detection task.
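A sketch of aggregating the per-comment CNN outputs into one personality vector per user; the mean pooling and the personality_cnn interface are assumptions for illustration:

```python
import numpy as np

def personality_embedding(comments, personality_cnn):
    # personality_cnn(comment) -> 1-D feature vector from the CNN pre-trained
    # on multi-label personality detection (hypothetical interface).
    features = np.stack([personality_cnn(c) for c in comments])
    return features.mean(axis=0)   # assumed aggregation over the vi comments
```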
We then fuse the stylometric and personality features into a comprehensive embedding for each user using CCA (page 5 of the paper).
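A sketch of the fusion step with scikit-learn's CCA, where X holds the stylometric vectors and Y the personality vectors, one row per user; summing the projected views into a single embedding is one plausible choice, not necessarily the paper's exact formulation:

```python
from sklearn.cross_decomposition import CCA

def fuse_user_views(X, Y, n_components=50):
    # Project both views into a shared space where they are maximally correlated.
    cca = CCA(n_components=n_components)
    X_c, Y_c = cca.fit_transform(X, Y)
    return X_c + Y_c   # one fused embedding per user (illustrative combination)
```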
To extract discourse features from the forums, for all Nt discussion forums we compose each forum's document by appending the comments within it, and then use ParagraphVector to generate a discourse representation for each document.
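The same Doc2Vec setup sketched above for the stylometric features applies here, with one document per forum instead of per user; forum_comments is a hypothetical {forum_id: [comment, ...]} mapping:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

forum_docs = [TaggedDocument(words=" ".join(comments).split(), tags=[forum_id])
              for forum_id, comments in forum_comments.items()]
discourse_model = Doc2Vec(forum_docs, vector_size=100, min_count=2, epochs=20)
discourse_features = {f: discourse_model.dv[f] for f in forum_comments}
```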
We concatenate the text representation cij, the user embedding ui, and the discourse feature tj to form the unified text representation, then pass it to an output layer with 2 neurons and a softmax activation.
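A sketch of this fusion and output layer, assuming PyTorch:

```python
import torch
import torch.nn as nn

class SarcasmClassifier(nn.Module):
    def __init__(self, c_dim, u_dim, t_dim):
        super().__init__()
        # Two output neurons: sarcastic vs. non-sarcastic.
        self.out = nn.Linear(c_dim + u_dim + t_dim, 2)

    def forward(self, c, u, t):
        fused = torch.cat([c, u, t], dim=1)    # unified text representation
        return torch.softmax(self.out(fused), dim=1)
```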
We perform our experiments on SARC, a large-scale self-annotated corpus for sarcasm with 1M+ examples of sarcastic/non-sarcastic statements made on Reddit. We consider 3 variants of the SARC dataset in our experiments:
- Main balanced: the primary dataset, with a balanced distribution of sarcastic and non-sarcastic comments
- Main imbalanced: to emulate real-world situations where there are far fewer sarcastic comments than non-sarcastic ones, we use an imbalanced version with a 20-80 ratio
- Pol: we extract a subset of the dataset on the topic of politics
For baseline models, we use:
- Bag of words
- CNN: can only model the content of a comment
- CNN-SVM: a CNN for content modeling plus other pre-trained CNNs for extracting sentiment, emotion, and personality features. All features are concatenated and fed to an SVM for classification
- CUE-CNN: models user embeddings with a technique similar to ParagraphVector, then combines the embeddings with a CNN.
CASCADE achieves major improvements across all datasets with statistical significance, while the worst performance comes from BoW. We also show that including contextual features increases performance over the content-only CNN, with a further major boost from including user embeddings.
Overall, CASCADE, consisting of a CNN with user embeddings and contextual discourse features, provides the best performance on all three datasets. The biggest boost from including user embeddings occurs on the Pol dataset.
Sometimes the model fails despite the presence of contextual information from previous comments; this is observed when the contextual comments are very long, where sequential discourse modeling using RNNs might be better suited. With user embeddings, there were also some misclassifications for users with fewer historical posts. In such scenarios, a potential solution would be to create user networks and derive information from similar users within the network.