Paper, Tags: #nlp
This work proposes a new challenge set for multimodal classification, focused on detecting hate speech in multimodal memes. Challenge set means that the purpose isn't to train models from scratch, but to fine-tune and test large-scale multimodal models that were pre-trained.
Difficult examples (benign confounders) are added to the dataset so that models cannot rely on unimodal signals alone, countering the possibility that they simply exploit unimodal priors.
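As a rough illustration of the confounder idea (a sketch, not actual data from the dataset; the file names and captions below are made up), a hateful meme gets benign counterparts where only one modality is swapped, so a model using only that modality cannot separate the labels:

```python
from dataclasses import dataclass


@dataclass
class MemeExample:
    image_path: str  # hypothetical path to the meme image
    text: str        # caption overlaid on the image
    label: int       # 1 = hateful, 0 = benign


# Hypothetical hateful meme: image and text are only hateful in combination.
original = MemeExample("meme_001.png", "caption A", 1)

# Text confounder: same image, different caption -> benign.
text_confounder = MemeExample("meme_001.png", "caption B", 0)

# Image confounder: same caption, different image -> benign.
image_confounder = MemeExample("meme_002.png", "caption A", 0)

# A text-only classifier sees identical input for `original` and
# `image_confounder`, yet their labels differ, so unimodal shortcuts fail.
```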
Interestingly, the difference between unimodally pretrained and multimodally pretrained models is relatively small, indicating that multimodal pretraining can probably be improved further. Human accuracy is 84.7%.
The fact that even the best multimodal models are still far from human performance shows that there is much room for improvement.