Ashmal Vayani♠, Dinura Dissanayake♠, Hasindri Watawana♠, Noor Ahsan♠, Nevasini Sasikumar♠, Omkar Thawakar♠, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha♠, Hisham Cholakkal♠, Rao Muhammad Anwer♠, Michael Felsberg♠, Jorma Laaksonen♠, Thamar Solorio♠, Monojit Choudhury♠, Ivan Laptev♠, Mubarak Shah♠, Salman Khan♠, and Fahad Khan♠
University of Central Florida, Mohamed bin Zayed University of AI, Amazon, Aalto University, Australian National University, Linköping University
Official GitHub repository for All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
- Nov-21-24: ArXiv preprint is released! 🔥🔥
- Nov-20-24: ALM-Bench dataset and code are released. The benchmark provides 22,763 human-annotated multimodal QA pairs across 19 categories to extensively evaluate the performance of LMMs. 🔥🔥
Figure: ALM-bench comprises a diverse set of 100 languages with annotations manually verified by native-language experts. Here, qualitative examples highlight the comprehensive set of 13 cultural aspects covered in the benchmark, such as heritage, customs, architecture, literature, music, and sports. It also evaluates visual understanding for six generic aspects. ALM-bench focuses on low-resource languages and different regions, spanning 73 countries across five continents and 24 distinct scripts. ALM-bench covers diverse question types, such as multiple-choice questions (MCQs), true/false (T/F), and short and long visual question answers (VQAs).
Abstract: Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short- and long-answer categories. ALM-bench's design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open- and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark and code are publicly available.
ALM-Bench assesses multilingual multimodal models for better cultural understanding and inclusivity.
Main contributions:
- All Languages Matter Benchmark (ALM-Bench): We introduce ALM-bench, a culturally diverse multilingual and multimodal VQA benchmark covering 100 languages with 22.7K question-answer pairs. ALM-bench encompasses 19 generic and culture-specific domains for each language, enriched with four diverse question types (see the data-loading sketch after this list).
- Extensive Human Annotation: ALM-bench is meticulously curated and verified with native-language experts (over 800 hours of human annotation), ensuring cultural relevance and accuracy across low- and high-resource languages alike.
- Comprehensive Evaluation: We benchmark existing LMMs on the ALM-bench, identifying performance gaps and areas for improvement, especially in culturally complex multilingual scenarios.
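For orientation, here is a minimal sketch of how the released QA pairs could be loaded and inspected. It assumes the dataset is hosted on the Hugging Face Hub; the repository ID, split name, and field names below are illustrative assumptions rather than a confirmed schema (see the dataset card for the actual layout).

```python
# Minimal sketch: loading and inspecting ALM-Bench QA pairs.
# The repo ID "MBZUAI/ALM-Bench", the split name, and the field names are
# assumptions for illustration, not a confirmed schema.
from datasets import load_dataset

ds = load_dataset("MBZUAI/ALM-Bench", split="test")  # hypothetical repo ID / split

# Each sample pairs an image with a question in one of the 100 languages,
# a question type (MCQ, T/F, short/long VQA), and a cultural or generic category.
sample = ds[0]
print(sample.keys())
for field in ("language", "category", "question_type", "question", "answer"):
    print(field, "->", sample.get(field))
```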
Table: Comparison of various LMM benchmarks with a focus on multilingual and cultural understanding. Domains indicate the range of aspects covered by the dataset for each language. Question Form is categorized as "Diverse" if the question phrasing varies, and "Fixed" otherwise. Annotation Types are classified as "Manual" if questions were originally in the local language, "Manual+Auto" if questions were generated or translated using GPT-4/Google API and subsequently validated by human experts, and "Auto" if generated or translated automatically without human validation. Bias Correction reflects whether the dataset is balanced across cultures and countries, while Diversity indicates whether the dataset includes both Western and non-Western minority cultures. '-' means the information is not available.
Figure: Data collection and verification pipeline. Our benchmark features both culture-specific content sourced from the web (left) and generic image-understanding content sourced from an existing LMM benchmark. The cultural part is carefully filtered to remove noisy samples and private information. We use GPT-4o for translations, which are manually verified and corrected by native speakers over more than 800 hours of human annotation. Our ALM-bench has diverse question types and features approximately 23K QA pairs in total across 100 languages.
The heatmap below presents the evaluation results of 16 recent LMMs, including 14 open-source and 2 closed-source models, on all 100 languages across the 19 categories of the ALM-Bench dataset.
Figure: Benchmarking LMMs on diverse languages & cultures: (a) Performance of various open- versus closed-source LMMs on ALM-bench. For each LMM, we also report performance on low- versus high-resource languages. All these carefully selected models were released after 2023. (b) Performance of high-performing LMMs on the 13 culturally curated categories in ALM-bench.
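The heatmap aggregates accuracy per language and per category. As an illustration only, the sketch below shows one way such a matrix could be built from per-question evaluation records; the record fields ("model", "language", "category", "correct") are hypothetical, not the repository's actual evaluation format.

```python
# Illustrative sketch: turning per-question evaluation records into a
# language x category accuracy matrix (the basis of a heatmap).
import pandas as pd

records = [
    {"model": "GPT-4o", "language": "Urdu", "category": "Festivals", "correct": 1},
    {"model": "GPT-4o", "language": "Urdu", "category": "Festivals", "correct": 0},
    # ... one record per evaluated QA pair (placeholder rows, not real results)
]

df = pd.DataFrame(records)
# Mean correctness per (language, category) for one model -> a 100 x 19 matrix.
heatmap = (
    df[df["model"] == "GPT-4o"]
    .pivot_table(index="language", columns="category", values="correct", aggfunc="mean")
)
print(heatmap.round(3))
```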
We study the performance of various LMMs with and without additional country location information. Proprietary models show a notable performance boost of 2.6% to 5% when location-aware prompts are used, while open-source models exhibit only a marginal improvement (a minimal prompt-construction sketch follows the table below).
| Models | With Country Info. | Without Country Info. |
|---|---|---|
| GPT-4o | 83.57% | 80.96% |
| Gemini-1.5-Pro | 81.52% | 76.19% |
| GLM-4V-9B | 56.78% | 56.41% |
| Qwen2-VL | 53.97% | 52.57% |
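Location-aware prompting simply adds the image's country to the query. The sketch below illustrates the two prompting conditions compared in the table; the exact prompt wording used in the paper is not reproduced here, so this template is an assumption.

```python
# Illustrative sketch of the two prompting conditions: with and without
# country/location context. The template wording is an assumption.
def build_prompt(question: str, language: str, country: str | None = None) -> str:
    header = f"Answer the following question about the image in {language}."
    if country is not None:
        # Location-aware variant: prepend the country associated with the image.
        header = f"This image is from {country}. " + header
    return f"{header}\nQuestion: {question}\nAnswer:"

# Example (Urdu question: "Which festival is shown in the image?")
print(build_prompt("تصویر میں کون سا تہوار دکھایا گیا ہے؟", "Urdu", country="Pakistan"))
print(build_prompt("تصویر میں کون سا تہوار دکھایا گیا ہے؟", "Urdu"))
```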
We compare the performance of GPT-4o and Qwen2-VL across different language scripts. Both GPT-4o and Qwen2-VL struggle particularly on low-resource scripts such as Ge'ez (Amharic), Sinhalese (Sinhala), Oriya (Odia), and Myanmar (Burmese).
We also compare GPT-4o and Qwen2-VL across 15 language families. Results show that performance on several African (Atlantic-Congo) languages, such as Igbo, Kinyarwanda, Shona, Swahili, and Yoruba, is inferior to that on several Asian languages (e.g., Chinese, Korean, Vietnamese) and Western languages (e.g., English, French, German).
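A family-level comparison like this can be obtained by mapping each language to its family and averaging per-language accuracy. The sketch below is illustrative only: the mapping is a small subset of the 15 families, and the accuracy numbers are placeholders, not actual results.

```python
# Illustrative sketch: grouping per-language accuracy by language family.
import pandas as pd

LANGUAGE_FAMILY = {  # small illustrative subset, not the full mapping
    "Igbo": "Atlantic-Congo", "Swahili": "Atlantic-Congo", "Yoruba": "Atlantic-Congo",
    "English": "Indo-European", "French": "Indo-European", "German": "Indo-European",
    "Chinese": "Sino-Tibetan", "Korean": "Koreanic", "Vietnamese": "Austroasiatic",
}

# Hypothetical per-language results; placeholder numbers, not actual scores.
per_language_accuracy = {
    "GPT-4o": {"Igbo": 0.61, "English": 0.88, "Chinese": 0.84},
    "Qwen2-VL": {"Igbo": 0.42, "English": 0.79, "Chinese": 0.76},
}

rows = [
    {"model": m, "family": LANGUAGE_FAMILY.get(lang, "Other"), "accuracy": acc}
    for m, scores in per_language_accuracy.items()
    for lang, acc in scores.items()
]
print(pd.DataFrame(rows).groupby(["model", "family"])["accuracy"].mean())
```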
Figure: We present qualitative examples of success cases (first row) and failure cases (second row) of GPT-4o on different languages & domains in ALM-bench. For the failure cases, we specify different error types. For instance, the Urdu-language question asks about the festival depicted in the image. The image specifically refers to Mela Chiraghan (Festival of Lights), a celebration held in honor of the Sufi saint Shah Jamal's shrine. Since the decoration in the image closely resembles that of Eid Milad un Nabi, another religious festival, the model erroneously associates it with the wrong event. This constitutes a lack of cultural understanding, since the model fails to distinguish the themes behind the two festivals' decorations. Eid Milad un Nabi typically features more modest, reverential lighting with green lights, whereas the lighting in Mela Chiraghan is brighter and more colorful. Additionally, people typically dress in traditional outfits for Eid Milad un Nabi, which is absent in the image. These examples highlight the model's gaps in cultural knowledge and its limitations in accurately interpreting the cultural context of the given sample.
The first version of the ALM-Bench dataset is already finalized. However, for additional reference, we provide code snippets along with the LLM prompts that we used to generate the initial set of QA pairs.
Please refer to QA_GENERATION.md for instructions and sample code on generating question-answer pairs using an LLM.
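For a rough feel of the generation step without opening QA_GENERATION.md, here is a minimal sketch of LLM-assisted QA drafting. The prompt text, model name, and helper function are illustrative assumptions, not the exact pipeline used to build ALM-Bench; drafted pairs were subsequently verified and corrected by native-language experts, as described in the annotation pipeline above.

```python
# Illustrative sketch of LLM-assisted QA drafting; see QA_GENERATION.md for the
# actual prompts and code. Prompt wording and helper are assumptions.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def generate_qa(image_description: str, language: str, country: str) -> str:
    prompt = (
        f"You are given a description of a culturally relevant image from {country}:\n"
        f"{image_description}\n\n"
        f"Write one multiple-choice question, one true/false question, and one short "
        f"and one long visual question-answer pair in {language}, grounded in the "
        f"cultural content of the image."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```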
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The images in the ALM-Bench dataset are collected from public domains and sources (refer to the main paper for more details) and are for academic research use only. By using ALM-Bench, you agree not to use the dataset for any harm or unfair discrimination. Please note that the data in this dataset may be subject to other agreements; copyrights belong to the original dataset providers, content creators, or platforms.
If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:
@misc{Ashmal, ..
}
If you have any questions, please create an issue on this repository or contact us at [email protected].
This repository borrows Video-LMM evaluation code from TimeChat and LLaMA-VID. We also borrowed partial code from the Video-ChatGPT and CVRR-Evaluation-Suite repositories. We thank the authors for releasing their code.