
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages


Core Authors

University of Central Florida, Mohamed bin Zayed University of AI, Amazon, Aalto University, Australian National University, Linköping University

Paper | Dataset | Website

Official GitHub repository for All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages.


📢 Latest Updates

  • Nov-21-24: arXiv preprint is released! 🔥🔥
  • Nov-20-24: ALM-Bench dataset and code are released. It provides 22,763 human-annotated multimodal QA pairs across 19 categories to extensively evaluate the performance of LMMs. 🔥🔥

🏆 Highlights

main figure

Figure: ALM-bench comprises a diverse set of 100 languages with annotations manually verified by native-language experts. The qualitative examples highlight the comprehensive set of 13 cultural aspects covered in the benchmark, such as heritage, customs, architecture, literature, music, and sports. It also evaluates visual understanding across six generic aspects. ALM-bench focuses on low-resource languages and diverse regions, spanning 73 countries across five continents and 24 distinct scripts. It covers diverse question types, including multiple-choice questions (MCQs), true/false (T/F), and short and long visual question answers (VQAs).

Abstract: Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model’s ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark and codes are publicly available.

ALM-Bench assesses multilingual multimodal models for better cultural understanding and inclusivity.

Main contributions:

  1. All Languages Matter Benchmark (ALM-Bench): We introduce ALM-bench, a culturally diverse multilingual and multimodal VQA benchmark covering 100 languages with 22.7K question-answer pairs. ALM-bench encompasses 19 generic and culture-specific domains for each language, enriched with four diverse question types (a sketch of one possible record layout follows this list).
  2. Extensive Human Annotation: ALM-bench is meticulously curated and verified with native-language experts (over 800 hours of human annotation), ensuring cultural relevance and accuracy across low- and high-resource languages alike.
  3. Comprehensive Evaluation: We benchmark existing LMMs on the ALM-bench, identifying performance gaps and areas for improvement, especially in culturally complex multilingual scenarios.
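
To make the dataset structure concrete, the sketch below shows one possible layout for a single ALM-Bench entry. The field names and values (`language`, `category`, `question_type`, and so on) are illustrative assumptions, not the dataset's actual schema; consult the dataset card for the real column names.

```python
# Hypothetical layout of one ALM-Bench QA record (illustrative only; the
# released dataset may use different field names and value conventions).
sample = {
    "language": "Urdu",                        # one of the 100 covered languages
    "country": "Pakistan",                     # country/region the image is associated with
    "category": "Heritage",                    # one of the 19 generic + cultural domains
    "question_type": "MCQ",                    # MCQ | True/False | Short VQA | Long VQA
    "image": "images/urdu_heritage_0001.jpg",  # culturally grounded image (path or PIL.Image)
    "question": "تصویر میں کون سا تہوار دکھایا گیا ہے؟",
    "options": ["میلہ چراغاں", "عید میلاد النبی", "بسنت", "شبِ برات"],  # MCQs only
    "answer": "میلہ چراغاں",                   # ground-truth answer, verified by a native speaker
}
```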

🗂️ Dataset

Dataset Comparison table

Table: Comparison of various LMM benchmarks with a focus on multilingual and cultural understanding. The Domains column indicates the range of aspects covered by the dataset for each language. Question Form is categorized as "Diverse" if the questions' phrasing varies, and "Fixed" otherwise. Annotation Types are classified as "Manual" if questions were originally written in the local language, "Manual+Auto" if questions were generated or translated using GPT-4/Google API and subsequently validated by human experts, and "Auto" if generated or translated automatically without human validation. Bias Correction reflects whether the dataset is balanced across cultures and countries, while Diversity indicates whether the dataset includes both Western and non-Western minority cultures. ‘-’ means information not available.


🔍 Dataset Annotation Process

main figure

Figure: Data collection and verification pipeline. Our benchmark features both culture-specific content sourced from the web (left) and a generic image-understanding collection sourced from an existing LMM benchmark. The cultural part is carefully filtered to remove noisy samples and private information. We use GPT-4o for translations, which are manually verified and corrected by native-speaker annotators over more than 800 hours. ALM-bench features diverse question types and approximately 23K QA pairs in total across 100 languages.


📊 Results

The heatmap below presents the evaluation results of 16 recent LMMs (14 open-source and 2 closed-source) on all 100 languages and the 19 categories of the ALM-Bench dataset.

Results Heatmap
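
For readers reproducing this kind of figure, the snippet below is a minimal sketch of how a languages-by-categories accuracy heatmap can be built from per-sample evaluation results. It assumes a flat results file with `model`, `language`, `category`, and `correct` columns; these column names are assumptions for illustration, not the repository's actual output format.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed input: one row per evaluated QA pair, with illustrative column names
# (model, language, category, correct) -- adapt to the actual evaluation output.
results = pd.read_csv("results.csv")

# Mean accuracy per (language, category) cell for a single model.
acc = (
    results[results["model"] == "GPT-4o"]
    .groupby(["language", "category"])["correct"]
    .mean()
    .unstack("category")              # rows: languages, columns: 19 categories
)

fig, ax = plt.subplots(figsize=(10, 22))
im = ax.imshow(acc.values, aspect="auto", cmap="viridis", vmin=0.0, vmax=1.0)
ax.set_xticks(range(acc.shape[1]))
ax.set_xticklabels(acc.columns, rotation=90)
ax.set_yticks(range(acc.shape[0]))
ax.set_yticklabels(acc.index, fontsize=5)
fig.colorbar(im, ax=ax, label="Accuracy")
fig.tight_layout()
fig.savefig("alm_bench_heatmap.png", dpi=200)
```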

⚖️ Cultural Categories and Open-Source vs. Closed-Source LMM Performance

main figure

Figure: Benchmarking LMMs on diverse languages and cultures: (a) Performance of various open- versus closed-source LMMs on ALM-bench. For each LMM, we also report performance on low- versus high-resource languages. All of these carefully selected models were released after 2023. (b) Performance of high-performing LMMs on the 13 culturally curated categories in ALM-bench.

Different prompting techniques

We study the performance of various LMMs with and without additional country location information. Proprietary models show a notable performance boost of 2.6% to 5% when location-aware prompts are used, while open-source models exhibit a marginal improvement.

Model              With Country Info.    Without Country Info.
GPT-4o             83.57%                80.96%
Gemini-1.5-Pro     81.52%                76.19%
GLM-4V-9B          56.78%                56.41%
Qwen2-VL           53.97%                52.57%
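
As a concrete illustration, the sketch below shows one way such location-aware prompts could be assembled for evaluation. The prompt wording is an assumption for illustration and is not the exact prompt used to produce the numbers above.

```python
def build_prompt(question: str, options: list[str] | None = None,
                 country: str | None = None) -> str:
    """Assemble an evaluation prompt, optionally prefixed with country context.

    Illustrative wording only; not the exact prompt used for the reported results.
    """
    parts = []
    if country:
        # Location-aware variant: tell the model which country the image relates to.
        parts.append(f"The following image and question relate to the culture of {country}.")
    parts.append(question)
    if options:
        labels = " ".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
        parts.append(f"Options: {labels}")
        parts.append("Answer with the letter of the correct option.")
    return "\n".join(parts)

# Same QA pair, with and without country information.
question = "Which festival is shown in the image?"
options = ["Mela Chiraghan", "Eid Milad un Nabi", "Basant", "Shab-e-Barat"]
print(build_prompt(question, options, country="Pakistan"))  # location-aware
print(build_prompt(question, options))                       # plain
```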

Performance on different language scripts

We compare the performance of GPT-4o and Qwen2-VL on different language scripts. Both GPT-4o and Qwen2-VL struggle particularly on low-resource scripts such as Ge’ez (Amharic), Sinhalese (Sinhala), Oriya (Odia), and Myanmar (Burmese).

app-screen

Performance on different language families

We show the performance comparison of GPT-4o and Qwen2-VL on 15 language families. Results show that performance on several African (Atlantic-Congo) languages such as Igbo, Kinyarwanda, Shona, Swahili, and Yoruba lags behind that on several Asian (e.g., Chinese, Korean, Vietnamese) and Western (e.g., English, French, German) languages.

app-screen


🤖 Qualitative Success and Failure cases

main figure

Figure: We present qualitative examples of success cases (first row) and failure cases (second row) of GPT-4o across different languages and domains in ALM-bench. For the failure cases, we specify the error type. For instance, the Urdu-language question asks about the festival depicted in the image. The image specifically refers to Mela Chiraghan (Festival of Lights), a celebration held at the shrine of the Sufi saint Shah Jamal in his honor. Since the decoration in the image closely resembles that of Eid Milad un Nabi, another religious festival, the model erroneously associates it with that event. This constitutes a lack of cultural understanding, since the model fails to distinguish the themes behind the decorations: Eid Milad un Nabi typically features more modest, reverential lighting with green lights, whereas the lighting in Mela Chiraghan is brighter and more colorful. Additionally, people typically dress in traditional outfits for Eid Milad un Nabi, which is absent in the image. These examples highlight the model's gaps in cultural knowledge and its limitations in accurately interpreting the cultural context of a given sample.


🚀 Getting started with ALM-Bench

Downloading and Setting Up ALM-Bench Dataset
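
A minimal sketch for loading the dataset with the Hugging Face `datasets` library is shown below. The dataset ID `MBZUAI/ALM-Bench`, and the assumption that the data is hosted on the Hugging Face Hub, are ours; confirm the exact location and split names on the project website or dataset card.

```python
# Minimal sketch: fetch ALM-Bench with the Hugging Face `datasets` library.
# The dataset ID below is an assumption -- verify it on the project website.
from datasets import load_dataset

ds = load_dataset("MBZUAI/ALM-Bench")   # returns a DatasetDict of available splits
print(ds)                               # split names, row counts, and column names

split_name = next(iter(ds))             # pick the first available split
example = ds[split_name][0]
print(example.keys())                   # inspect the real schema before writing eval code
```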

🧩 Additional Assets for LLM-based QA generation process:

Generating LLM-based question-answer pairs from images in our cultural categories:

The first version of the ALM-Bench dataset is already finalized. However, for reference, we provide the code snippets and LLM prompts that we used to generate the initial set of QA pairs.

Please refer to QA_GENERATION.md for instructions and sample code on generating question-answer pairs using an LLM.
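
For orientation, the sketch below shows what an LLM-based QA-generation call can look like with the OpenAI Python client and GPT-4o, which the pipeline above uses for translation and initial QA drafting. The prompt text, function name, and model string here are illustrative assumptions; the actual prompts and scripts are in QA_GENERATION.md.

```python
import base64
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_qa_pairs(image_path: str, category: str, language: str) -> str:
    """Ask GPT-4o to draft candidate QA pairs for one cultural image.

    Illustrative prompt only; see QA_GENERATION.md for the prompts actually
    used to seed the benchmark. Drafts are later verified by native speakers.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"This image belongs to the cultural category '{category}'. "
                    f"In {language}, write one multiple-choice, one true/false, "
                    f"one short-answer, and one long-answer question about it, "
                    f"each with its correct answer."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```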


📂 License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The images in the ALM-Bench dataset are collected from public domains and sources (refer to the main paper for more details) and are for academic research use only. By using ALM-Bench, you agree not to use the dataset for any harm or unfair discrimination. Please note that the data in this dataset may be subject to other agreements. Image copyrights belong to the original dataset providers, image creators, or platforms.

📜 Citation

If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:

@misc{Ashmal, .. 
} 

📨 Contact

If you have any questions, please create an issue on this repository or contact at [email protected].

🙏 Acknowledgements

This repository has borrowed Video-LMM evaluation code from TimeChat and LLaMA-VID. We also borrowed partial code from the Video-ChatGPT and CVRR-Evaluation-Suite repositories. We thank the authors for releasing their code.

