
Add new Arabic benchmarks (5) and enhance existing tasks #372

Merged · 24 commits · Dec 11, 2024

Conversation

@alielfilali01 (Contributor) commented Oct 23, 2024

  • Renamed arabic_mmlu to arabic_mmlu_mt:

    • This change reflects that the previous Arabic MMLU was machine translated (MT) using a neural machine translation (NMT) engine (most probably Google Translate API).
  • Introduced three new MMLU-style benchmarks:

    • arabic_mmlu: Native Arabic MMLU benchmark introduced by MBZUAI, based on the official "ArabicMMLU" paper (https://arxiv.org/abs/2402.12840).
    • arabic_mmlu_ht: Human-translated version from MBZUAI, providing a more accurate and high-quality translation of the original work by Hendrycks et al. (2021) on Measuring Massive Multitask Language Understanding (MMLU).
    • arabic_mmmlu: Arabic subset of OpenAI's Multilingual MMLU (MMMLU), which is human-annotated and covers the same subjects as the original MMLU.
  • Added AraTrust benchmark:

    • Integrated AraTrust, a benchmark designed for evaluating trustworthiness in Arabic LLMs (based on the paper "AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic" https://arxiv.org/abs/2403.09017).
  • Added MadinahQA benchmark:

  • Comparative study across different versions of Arabic MMLU:

    • Detailed performance analysis shows a strong correlation between OpenAI’s MMMLU (human annotated) and MBZUAI’s Arabic MMLU HT (human-translated).
    • The arabic_mmlu_mt (machine translated using an NMT engine) shows competitive results compared to human-translated versions, indicating the efficacy of the translation engine.
    • The Okapi version (arabic_mmlu_okapi), which was translated using GPT-3.5 API (ChatGPT), shows lower correlation and performance, reflecting potential flaws and lower translation quality.

The table below shows a comparative analysis of model performance across the different Arabic MMLU datasets.

| Model Name | Model Size (B) | Average Score | Arabic MMLU Okapi | Arabic MMLU MT | Arabic MMLU HT | Arabic MMLU OpenAI |
|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-7B-Instruct | 7 | 49.79 | 37.94 | 50.95 | 55.41 | 54.86 |
| Qwen_Qwen2.5-7B | 7 | 47.35 | 36.21 | 49.21 | 51.70 | 52.28 |

cc: @clefourrier, @NathanHB, @hynky1999

Add new Arabic benchmarks and update existing tasks

- Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.
- Added new benchmarks: `arabic_mmlu` (native ArabicMMLU, https://arxiv.org/abs/2402.12840), `arabic_mmlu_ht` (human-translated), and `MadinahQA` from MBZUAI, as well as `arabic_mmmlu` (OpenAI MMMLU) and `AraTrust`, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).
- Enhanced prompt functions for better flexibility in answer options (see the sketch below).
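
To illustrate what "flexibility in answer options" means in practice, here is a minimal sketch of a prompt function that tolerates a variable number of choices, assuming the ArabicMMLU column layout (`Option 1`..`Option 5`, `Answer Key`) that appears later in this thread; the function name and English wording are placeholders, not the PR's actual code:

from lighteval.tasks.requests import Doc

LETTERS = ["A", "B", "C", "D", "E"]

def arabic_mmlu_pfn_sketch(line, task_name=None) -> Doc:
    # Keep only the options that are actually filled in (rows can have 2 to 5 choices).
    options = [str(line[f"Option {i}"]) for i in range(1, 6) if line.get(f"Option {i}")]
    # The gold answer is given as a letter in the "Answer Key" column.
    gold_index = LETTERS.index(line["Answer Key"])
    query = line["Question"] + "\n"
    query += "".join(f"{LETTERS[i]}. {opt}\n" for i, opt in enumerate(options))
    query += "Answer:"
    return Doc(
        task_name=task_name,
        query=query,
        # Targets are the letters themselves, truncated to the number of available options.
        choices=LETTERS[: len(options)],
        gold_index=gold_index,
    )
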
Rename file to reflect that it is v1 leaderboard tasks
Tasks for v2 of OALL
add new and renamed tasks
@alielfilali01 alielfilali01 changed the title Add new Arabic MMLU benchmarks and enhance existing tasks Add new Arabic benchmarks and enhance existing tasks Oct 23, 2024
@alielfilali01 alielfilali01 changed the title Add new Arabic benchmarks and enhance existing tasks Add new Arabic benchmarks (5) and enhance existing tasks Oct 23, 2024
@hynky1999 (Collaborator) commented Oct 24, 2024

Hi, thanks for adding the benches.
Two things:
Do you think you could use the prompt templates? This will ensure that you can easily switch between formulations (and thus evaluate models at an early stage of training) and that the task implementations are consistent.

Secondly, we have already added both arabic_mmmlu and openai_mmlu. I would prefer not to add duplicates, but I am open to discussing the with/without-instruction modifications.

@alielfilali01 (Contributor, Author) commented

Hey @hynky1999, thanks for your input!

> Secondly, we have already added both arabic_mmlu and openai_mmlu.

I went ahead and added them just to test how the different implementations might affect the scores (hopefully they don't!). I can run this test on my fork and compare the results with your version, and then we can decide whether to keep them or not. What do you think?

> Do you think you could use the prompt templates?

I haven’t fully wrapped my head around the templates yet—it might take me a few days. If you’re able to help with this integration in the meantime, feel free to contribute! Otherwise, I’ll try to get to it by next week max.

Also, I’m unsure why the format check is failing. I ran ruff format . on my local machine before pushing, but it’s still being flagged. Could you help me figure out what might be going wrong?

Thanks!

@NathanHB (Member) commented

Hi! Thanks for the PR. For the formatting issues, you should use the pre-commit hooks:

pip install -e .[dev]
pre-commit install
pre-commit run --all-files

@NathanHB (Member) commented Nov 4, 2024

Thanks for fixing the formatting! You can find the doc for adding a new task using the prompt templates here; don't hesitate to reach out if you need any help :)

@alielfilali01 (Contributor, Author) commented

Hey @NathanHB, thanks for pointing out the docs on adding prompt templates. I'm planning to add that in a separate PR in the near future. For now, I believe we can move on with this unless it contradicts the team's plan for future versions of LightEval.

@clefourrier (Member) commented

Hi! I think we can indeed move forward with this. One last thing before we do: did you check the difference in results between your implementations and the current implementations of arabic mmlu and openai mmlu?

@alielfilali01 (Contributor, Author) commented

Hey @clefourrier, unfortunately I was planning to, but it slipped through as I forgot to add it to my to-do list! I can prioritize this over the weekend and get back to you with an update by Monday. Does this timeline work? I assume there will be no difference (or at least that's what we hope for).
Also, could you please confirm that these are the correct corresponding multilingual tasks?

lighteval|mmlu_ara_mcf|0|0
lighteval|openai_mmlu_ara_mcf|0|0

Let me know if there’s anything else to consider.

@clefourrier (Member) commented

Yep, the timeline works on our side! I think these are the correct evals.

@hynky1999 (Collaborator) commented

Yes, these are the correct evals.

@alielfilali01 (Contributor, Author) commented

I'm sorry, but I don't understand why I keep hitting this error:

ValueError: Cannot find tasks  lighteval|mmlu_ara_mcf in task list or in custom task registry)

My main command:

    # Run the evaluation command
    srun -N $SLURM_NNODES --ntasks=$SLURM_NTASKS --cpus-per-task=$SLURM_CPUS_PER_TASK --gres=gpu:$SLURM_GPUS_PER_NODE \
    yes 'y' | lighteval accelerate \
    --model_args "pretrained=$model,trust_remote_code=$TRUST_REMOTE_CODE" \
    --tasks "lighteval|openai_mmlu_ara_mcf|0|0, lighteval|mmlu_ara_mcf|0|0" \
    --override_batch_size 1 \
    --output_dir=$RESULTS_DIR

I don't understand where I'm supposed to pass the custom task registry!

cc: @hynky1999, @NathanHB, @clefourrier

@hynky1999 (Collaborator) commented

You want to run with --custom_task lighteval.tasks.multilingual.tasks.

@alielfilali01 (Contributor, Author) commented

Oooh, I thought the lighteval suite didn't need the --custom_task tag!
Thank you @hynky1999, now it is running.
Will try to update you guys with the results ASAP.

@alielfilali01 (Contributor, Author) commented Nov 20, 2024

Hey @clefourrier, following up on your previous question:

> did you check the difference in results between your implementations and the current implementations of arabic mmlu and openai mmlu?

Please find the results here; I find them pretty interesting!


| Model | ArabicMMLU (arabic community) | ArabicMMLU (multilingual lighteval) | Score Variation A | OpenAI MMMLU (arabic community) | OpenAI MMMLU (multilingual lighteval) | Score Variation B |
|---|---|---|---|---|---|---|
| inceptionai/jais-family-2p7b | 0.2946 | 0.2763 | 0.0183 | 0.2530 | 0.2537 | -0.0007 |
| inceptionai/jais-family-2p7b-chat | 0.5034 | 0.5148 | -0.0114 | 0.4122 | 0.4365 | -0.0243 |
| inceptionai/jais-family-6p7b | 0.2960 | 0.2797 | 0.0163 | 0.2643 | 0.2629 | 0.0014 |
| inceptionai/jais-family-6p7b-chat | 0.5519 | 0.5470 | 0.0049 | 0.4468 | 0.4627 | -0.0159 |
| inceptionai/jais-family-30b-8k | 0.5163 | 0.4102 | 0.1061 | 0.4156 | 0.3755 | 0.0401 |
| inceptionai/jais-family-30b-8k-chat | 0.6067 | 0.6271 | -0.0204 | 0.5001 | 0.5405 | -0.0404 |

I don't really see any clear pattern here! Sometimes the score is higher with the community suite implementation, sometimes it is higher with the lighteval implementation, @hynky1999.

@hynky1999 (Collaborator) commented Nov 20, 2024

I think there is one consistency:
Base models work better with yours, while chat models work better with lighteval.

Re chat models: did you run with templates?
Re task type: you ran with mcf, right? The scores are acc_norm?

In any case:
For openai mmlu, we should keep the lighteval implementation. The task is inherently multilingual (it has subsets for different languages), so using the lighteval implementation allows reusing it for other languages with no code changes.

For the MBZUAI mmlu, I am open to going with your implementation (since you are the creator), but would maybe prefer a middle ground, because it's still very hard-coded and doesn't allow switching to CF:
Could you try running with this task definition?

# Imports implied by the snippet below (paths as in lighteval's multilingual tasks):
from lighteval.metrics.dynamic_metrics import loglikelihood_acc_metric
from lighteval.metrics.normalizations import LogProbCharNorm, LogProbTokenNorm
from lighteval.tasks.default_prompts import LETTER_INDICES
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.multilingual.utils.task_utils import get_metrics_for_formulation, normalize_subset
from lighteval.tasks.requests import Doc
from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

# ARABIC_MMLU_SUBSETS (the list of ArabicMMLU subset names) is assumed to be defined
# elsewhere in the same module.

arabic_mmlu_templated_tasks = [
    LightevalTaskConfig(
        name=f"templated_mmlu_{Language.ARABIC.value}_{formulation.name.lower()}:{normalize_subset(subset)}",
        prompt_function=get_mcq_prompt_function(
            Language.ARABIC,
            lambda line: {
                "instruction": "السؤال التالي هو سؤال متعدد الإختيارات. اختر الإجابة الصحيحة:",
                "context": line["Context"],
                "question": line["Question"],
                "choices": [str(o) for o in [line[f"Option {i}"] for i in range(1, 6)] if o],
                "gold_idx": LETTER_INDICES.index(line["Answer Key"]),
            },
            formulation=formulation,
        ),
        suite=("lighteval",),
        hf_repo="MBZUAI/ArabicMMLU",
        hf_subset=subset,
        evaluation_splits=("test",),
        hf_avail_splits=["dev"],
        metric=get_metrics_for_formulation(
            formulation,
            [
                loglikelihood_acc_metric(normalization=LogProbTokenNorm()),
                loglikelihood_acc_metric(normalization=LogProbCharNorm()),
            ],
        ),
    )
    for subset in ARABIC_MMLU_SUBSETS
    for formulation in [
        MCFFormulation("NativeLetters"),
    ]
]

The task name in this case is: templated_mmlu_ara_mcf.
Btw, lighteval now supports running all subsets by just specifying the common task name;
e.g. lighteval|templated_mmlu_ara_mcf|0|0 will run all subsets.

I think it's a good middle ground between the two:

{'choices': [' أ', ' ب', ' ج', ' د'],
 'ctx': '',
 'gold_index': [0],
 'instruction': 'السؤال التالي هو سؤال متعدد الإختيارات. اختر الإجابة الصحيحة:',
 'num_asked_few_shots': -1,
 'num_effective_few_shots': -1,
 'original_query': '',
 'query': 'السؤال التالي هو سؤال متعدد الإختيارات. اختر الإجابة الصحيحة:\n'
          'سؤال: ما هي ذاكرة الوصول العشوائي\n'
          ' أ. هي الذاكرة التي تخزنبها البيانات التي يحتاجها الكمبيوتر أثناء '
          'فترة تشغيله\n'
          ' ب. هي الذاكرة التي تخزن بها البيانات التي تعالج من قبل الكمبيوتر\n'
          ' ج. هو المكان الذي نستخدمه لتخزين البيانات التي نتعامل معها دائماً '
          ', كما تستخدم لنقل البيانات من مكان آلخر\n'
          ' د. هي ذاكرة دائماً يحتاجها الكمبيوتر طول  فترة التشغيل وهي خاصة '
          'بنقل بيانات بطاقات الصوت والفيديو والشبكة\n'
          'إجابة:',
 'specific': None,
 'target_for_fewshot_sorting': None,
 'task_name': 'lighteval|templated_mmlu_ara_mcf:computer_science_university',
 'unconditioned_query': 'إجابة:'}

@alielfilali01 (Contributor, Author) commented

I noticed this as well:

> Base models work better with yours, while chat models work better with lighteval.

But inceptionai/jais-family-6p7b-chat was an exception in ArabicMMLU.

I agree that, for the sake of consistency, OpenAI MMMLU should be run through the multilingual lighteval suite.

About this:

> Btw, lighteval now supports running all subsets by just specifying the common task name:

I tried it and kept running into errors, so I just copy-pasted all the tasks into a txt file and ran that.

@alielfilali01 (Contributor, Author) commented

> What errors?

@hynky1999, honestly I totally forgot; it is somewhere in one of the log files, but I can't remember which one 😶
I remember trying to debug it assuming I was doing something wrong (which most probably I was), but it was so much hassle that I just copied the subset tasks into a file instead and ran that, and it went through smoothly.

@NathanHB (Member) commented

Then we are using lighteval's version for OpenAI MMLU and your version for arabic_mmlu. @alielfilali01, just waiting for you to use the templated version as shown here and we will be good to merge! :)

@alielfilali01 (Contributor, Author) commented Nov 22, 2024

Hey @NathanHB, is it already defined, or should I define it? Anyway, I will try to get it done tomorrow.
Also, I just remembered that I need to remove the openai mmmlu task from arabic_evals.py!

remove openai mmmlu task following the discussion here: #372
remove openai mmmlu task following the discussion here: #372
@NathanHB (Member) commented

It is not yet defined using the templated prompt; @hynky1999 provided some code to use the templates!

> For the MBZUAI mmlu, I am open to going with your implementation (since you are the creator), but would maybe prefer a middle ground, because it's still very hard-coded and doesn't allow switching to CF.
> Could you try running with this task definition?

@alielfilali01 (Contributor, Author) commented

Hey @hynky1999, I'm wondering which normalization exactly is used in the mcf formulation when calling get_metrics_for_formulation:

metric=get_metrics_for_formulation(
            formulation,
            [
                loglikelihood_acc_metric(normalization=LogProbTokenNorm()),
                loglikelihood_acc_metric(normalization=LogProbCharNorm()),
                loglikelihood_acc_metric(normalization=LogProbPMINorm()),
            ],
        ),

For example, in our ArabicMMLU the metric is loglikelihood_acc_norm with LogProbCharNorm.
From what I understood in src/lighteval/tasks/multilingual/utils/task_utils.py, the normalization for mcf is None! Can you confirm that?
I assume this probably explains the differences in the scores above, especially since chat models usually "understand" that they need to assign higher probability to a letter, while base models give lower probability to a letter and more to a continuation... Honestly, I don't know, I'm just hallucinating here.

@alielfilali01 (Contributor, Author) commented Nov 23, 2024

Update on running the templated_mmlu_ara_mcf tasks.
I keep hitting this error:

[rank0]:     raise ValueError(f"Cannot find tasks {task_name} in task list or in custom task registry)")
[rank0]: ValueError: Cannot find tasks lighteval|templated_mmlu_ara_mcf:accounting_university in task list or in custom task registry)

The task is defined in lighteval/src/lighteval/tasks/multilingual/tasks.py as mentioned above, and the custom tasks tag is lighteval.tasks.multilingual.tasks.
I can't figure out what exactly is wrong.
cc: @hynky1999

@hynky1999 (Collaborator) commented Nov 23, 2024

Because it's not there! You have to create it, or give me rights to push to your branch.
And I can't update your PR:

error: Authentication error: Authentication required: You must have push access to verify locks
error: failed to push some refs to 'https://github.com/alielfilali01/lighteval.git'

@hynky1999 (Collaborator) commented

> For example, in our ArabicMMLU the metric is loglikelihood_acc_norm with LogProbCharNorm.
> From what I understood in src/lighteval/tasks/multilingual/utils/task_utils.py, the normalization for mcf is None! Can you confirm that?

Yes, the explanation is simple:

  • If you evaluate with A/B/C/D as targets (mcf), we don't normalize, because you want to know the probability of generating the letters. Also, most LLMs use just a single token for them anyway.
  • Conversely, if you evaluate with cf, you are predicting the continuations themselves, and they can have drastically different lengths. In this case we use normalizations.

Lastly, why bother with changing the normalizations at all? If you normalize with chars/tokens for mcf you will get the same results, because the targets are just a single token, right? Well yes, but we also use PMI normalization for some tasks, which makes evals 2x more expensive (you have to run two logprob calls per sample), so that's why we bother with changing the norms for mcf.

PS: I noticed a bug yesterday; the produced metric is called acc_, will fix it soon.
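
To make the MCF/CF distinction concrete, a rough sketch of the selection logic being described (simplified, not lighteval's actual internals; the CF normalization shown is the character-length variant, i.e. the LogProbCharNorm idea):

def pick_mcf(letter_logprobs: list[float]) -> int:
    # MCF: the targets are single letters (A/B/C/D), so the raw log-probabilities are
    # compared directly; no length normalization (and no extra PMI logprob call) is needed.
    return max(range(len(letter_logprobs)), key=lambda i: letter_logprobs[i])

def pick_cf(continuation_logprobs: list[float], continuations: list[str]) -> int:
    # CF: the targets are the full answer texts, whose lengths can differ a lot, so each
    # log-probability is normalized, here by character count.
    scores = [lp / max(len(c), 1) for lp, c in zip(continuation_logprobs, continuations)]
    return max(range(len(scores)), key=lambda i: scores[i])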

@alielfilali01 (Contributor, Author) commented

> Because it's not there! You have to create it, or give me rights to push to your branch.

Indeed, I did define it in my local lighteval/src/lighteval/tasks/multilingual/tasks.py as presented above.

@hynky1999 (Collaborator) commented

🤔 It's tough to tell, can you push it?

@hynky1999 (Collaborator) commented

cc @alielfilali01, could you push the changes so I can check what's wrong?

Adding a templated version of arabic mmlu based on @hynky1999's request in the #372 PR
@alielfilali01 (Contributor, Author) commented

Hey @hynky1999, so sorry for the delay! I was planning to after seeing your first comment, but I got carried away with other things and it totally slipped my mind! Thanks for the reminder too.
Please review the last commit, since that's how I did it locally, and let me know what I did wrong. Also, if it is sound, feel free to run it from your side as well.

@alielfilali01 (Contributor, Author) commented

Hey @hynky1999, have you managed to take a look yet?

@hynky1999 (Collaborator) commented

@alielfilali01, I was a bit busy. The issue is that you didn't add it to TASKS_TABLE.
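
For reference, a minimal sketch of the fix being described, assuming the module follows lighteval's custom-task convention of collecting its configs in a module-level TASKS_TABLE list (with arabic_mmlu_templated_tasks being the list defined earlier in this thread):

# At the bottom of the custom/multilingual tasks module: every LightevalTaskConfig that
# should be resolvable by name has to be collected in TASKS_TABLE, otherwise the registry
# raises "Cannot find tasks ..." even though the configs are defined in the file.
TASKS_TABLE = [
    *arabic_mmlu_templated_tasks,
    # ... plus the other task configs already defined in the module
]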

@alielfilali01 (Contributor, Author) commented

Ooooh 😶 Alright, will do tomorrow and get back to you on the status.

@HuggingFaceDocBuilderDev (Collaborator) commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@clefourrier (Member) commented

Doing one last run of the tests and we should be good to go.

@clefourrier merged commit de8dba3 into huggingface:main on Dec 11, 2024.
3 checks passed.