
Add new Arabic benchmarks (5) and enhance existing tasks #372

Merged · 24 commits · Dec 11, 2024

Conversation

@alielfilali01 (Contributor) commented Oct 23, 2024

  • Renamed arabic_mmlu to arabic_mmlu_mt:

    • This change reflects that the previous Arabic MMLU was machine translated (MT) using a neural machine translation (NMT) engine (most probably Google Translate API).
  • Introduced three new MMLU-style benchmarks:

    • arabic_mmlu: Native Arabic MMLU benchmark introduced by MBZUAI, based on the official "ArabicMMLU" paper (https://arxiv.org/abs/2402.12840).
    • arabic_mmlu_ht: Human-translated version from MBZUAI, providing a more accurate and high-quality translation of the original work by Hendrycks et al. (2021) on Measuring Massive Multitask Language Understanding (MMLU).
    • arabic_mmmlu: Arabic subset of OpenAI's Multilingual MMLU (MMMLU), which is human-annotated and covers the same subjects as the original MMLU.
  • Added AraTrust benchmark:

    • Integrated AraTrust, a benchmark designed for evaluating trustworthiness in Arabic LLMs (based on the paper "AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic" https://arxiv.org/abs/2403.09017).
  • Added MadinahQA benchmark:

  • Comparative study across different versions of Arabic MMLU:

    • Detailed performance analysis shows a strong correlation between OpenAI’s MMMLU (human annotated) and MBZUAI’s Arabic MMLU HT (human-translated).
    • The arabic_mmlu_mt (machine translated using an NMT engine) shows competitive results compared to human-translated versions, indicating the efficacy of the translation engine.
    • The Okapi version (arabic_mmlu_okapi), which was translated using GPT-3.5 API (ChatGPT), shows lower correlation and performance, reflecting potential flaws and lower translation quality.

The table below shows a comparative analysis of model performance across the different Arabic MMLU datasets.

| Model Name | Model Size (B) | Average Score | Arabic MMLU Okapi | Arabic MMLU MT | Arabic MMLU HT | Arabic MMLU OpenAI |
|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-7B-Instruct | 7 | 49.79 | 37.94 | 50.95 | 55.41 | 54.86 |
| Qwen_Qwen2.5-7B | 7 | 47.35 | 36.21 | 49.21 | 51.70 | 52.28 |

cc: @clefourrier, @NathanHB, @hynky1999

Add new Arabic benchmarks and update existing tasks

- Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.
- Added new benchmarks: `arabic_mmlu` (native ArabicMMLU, https://arxiv.org/abs/2402.12840), `arabic_mmlu_ht` (human-translated), and `MadinahQA` from MBZUAI, as well as `arabic_mmmlu` (OpenAI MMMLU) and `AraTrust`, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).
- Enhanced prompt functions for better flexibility in answer options (see the sketch below).
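
To illustrate what "flexibility in answer options" means in practice, here is a minimal sketch of a prompt function that tolerates a variable number of choices, assuming the ArabicMMLU column layout (`Option 1`..`Option 5`, `Answer Key`) that appears later in this thread; the function name and English wording are placeholders, not the PR's actual code:

from lighteval.tasks.requests import Doc

LETTERS = ["A", "B", "C", "D", "E"]

def arabic_mmlu_pfn_sketch(line, task_name=None) -> Doc:
    # Keep only the options that are actually filled in (rows can have 2 to 5 choices).
    options = [str(line[f"Option {i}"]) for i in range(1, 6) if line.get(f"Option {i}")]
    # The gold answer is given as a letter in the "Answer Key" column.
    gold_index = LETTERS.index(line["Answer Key"])
    query = line["Question"] + "\n"
    query += "".join(f"{LETTERS[i]}. {opt}\n" for i, opt in enumerate(options))
    query += "Answer:"
    return Doc(
        task_name=task_name,
        query=query,
        # Targets are the letters themselves, truncated to the number of available options.
        choices=LETTERS[: len(options)],
        gold_index=gold_index,
    )
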
Rename file to reflect that it is v1 leaderboard tasks
Tasks for v2 of OALL
add new and renamed tasks
@alielfilali01 alielfilali01 changed the title Add new Arabic MMLU benchmarks and enhance existing tasks Add new Arabic benchmarks and enhance existing tasks Oct 23, 2024
@alielfilali01 alielfilali01 changed the title Add new Arabic benchmarks and enhance existing tasks Add new Arabic benchmarks (5) and enhance existing tasks Oct 23, 2024
@hynky1999 (Collaborator) commented Oct 24, 2024

Hi, thanks for adding the benches.
Two things:
Do you think you could use the prompt templates? This will ensure that you can easily switch between formulations (and thus evaluate models at an early stage of training) and that the task implementations are consistent.

Secondly, we have already added both arabic_mmmlu and openai_mmlu. I would prefer not to add duplicates, but I am open to discussing the with/without-instruction modifications.

@alielfilali01 (Contributor, Author) commented

Hey @hynky1999, thanks for your input!

> Secondly, we have already added both arabic_mmlu and openai_mmlu.

I went ahead and added them just to test how the different implementations might affect the scores (hopefully they don't!). I can run this test on my fork and compare the results with your version, and then we can decide whether to keep them or not. What do you think?

> Do you think you could use the prompt templates?

I haven’t fully wrapped my head around the templates yet—it might take me a few days. If you’re able to help with this integration in the meantime, feel free to contribute! Otherwise, I’ll try to get to it by next week max.

Also, I’m unsure why the format check is failing. I ran ruff format . on my local machine before pushing, but it’s still being flagged. Could you help me figure out what might be going wrong?

Thanks!

@NathanHB (Member) commented

Hi! Thanks for the PR. For the formatting issues, you should use the pre-commit hooks:

pip install -e .[dev]
pre-commit install
pre-commit run --all-files

@NathanHB (Member) commented Nov 4, 2024

Thanks for fixing the formatting! You can find the doc for adding a new task using the prompt templates here; don't hesitate to reach out if you need any help :)

@alielfilali01 (Contributor, Author) commented

Hey @NathanHB, thanks for pointing out the docs on adding prompt templates. I'm planning to add that in a separate PR in the near future. For now, I believe we can move on with this unless it contradicts the team's plan for future versions of LightEval.

@clefourrier (Member) commented

Hi! I think we can indeed move forward with this. One last thing before we do: did you check the difference in results between your implementations and the current implementations of arabic mmlu and openai mmlu?

@alielfilali01 (Contributor, Author) commented

Hey @clefourrier, unfortunately I was planning to, but it slipped through as I forgot to add it to my to-do list! I can prioritize this over the weekend and get back to you with an update by Monday. Does this timeline work? I assume there will be no difference (or at least that's what we hope for).
Also, could you please confirm that these are the correct corresponding multilingual tasks?

lighteval|mmlu_ara_mcf|0|0
lighteval|openai_mmlu_ara_mcf|0|0

Let me know if there’s anything else to consider.

@clefourrier (Member) commented

Yep, the timeline works on our side! I think these are the correct evals.

@hynky1999 (Collaborator) commented

Yes, these are the correct evals.

@alielfilali01 (Contributor, Author) commented

I'm sorry, but I don't understand why I keep hitting this error:

ValueError: Cannot find tasks  lighteval|mmlu_ara_mcf in task list or in custom task registry)

My main command:

    # Run the evaluation command
    srun -N $SLURM_NNODES --ntasks=$SLURM_NTASKS --cpus-per-task=$SLURM_CPUS_PER_TASK --gres=gpu:$SLURM_GPUS_PER_NODE \
    yes 'y' | lighteval accelerate \
    --model_args "pretrained=$model,trust_remote_code=$TRUST_REMOTE_CODE" \
    --tasks "lighteval|openai_mmlu_ara_mcf|0|0, lighteval|mmlu_ara_mcf|0|0" \
    --override_batch_size 1 \
    --output_dir=$RESULTS_DIR

I don't understand where I'm supposed to pass the custom task registry!

cc: @hynky1999, @NathanHB, @clefourrier

@hynky1999 (Collaborator) commented

You want to run with --custom_task lighteval.tasks.multilingual.tasks.

@alielfilali01 (Contributor, Author) commented

Oooh, I thought the lighteval suite didn't need the --custom_task tag!
Thank you @hynky1999, now it is running.
Will try to update you guys with the results ASAP.

@alielfilali01 (Contributor, Author) commented Nov 20, 2024

Hey @clefourrier, following up on your previous question:

> did you check the difference in results between your implementations and the current implementations of arabic mmlu and openai mmlu?

Please find the results here; I find them pretty interesting!


| Model | ArabicMMLU (arabic community) | ArabicMMLU (multilingual lighteval) | Score Variation A | OpenAI MMMLU (arabic community) | OpenAI MMMLU (multilingual lighteval) | Score Variation B |
|---|---|---|---|---|---|---|
| inceptionai/jais-family-2p7b | 0.2946 | 0.2763 | 0.0183 | 0.2530 | 0.2537 | -0.0007 |
| inceptionai/jais-family-2p7b-chat | 0.5034 | 0.5148 | -0.0114 | 0.4122 | 0.4365 | -0.0243 |
| inceptionai/jais-family-6p7b | 0.2960 | 0.2797 | 0.0163 | 0.2643 | 0.2629 | 0.0014 |
| inceptionai/jais-family-6p7b-chat | 0.5519 | 0.5470 | 0.0049 | 0.4468 | 0.4627 | -0.0159 |
| inceptionai/jais-family-30b-8k | 0.5163 | 0.4102 | 0.1061 | 0.4156 | 0.3755 | 0.0401 |
| inceptionai/jais-family-30b-8k-chat | 0.6067 | 0.6271 | -0.0204 | 0.5001 | 0.5405 | -0.0404 |

I don't really see any clear pattern here! Sometimes the score is higher with the community suite implementation, sometimes it is higher with the lighteval implementation, @hynky1999.

@hynky1999 (Collaborator) commented Nov 20, 2024

I think there is one consistency:
Base models work better with yours, while chat models work better with lighteval.

Re chat models: did you run with templates?
Re task type: you ran with mcf, right? The scores are acc_norm?

In any case:
For openai mmlu, we should keep the lighteval implementation. The task is inherently multilingual (it has subsets for different languages), so using the lighteval implementation allows reusing it for other languages with no code changes.

For the MBZUAI mmlu, I am open to going with your implementation (since you are the creator), but would maybe prefer a middle ground, because it's still very hard-coded and doesn't allow switching to CF:
Could you try running with this task definition?

# Imports implied by the snippet below (paths as in lighteval's multilingual tasks):
from lighteval.metrics.dynamic_metrics import loglikelihood_acc_metric
from lighteval.metrics.normalizations import LogProbCharNorm, LogProbTokenNorm
from lighteval.tasks.default_prompts import LETTER_INDICES
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.multilingual.utils.task_utils import get_metrics_for_formulation, normalize_subset
from lighteval.tasks.requests import Doc
from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

# ARABIC_MMLU_SUBSETS (the list of ArabicMMLU subset names) is assumed to be defined
# elsewhere in the same module.

arabic_mmlu_templated_tasks = [
    LightevalTaskConfig(
        name=f"templated_mmlu_{Language.ARABIC.value}_{formulation.name.lower()}:{normalize_subset(subset)}",
        prompt_function=get_mcq_prompt_function(
            Language.ARABIC,
            lambda line: {
                "instruction": "السؤال التالي هو سؤال متعدد الإختيارات. اختر الإجابة الصحيحة:",
                "context": line["Context"],
                "question": line["Question"],
                "choices": [str(o) for o in [line[f"Option {i}"] for i in range(1, 6)] if o],
                "gold_idx": LETTER_INDICES.index(line["Answer Key"]),
            },
            formulation=formulation,
        ),
        suite=("lighteval",),
        hf_repo="MBZUAI/ArabicMMLU",
        hf_subset=subset,
        evaluation_splits=("test",),
        hf_avail_splits=["dev"],
        metric=get_metrics_for_formulation(
            formulation,
            [
                loglikelihood_acc_metric(normalization=LogProbTokenNorm()),
                loglikelihood_acc_metric(normalization=LogProbCharNorm()),
            ],
        ),
    )
    for subset in ARABIC_MMLU_SUBSETS
    for formulation in [
        MCFFormulation("NativeLetters"),
    ]
]

The task name in this case is: templated_mmlu_ara_mcf.
Btw, lighteval now supports running all subsets by just specifying the common task name;
e.g. lighteval|templated_mmlu_ara_mcf|0|0 will run all subsets.

I think it's a good middle ground between the two:

{'choices': [' أ', ' ب', ' ج', ' د'],
 'ctx': '',
 'gold_index': [0],
 'instruction': 'السؤال التالي هو سؤال متعدد الإختيارات. اختر الإجابة الصحيحة:',
 'num_asked_few_shots': -1,
 'num_effective_few_shots': -1,
 'original_query': '',
 'query': 'السؤال التالي هو سؤال متعدد الإختيارات. اختر الإجابة الصحيحة:\n'
          'سؤال: ما هي ذاكرة الوصول العشوائي\n'
          ' أ. هي الذاكرة التي تخزنبها البيانات التي يحتاجها الكمبيوتر أثناء '
          'فترة تشغيله\n'
          ' ب. هي الذاكرة التي تخزن بها البيانات التي تعالج من قبل الكمبيوتر\n'
          ' ج. هو المكان الذي نستخدمه لتخزين البيانات التي نتعامل معها دائماً '
          ', كما تستخدم لنقل البيانات من مكان آلخر\n'
          ' د. هي ذاكرة دائماً يحتاجها الكمبيوتر طول  فترة التشغيل وهي خاصة '
          'بنقل بيانات بطاقات الصوت والفيديو والشبكة\n'
          'إجابة:',
 'specific': None,
 'target_for_fewshot_sorting': None,
 'task_name': 'lighteval|templated_mmlu_ara_mcf:computer_science_university',
 'unconditioned_query': 'إجابة:'}

@alielfilali01 (Contributor, Author) commented

I noticed this as well:

> Base models work better with yours, while chat models work better with lighteval.

But inceptionai/jais-family-6p7b-chat was an exception in ArabicMMLU.

I agree that, for the sake of consistency, OpenAI MMMLU should be run through the multilingual lighteval suite.

About this:

> Btw, lighteval now supports running all subsets by just specifying the common task name:

I tried it and kept running into errors, so I just copy-pasted all the tasks into a txt file and ran that.

@alielfilali01 (Contributor, Author) commented

> What errors?

@hynky1999, honestly I totally forgot; it is somewhere in one of the log files, but I can't remember which one 😶
I remember trying to debug it assuming I was doing something wrong (which most probably I was), but it was so much hassle that I just copied the subset tasks into a file instead and ran that, and it went through smoothly.

@NathanHB (Member) commented

Then we are using lighteval's version for OpenAI MMLU and your version for arabic_mmlu. @alielfilali01, just waiting for you to use the templated version as shown here and we will be good to merge! :)

@alielfilali01 (Contributor, Author) commented Nov 22, 2024

Hey @NathanHB, is it already defined, or should I define it? Anyway, I will try to get it done tomorrow.
Also, I just remembered that I need to remove the openai mmmlu task from arabic_evals.py!

remove openai mmmlu task following the discussion here: #372
remove openai mmmlu task following the discussion here: #372
@NathanHB (Member) commented

It is not yet defined using the templated prompt; @hynky1999 provided some code to use the templates!

> For the MBZUAI mmlu, I am open to going with your implementation (since you are the creator), but would maybe prefer a middle ground, because it's still very hard-coded and doesn't allow switching to CF.
> Could you try running with this task definition?

@alielfilali01 (Contributor, Author) commented

Hey @hynky1999, I'm wondering which normalization exactly is used in the mcf formulation when calling get_metrics_for_formulation:

metric=get_metrics_for_formulation(
            formulation,
            [
                loglikelihood_acc_metric(normalization=LogProbTokenNorm()),
                loglikelihood_acc_metric(normalization=LogProbCharNorm()),
                loglikelihood_acc_metric(normalization=LogProbPMINorm()),
            ],
        ),

For example, in our ArabicMMLU the metric is loglikelihood_acc_norm with LogProbCharNorm.
From what I understood in src/lighteval/tasks/multilingual/utils/task_utils.py, the normalization for mcf is None! Can you confirm that?
I assume this probably explains the differences in the scores above, especially since chat models usually "understand" that they need to assign higher probability to a letter, while base models give lower probability to a letter and more to a continuation... Honestly, I don't know, I'm just hallucinating here.

@alielfilali01 (Contributor, Author) commented Nov 23, 2024

Update on running the templated_mmlu_ara_mcf tasks.
I keep hitting this error:

[rank0]:     raise ValueError(f"Cannot find tasks {task_name} in task list or in custom task registry)")
[rank0]: ValueError: Cannot find tasks lighteval|templated_mmlu_ara_mcf:accounting_university in task list or in custom task registry)

The task is defined in lighteval/src/lighteval/tasks/multilingual/tasks.py as mentioned above, and the custom tasks tag is lighteval.tasks.multilingual.tasks.
I can't figure out what exactly is wrong.
cc: @hynky1999

@hynky1999 (Collaborator) commented Nov 23, 2024

Because it's not there! You have to create it, or give me rights to push to your branch.
And I can't update your PR:

error: Authentication error: Authentication required: You must have push access to verify locks
error: failed to push some refs to 'https://github.com/alielfilali01/lighteval.git'

@hynky1999 (Collaborator) commented

> For example, in our ArabicMMLU the metric is loglikelihood_acc_norm with LogProbCharNorm.
> From what I understood in src/lighteval/tasks/multilingual/utils/task_utils.py, the normalization for mcf is None! Can you confirm that?

Yes, the explanation is simple:

  • If you evaluate with A/B/C/D as targets (mcf), we don't normalize, because you want to know the probability of generating the letters. Also, most LLMs use just a single token for them anyway.
  • Conversely, if you evaluate with cf, you are predicting the continuations themselves, and they can have drastically different lengths. In this case we use normalizations.

Lastly, why bother with changing the normalizations at all? If you normalize with chars/tokens for mcf you will get the same results, because the targets are just a single token, right? Well yes, but we also use PMI normalization for some tasks, which makes evals 2x more expensive (you have to run two logprob calls per sample), so that's why we bother with changing the norms for mcf.

PS: I noticed a bug yesterday; the produced metric is called acc_, will fix it soon.
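
To make the MCF/CF distinction concrete, a rough sketch of the selection logic being described (simplified, not lighteval's actual internals; the CF normalization shown is the character-length variant, i.e. the LogProbCharNorm idea):

def pick_mcf(letter_logprobs: list[float]) -> int:
    # MCF: the targets are single letters (A/B/C/D), so the raw log-probabilities are
    # compared directly; no length normalization (and no extra PMI logprob call) is needed.
    return max(range(len(letter_logprobs)), key=lambda i: letter_logprobs[i])

def pick_cf(continuation_logprobs: list[float], continuations: list[str]) -> int:
    # CF: the targets are the full answer texts, whose lengths can differ a lot, so each
    # log-probability is normalized, here by character count.
    scores = [lp / max(len(c), 1) for lp, c in zip(continuation_logprobs, continuations)]
    return max(range(len(scores)), key=lambda i: scores[i])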

@alielfilali01 (Contributor, Author) commented

> Because it's not there! You have to create it, or give me rights to push to your branch.

Indeed, I did define it in my local lighteval/src/lighteval/tasks/multilingual/tasks.py as presented above.

@hynky1999 (Collaborator) commented

🤔 It's tough to tell, can you push it?

@hynky1999 (Collaborator) commented

cc @alielfilali01, could you push the changes so I can check what's wrong?

Adding a templated version of arabic mmlu based on @hynky1999's request in the #372 PR
@alielfilali01 (Contributor, Author) commented

Hey @hynky1999, so sorry for the delay! I was planning to after seeing your first comment, but I got carried away with other things and it totally slipped my mind! Thanks for the reminder too.
Please review the last commit, since that's how I did it locally, and let me know what I did wrong. Also, if it is sound, feel free to run it from your side as well.

@alielfilali01 (Contributor, Author) commented

Hey @hynky1999, have you managed to take a look yet?

@hynky1999 (Collaborator) commented

@alielfilali01, I was a bit busy. The issue is that you didn't add it to TASKS_TABLE.
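
For reference, a minimal sketch of the fix being described, assuming the module follows lighteval's custom-task convention of collecting its configs in a module-level TASKS_TABLE list (with arabic_mmlu_templated_tasks being the list defined earlier in this thread):

# At the bottom of the custom/multilingual tasks module: every LightevalTaskConfig that
# should be resolvable by name has to be collected in TASKS_TABLE, otherwise the registry
# raises "Cannot find tasks ..." even though the configs are defined in the file.
TASKS_TABLE = [
    *arabic_mmlu_templated_tasks,
    # ... plus the other task configs already defined in the module
]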

@alielfilali01 (Contributor, Author) commented

Ooooh 😶 Alright, will do tomorrow and get back to you on the status.

@HuggingFaceDocBuilderDev (Collaborator) commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@clefourrier (Member) commented

Doing one last run of the tests and we should be good to go.

@clefourrier merged commit de8dba3 into huggingface:main on Dec 11, 2024.
3 checks passed.