Add new Arabic benchmarks (5) and enhance existing tasks #372
Conversation
Add new Arabic benchmarks and update existing tasks.
- Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.
- Added new benchmarks: `arabic_mmlu` (ArabicMMLU, https://arxiv.org/abs/2402.12840), `arabic_mmlu_ht` (human-translated), and `MadinahQA` from MBZUAI, as well as `arabic_mmmlu` (OpenAI MMMLU) and `AraTrust`, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).
- Enhanced prompt functions for better flexibility in answer options.
Rename file to reflect that it contains v1 leaderboard tasks
Tasks for v2 of OALL
add new and renamed tasks
Hi, thanks for adding the benchmarks. Secondly, we have already added both `arabic_mmmlu` and `openai_mmlu`. I would prefer not to add duplicates, but I am open to discussing adding them with/without the instruction modifications. |
Hey @hynky1999, thanks for your input!
I went ahead and added them just to test how the different implementations might affect the scores (hopefully they don't!). I can run this test on my fork and compare the results with your version, and then we can decide whether to keep them or not. What do you think?
I haven't fully wrapped my head around the templates yet; it might take me a few days. If you're able to help with this integration in the meantime, feel free to contribute! Otherwise, I'll try to get to it by next week at the latest. Also, I'm unsure why the format check is failing. I ran Thanks! |
Hi! Thanks for the PR. For the formatting issues, you should use the pre-commit hooks.
|
Fix formatting issues for
Thanks for fixing the formatting! You can find the docs for adding a new task using the prompt templates here; don't hesitate to reach out if you need any help :) |
Hey @NathanHB, thanks for pointing me to the docs on adding prompt templates. I'm planning to add that in a separate PR in the near future. For now, I believe we can move forward with this unless it contradicts the team's plan for future versions of LightEval. |
Hi! I think we can indeed move forward with this. One last thing before we do: did you check the difference in results between your implementation and the current implementations of Arabic MMLU and OpenAI MMLU? |
Hey @clefourrier, unfortunately I was planning to, but it slipped through as I forgot to add it to my to-do list! I can prioritize this over the weekend and get back to you with an update by Monday. Does this timeline work? I assume there will be no difference (or at least that's what we hope for). The evals in question:

lighteval|mmlu_ara_mcf|0|0
lighteval|openai_mmlu_ara_mcf|0|0

Let me know if there's anything else to consider. |
Yep, the timeline works on our side! I think these are the correct evals. |
Yes these are the correct evals |
I'm sorry, but I don't understand why I keep hitting this error:
My main command:

# Run the evaluation command
srun -N $SLURM_NNODES --ntasks=$SLURM_NTASKS --cpus-per-task=$SLURM_CPUS_PER_TASK --gres=gpu:$SLURM_GPUS_PER_NODE \
yes 'y' | lighteval accelerate \
--model_args "pretrained=$model,trust_remote_code=$TRUST_REMOTE_CODE" \
--tasks "lighteval|openai_mmlu_ara_mcf|0|0, lighteval|mmlu_ara_mcf|0|0" \
--override_batch_size 1 \
--output_dir=$RESULTS_DIR
I don't understand where I'm passing the
cc: @hynky1999, @NathanHB, @clefourrier |
You wanna run with |
Oooh I thought |
Add missing task: OpenAI's MMMLU Arabic subset
Correct order
Hey @clefourrier, following up on your previous question:
Please find the results here, which I find pretty interesting!
I don't really see any consistent pattern: sometimes the score is higher with the community suite implementation, and sometimes with the lighteval implementation by @hynky1999. |
I think there is one consistency: Re chat models, did you run with templates? In any case: for the MBZUAI MMLU, I am open to going with your implementation (since you are the creator), but would maybe prefer a middle ground, because it's still very hard-coded and doesn't allow switching to CF:
The task name in this case is: I think it's a good middle ground between the two:
|
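For readers following the thread: below is a minimal, untested sketch of what such a "templated" multiple-choice (MCF) task config could look like, assuming lighteval's multilingual template helpers (`get_mcq_prompt_function`, `MCFFormulation`, `LightevalTaskConfig`) behave as used elsewhere in this thread. The import paths, dataset repo, column names, and subset name are illustrative assumptions, not the exact code from this PR.

```python
# Illustrative sketch only: a templated MCF task config for ArabicMMLU.
# Import paths, dataset columns, and the subset name are assumptions; check the
# lighteval repo (tasks/templates and tasks/multilingual) for the real API.
from lighteval.metrics.dynamic_metrics import loglikelihood_acc_metric
from lighteval.metrics.normalizations import LogProbTokenNorm
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

arabic_mmlu_templated_task = LightevalTaskConfig(
    name="templated_mmlu_ara_mcf:accounting_university",  # example subset name
    prompt_function=get_mcq_prompt_function(
        Language.ARABIC,
        # Adapter from a dataset row to the template's fields; the column names
        # ("Question", "Option N", "Answer Key") are guesses for MBZUAI/ArabicMMLU.
        lambda line: {
            "question": line["Question"],
            "choices": [line[f"Option {i}"] for i in range(1, 6) if line.get(f"Option {i}")],
            "gold_idx": "ABCDE".index(line["Answer Key"]),
        },
        formulation=MCFFormulation(),  # switching to CF would only change this argument
    ),
    suite=("lighteval",),
    hf_repo="MBZUAI/ArabicMMLU",
    hf_subset="Accounting (University)",
    evaluation_splits=("test",),
    metric=[loglikelihood_acc_metric(normalization=LogProbTokenNorm())],
)
```

The point of the middle ground is that the formulation (MCF vs. CF) and the metrics are parameters of the config rather than being hard-coded in the prompt function.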
I noticed this as well:
But I agree that, for the sake of consistency, OpenAI MMMLU should be run through the multilingual lighteval suite. About this:
I tried it and was still running into errors, so I just copy-pasted all the tasks into a txt file and ran that. |
@hynky1999, honestly I totally forgot; it is somewhere in one of the log files, but I can't remember which one 😶 |
Then we are using lighteval's version for OpenAI MMLU and your version for arabic_mmlu. @alielfilali01, just waiting for you to use the templated version as shown here, and we will be good to merge! :) |
Hey @NathanHB, is it already defined, or should I define it? Either way, I will try to get it done tomorrow. |
It is not yet defined using the templated prompt; @hynky1999 provided some code to use the templates!
|
Hey @hynky1999, I'm wondering which normalization is used exactly in:

metric=get_metrics_for_formulation(
    formulation,
    [
        loglikelihood_acc_metric(normalization=LogProbTokenNorm()),
        loglikelihood_acc_metric(normalization=LogProbCharNorm()),
        loglikelihood_acc_metric(normalization=LogProbPMINorm()),
    ],
),

For example, in our |
Update on running:

[rank0]: raise ValueError(f"Cannot find tasks {task_name} in task list or in custom task registry)")
[rank0]: ValueError: Cannot find tasks lighteval|templated_mmlu_ara_mcf:accounting_university in task list or in custom task registry)

The task is defined in |
Because it's not there! You have to create it, or give me rights to push to your branch
|
Yes, the explanation is simple:
Lastly, why bother with changing normalizations? If you normalize with chars/tokens for MCF, you will get the same results because the targets are just a single token, right? Well yes, but we also use PMI normalization for some tasks, which makes the evals 2x more expensive (you have to run two logprob calls for a single sample), so that's why we bother with changing the norms for MCF. PS: I noticed a bug yesterday, and the produced metric is called |
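To make the trade-off above concrete, here is a minimal sketch of what the three normalizations named in the snippet roughly compute. This is an illustration of the standard definitions, not lighteval's actual implementation.

```python
# Rough illustration of the three log-likelihood normalizations mentioned above.

def token_normalized(sum_logprob: float, num_tokens: int) -> float:
    # LogProbTokenNorm: average log-probability per token of the choice text.
    return sum_logprob / num_tokens

def char_normalized(sum_logprob: float, num_chars: int) -> float:
    # LogProbCharNorm: average log-probability per character, so longer choices
    # are not penalized simply for being longer.
    return sum_logprob / num_chars

def pmi_normalized(conditional_logprob: float, unconditional_logprob: float) -> float:
    # LogProbPMINorm: pointwise mutual information, i.e. how much the question
    # raises the probability of the choice. It needs a second log-prob call
    # (the unconditional one), which is why it roughly doubles the eval cost.
    return conditional_logprob - unconditional_logprob
```

For MCF prompts, where each target is a single answer letter, token and character normalization produce the same ranking, which is the point made above.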
Indeed I did define it in my local |
🤔 It's tough to tell; can you push it? |
cc @alielfilali01 could you push the changes so I can check what's wrong? |
Adding a templated version of Arabic MMLU based on @hynky1999's request in the #372 PR
Hey @hynky1999, so sorry for the delay! I was planning to after seeing your first comment, but I got carried away with other things and it totally slipped my mind! Thanks for the reminder too. |
Hey @hynky1999, have you managed to take a look yet? |
@alielfilali01 I was a bit busy. The issue is that you didn't add it to TASK_TABLE. |
Ooooh 😶 alright, I will do that tomorrow and get back to you on the status. |
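For context, a hypothetical sketch of the registration step being discussed. The table and config names below are placeholders; the exact variable name (TASK_TABLE vs. TASKS_TABLE) should be checked against the custom task module in this PR.

```python
# Hypothetical sketch: lighteval resolves custom task names by scanning a table
# defined in the custom task module, so the new templated configs must be listed
# there. The names below are placeholders for the configs defined in this PR.
TASKS_TABLE = [
    *ARABIC_V1_LEADERBOARD_TASKS,   # existing v1 leaderboard tasks
    *ARABIC_MMLU_TEMPLATED_TASKS,   # the new templated ArabicMMLU configs
]
```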
remove arabic_mmlu_templated_tasks
Doing one last run of the tests and we should be good to go |
- Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.
- Introduced three new MMLU-style benchmarks:
  - `arabic_mmlu`: native Arabic MMLU benchmark introduced by MBZUAI, based on the official "ArabicMMLU" paper (https://arxiv.org/abs/2402.12840).
  - `arabic_mmlu_ht`: human-translated version from MBZUAI, providing a more accurate, higher-quality translation of the original work by Hendrycks et al. (2021) on Measuring Massive Multitask Language Understanding (MMLU).
  - `arabic_mmmlu`: Arabic subset of OpenAI's Multilingual MMLU (MMMLU), which is human-annotated and targets similar subjects.
- Added the AraTrust benchmark, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).
- Added the MadinahQA benchmark from MBZUAI.
- Comparative study across the different versions of Arabic MMLU:
  - `arabic_mmlu_mt` (machine translated using an NMT engine) shows competitive results compared to the human-translated versions, indicating the efficacy of the translation engine.
  - The Okapi version (`arabic_mmlu_okapi`), which was translated using the GPT-3.5 API (ChatGPT), shows lower correlation and performance, reflecting potential flaws and lower translation quality.
  - The attached table below shows the comparative analysis of model performances across the different Arabic MMLU datasets.

cc: @clefourrier, @NathanHB, @hynky1999