
Cannot reproduce the evaluation score of HellaSwag, WiC #37

Open · rycont opened this issue Nov 29, 2023 · 0 comments

rycont commented Nov 29, 2023

I evaluated the polyglot-ko-1.3b model on HellaSwag and WiC from KoBEST, and my results differ from both the paper and the Hugging Face model card.

Environment

  • Few-shot examples: 5
  • Model: EleutherAI/polyglot-ko-1.3b
  • Metric: F1 (Macro) score
  • Compute: Colab / GPU (T4) instance

Here is the notebook I used for testing:
https://colab.research.google.com/drive/1lyQQisuB5JzuGk72haSdxXfXP20q4YGr?usp=sharing
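
For reference, a minimal sketch of the evaluation call, assuming the v0.3-era EleutherAI lm-evaluation-harness Python API (the version that exposes the hf-causal-experimental model type); the exact code in the notebook may differ:

```python
# Hedged sketch: reproduces the run configuration reported below,
# assuming lm-evaluation-harness's simple_evaluate entry point.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args="pretrained=EleutherAI/polyglot-ko-1.3b",
    tasks=["kobest_wic", "kobest_hellaswag"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])  # per-task acc / acc_norm / macro_f1
```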

1. WiC

The paper reports a score of 0.486, but I got only 0.4541.

  • The paper:

| params | 0-shot | 5-shot | 10-shot | 50-shot |
|--------|--------|--------|---------|---------|
| 1.3B   | 0.489  | 0.486  | 0.506   | 0.487   |
  • In my test:

hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8

| Task       | Version | Metric   | Value  | Stderr   |
|------------|---------|----------|--------|----------|
| kobest_wic | 0       | acc      | 0.4952 | ± 0.0141 |
|            |         | macro_f1 | 0.4541 | ± 0.0138 |
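
Since the gap between acc (0.4952) and macro_f1 (0.4541) matters here, a small illustration of how macro F1 can sit below plain accuracy when predictions are skewed toward one class (hypothetical labels, not KoBEST data):

```python
# Macro F1 averages the per-class F1 scores, so a class the model
# rarely predicts correctly drags the score below raw accuracy.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]  # hypothetical: model over-predicts class 0

print(accuracy_score(y_true, y_pred))             # 0.667
print(f1_score(y_true, y_pred, average="macro"))  # 0.625 (mean of per-class F1 0.75 and 0.50)
```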

2. HellaSwag

The paper reports a score of 0.526, but I got only 0.3984.

  • In the paper:

| params | 0-shot | 5-shot | 10-shot | 50-shot |
|--------|--------|--------|---------|---------|
| 1.3B   | 0.525  | 0.526  | 0.528   | 0.543   |
  • In my test:

hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8

| Task             | Version | Metric   | Value  | Stderr   |
|------------------|---------|----------|--------|----------|
| kobest_hellaswag | 0       | acc      | 0.4020 | ± 0.0219 |
|                  |         | acc_norm | 0.5280 | ± 0.0223 |
|                  |         | macro_f1 | 0.3984 | ± 0.0218 |
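
For context on the acc vs. acc_norm rows: as I understand the harness's multiple-choice scoring, acc picks the answer with the highest raw log-likelihood, while acc_norm divides each log-likelihood by the answer length first. A hypothetical sketch (made-up numbers, not harness internals verbatim):

```python
# Hedged sketch of length-normalized accuracy for one multiple-choice item.
import numpy as np

loglikelihoods = np.array([-40.0, -55.0, -60.0, -70.0])  # hypothetical per-choice scores
choice_lengths = np.array([18.0, 40.0, 30.0, 35.0])      # hypothetical answer lengths
gold = 1

acc = float(np.argmax(loglikelihoods) == gold)                        # 0.0: raw max is choice 0
acc_norm = float(np.argmax(loglikelihoods / choice_lengths) == gold)  # 1.0: normalization flips the pick
```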

I also found a Wandb report, Polyglot-Ko: Open-Source Korean Autoregressive Language Model, whose HellaSwag score matches my test result, 0.3984.

| params | n=0    | n=5    | n=10  | n=50   |
|--------|--------|--------|-------|--------|
| 1.3B   | 0.4013 | 0.3984 | 0.417 | 0.4416 |

In the case of other models

There are also discrepancies for kakaobrain/kogpt and skt/ko-gpt-trinity-1.2B-v0.5.

  • kakaobrain/kogpt

    Note that I tested kakaobrain/kogpt with an Int8-quantized model (see the loading sketch after this list).

|           | In the paper (FP16) | In my test (Int8) | In the Wandb Report |
|-----------|---------------------|-------------------|---------------------|
| COPA      | 0.7287              | 0.7277 (↓0.14%)   | 0.7287              |
| HellaSwag | 0.5833              | 0.4560 (↓21.82%)  | 0.456               |
| BoolQ     | 0.5981              | 0.6015 (↑0.56%)   | -                   |
| WiC       | 0.4775              | 0.3706 (↓22.38%)  | -                   |

  • skt/ko-gpt-trinity-1.2B-v0.5

|           | In the paper | In my test | In the Wandb Report |
|-----------|--------------|------------|---------------------|
| WiC       | 0.4313       | 0.3953     | -                   |
| HellaSwag | 0.5272       | 0.400      | 0.4                 |
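
A hedged sketch of how the Int8 run could have been set up with transformers + bitsandbytes; I'm assuming the standard load_in_8bit path and a published kogpt revision, since I don't know the notebook's exact loading code:

```python
# Assumption: Int8 quantization via bitsandbytes' load_in_8bit flag.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kakaobrain/kogpt"
revision = "KoGPT6B-ryan1.5b"  # assumption: kogpt publishes weights under named revisions

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    device_map="auto",   # let accelerate place the layers
    load_in_8bit=True,   # Int8 weights via bitsandbytes
)
```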