Add seq2seq eval benchmark callback #1274

Merged
merged 6 commits into from
Feb 13, 2024

Conversation

LeonardoEmili
Contributor

@LeonardoEmili LeonardoEmili commented Feb 8, 2024

Similar to #441, this PR adds an evaluation benchmark for generative tasks such as machine translation.

Description

This additional evaluation benchmark is self-contained and can be triggered via the do_causal_lm_eval configuration. It first generates a completion for every sample in the eval set (the generation length is configured via eval_max_new_tokens) and then scores these completions against the reference completions via 🤗 Evaluate. Metrics can be chosen from a supported subset; a requested metric is skipped with a warning if the library it depends on is not installed.
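The "skip with a warning" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `METRIC_REQUIREMENTS` table, the logger name, and the helpers `available_metrics` and `score_completions` are hypothetical names introduced here. Only `evaluate.load` and `metric.compute` are real 🤗 Evaluate calls, and the `"score"` key is what the `sacrebleu` and `chrf` metrics return.

```python
import importlib.util
import logging
from typing import Dict, List, Optional

LOG = logging.getLogger("causal_lm_bench_sketch")

# Illustrative mapping (an assumption, not the PR's actual table):
# metric name -> top-level module that must be importable to compute it.
METRIC_REQUIREMENTS: Dict[str, str] = {
    "sacrebleu": "sacrebleu",
    "chrf": "sacrebleu",
}


def available_metrics(
    requested: List[str], requirements: Optional[Dict[str, str]] = None
) -> List[str]:
    """Return the requested metrics that can be computed, warning about the rest."""
    reqs = METRIC_REQUIREMENTS if requirements is None else requirements
    usable = []
    for name in requested:
        module = reqs.get(name)
        if module is None:
            LOG.warning("Metric %r is not supported, skipping it.", name)
        elif importlib.util.find_spec(module) is None:
            LOG.warning("Metric %r needs %r, which is not installed; skipping it.", name, module)
        else:
            usable.append(name)
    return usable


def score_completions(
    metric_names: List[str], predictions: List[str], references: List[str]
) -> Dict[str, float]:
    """Score generated completions against references with 🤗 Evaluate."""
    import evaluate  # imported lazily so the helper above works without it

    scores = {}
    for name in available_metrics(metric_names):
        metric = evaluate.load(name)
        # sacrebleu and chrf accept one list of references per prediction.
        result = metric.compute(
            predictions=predictions, references=[[ref] for ref in references]
        )
        scores[name] = result["score"]
    return scores
```

Checking for the dependency up front keeps a missing optional library from aborting a long training run; the trade-off is that the requested metric is silently absent from the logs unless the warning is noticed.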

A few notes on this PR:

  • It could be improved further by removing duplicated code (e.g. it uses the same generation logic as LogPredictionCallback).
  • 🤗 Evaluate requires additional libraries to compute some metrics (e.g. sacrebleu is needed to compute BLEU). I have not added these libraries as optional dependencies; for now the code simply logs a warning if the libraries for the requested metrics are missing. This could probably be handled better (maybe by adding a [metrics] entry to extras_require?).

Tagging @winglian and @tmm1, who may be able to help. Cheers.

Motivation and Context

Enable Axolotl to compute generative evaluation metrics to better support generative tasks (e.g. machine translation or language modelling).

How has this been tested?

End-to-end SFT fine-tuning of an existing Llama-2-hf model, with and without the feature enabled.

Screenshots (if appropriate)

Types of changes

  • Add CausalLMBenchEvalCallback callback for evaluating generative tasks
  • Add do_causal_lm_eval to AxolotlTrainingArguments
  • Refactor eval_table_max_new_tokens into eval_max_new_tokens
  • Bump evaluate to 0.4.1 for COMET fix (see related issue)
  • Test behaviour with sample_packing=True
  • Add unit and integration tests
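One of the commits in this PR states that eval_sample_packing must be false when CausalLMBenchEvalCallback is used. A minimal sketch of such a check is shown below, assuming the configuration is available as a plain dict; the function name `validate_causal_lm_eval` and the error message are hypothetical and are not taken from Axolotl's code.

```python
from typing import Any, Dict


def validate_causal_lm_eval(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Reject configs that combine the generative benchmark with packed eval samples.

    Hypothetical helper for illustration; Axolotl's real validation may differ.
    """
    if cfg.get("do_causal_lm_eval") and cfg.get("eval_sample_packing"):
        raise ValueError(
            "do_causal_lm_eval requires eval_sample_packing to be false"
        )
    return cfg


# Example config using the option names from this PR (values are illustrative).
example_cfg = validate_causal_lm_eval(
    {
        "do_causal_lm_eval": True,
        "eval_max_new_tokens": 128,
        "eval_sample_packing": False,
    }
)
```

Failing early at config-validation time surfaces the conflict before any training step runs, rather than at the first evaluation.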

Social Handles (Optional)

@winglian
Collaborator

winglian commented Feb 9, 2024

Is this intended as a replacement of the log prediction callback?

Looks good so far. I assume you want to finish off the last two items before we merge?

@LeonardoEmili
Contributor Author

Is this intended as a replacement of the log prediction callback?

I believe the callbacks serve similar purposes but differ in their objective: the log callback generates a few (e.g. 5) samples for logging purposes, while the causal_lm benchmark generates completions for the entire eval dataset in order to compute metrics for generative tasks.

Ideally the two would share the same generation code (model.generate rather than predict, as discussed in this thread) and store the generated completions somewhere so they can be reused when do_causal_lm_eval=True. Do you have any suggestions on how to achieve this?

Looks good so far. I assume you want to finish off the last two items before we merge?

I'll have a look at the case when sample_packing=True but don't currently have time to create ad-hoc tests.

@LeonardoEmili
Contributor Author

@winglian please have another look and let me know if this looks good to you. Do you have any further suggestions?

Collaborator

@winglian winglian left a comment

lgtm

@winglian winglian merged commit 5a5d474 into axolotl-ai-cloud:main Feb 13, 2024
7 checks passed
@LeonardoEmili LeonardoEmili deleted the causal-lm-bench branch February 13, 2024 16:26
djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* Add CausalLMBenchEvalCallback for measuring seq2seq performance

* Fix code for pre-commit

* Fix typing and improve logging

* eval_sample_packing must be false with CausalLMBenchEvalCallback