
Update doc for metric_for_best_model when save_strategy="best". #35389

Merged
1 change: 0 additions & 1 deletion src/transformers/trainer.py
@@ -3156,7 +3156,6 @@ def _load_rng_state(self, checkpoint):
     def _determine_best_metric(self, metrics, trial):
         """
         Determine if the model should be saved based on the evaluation metrics.
-        If args.metric_for_best_model is not set, the loss is used.
 
         Returns:
             bool: True if a new best metric was found, else False
12 changes: 7 additions & 5 deletions src/transformers/training_args.py
@@ -476,11 +476,13 @@ class TrainingArguments:
 
         metric_for_best_model (`str`, *optional*):
             Use in conjunction with `load_best_model_at_end` to specify the metric to use to compare two different
-            models. Must be the name of a metric returned by the evaluation with or without the prefix `"eval_"`. Will
-            default to `"loss"` if unspecified and `load_best_model_at_end=True` (to use the evaluation loss).
+            models. Must be the name of a metric returned by the evaluation with or without the prefix `"eval_"`.
 
-            If you set this value, `greater_is_better` will default to `True`. Don't forget to set it to `False` if
-            your metric is better when lower.
+            If not specified, this will default to `"loss"` when either `load_best_model_at_end == True`
+            or `lr_scheduler_type == SchedulerType.REDUCE_ON_PLATEAU` (to use the evaluation loss).
+
+            If you set this value, `greater_is_better` will default to `True` unless the name ends with "loss".
+            Don't forget to set it to `False` if your metric is better when lower.
         greater_is_better (`bool`, *optional*):
             Use in conjunction with `load_best_model_at_end` and `metric_for_best_model` to specify if better models
             should have a greater metric or not. Will default to:
@@ -1636,7 +1638,7 @@ def __post_init__(self):
             self.save_steps = int(self.save_steps)
 
         # Sanity checks for load_best_model_at_end: we require save and eval strategies to be compatible.
-        if self.load_best_model_at_end:
+        if self.load_best_model_at_end and self.save_strategy != SaveStrategy.BEST:
             if self.eval_strategy != self.save_strategy:
                 raise ValueError(
                     "--load_best_model_at_end requires the save and eval strategy to match, but found\n- Evaluation "
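For context, here is a minimal sketch (not part of the diff) of the behavior the updated docstring and `__post_init__` describe. The expected values in the comments are assumptions based on the documentation above, and the `output_dir` path is hypothetical.

from transformers import TrainingArguments

# metric_for_best_model is left unset; per the updated docstring it should default to
# "loss" because load_best_model_at_end=True, and greater_is_better should then default
# to False because the metric name ends with "loss".
args = TrainingArguments(
    output_dir="tmp_trainer",  # hypothetical path
    eval_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
)
print(args.metric_for_best_model)  # expected: "loss"
print(args.greater_is_better)      # expected: False

# With the relaxed sanity check above, save_strategy="best" no longer needs to match
# eval_strategy when load_best_model_at_end=True, but metric_for_best_model must then
# be set explicitly (otherwise the Trainer raises a ValueError, as the test below shows).
best_args = TrainingArguments(
    output_dir="tmp_trainer",
    eval_strategy="steps",
    save_strategy="best",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)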
19 changes: 17 additions & 2 deletions tests/trainer/test_trainer.py
@@ -4220,7 +4220,9 @@ def test_save_best_checkpoint(self):
                 total=total,
             )
 
-        # Case 3: Metric name not provided; throw error.
+    def test_metric_for_best_model_behavior(self):
+        # Case 1: Metric name not provided when `save_strategy == "best"`.
+        # Should raise ValueError.
         with tempfile.TemporaryDirectory() as tmpdir:
             with self.assertRaises(ValueError) as context:
                 trainer = get_regression_trainer(
@@ -4232,9 +4234,22 @@ def test_save_best_checkpoint(self):
                     save_strategy="best",
                     compute_metrics=AlmostAccuracy(),
                 )
 
         self.assertIn("`args.metric_for_best_model` must be provided", str(context.exception))
 
+        # Case 2: Metric name not provided when `load_best_model_at_end == True`.
+        # `metric_for_best_model` should be set to `"loss"` by default.
+        with tempfile.TemporaryDirectory() as tmpdir:
[Review comment — Contributor]
Not critical / minor: tbh, it seems a bit out of place in test_save_best_checkpoint (as well as the previous case). I would probably move it into a separate test. Or should it otherwise at least call train and check the actual checkpoint saved?

[Review comment — Contributor Author]
I guess I agree that it logically does seem a bit out of place. I think cases 3 and 4 could be grouped into their own methods, since the point isn't so much to test save_strategy="best" itself but more to test the behavior related to metric_for_best_model.

I'm not sure that actually running training would be necessary, though. Case 3 simply checks whether a ValueError is properly thrown at Trainer initialization time, and case 4 simply checks whether the __post_init__ method of TrainingArguments initializes metric_for_best_model to "loss" when save_strategy != "best" and load_best_model_at_end=True. Neither of these seems to require training/evaluation; Trainer instantiation seems sufficient.

[Review comment — Contributor]
Agreed, I would also split it into a separate test (or two tests). And yes, we are testing the init here; that's why it was looking out of place.

[Review comment — Member]
No strong opinion. We can split it into a separate test for cases 3 and 4.

+            trainer = get_regression_trainer(
+                a=1.5,
+                b=2.5,
+                output_dir=tmpdir,
+                learning_rate=0.1,
+                eval_strategy="steps",
+                save_strategy="steps",
+                load_best_model_at_end=True,
+            )
+            self.assertTrue(trainer.args.metric_for_best_model == "loss")
 
 
 @require_torch
 @is_staging_test
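Following the suggestion in the review thread to split cases 3 and 4 (cases 1 and 2 in the diff above) into separate tests, one possible shape is sketched below. It assumes placement in tests/trainer/test_trainer.py, where get_regression_trainer and AlmostAccuracy are defined; the class name and the arguments hidden by the collapsed part of the first call are assumptions.

import tempfile
import unittest

# Sketch only: in practice these methods would live on the existing integration test
# class in tests/trainer/test_trainer.py, which also defines get_regression_trainer
# and AlmostAccuracy, rather than on a new TestCase.
class TrainerMetricForBestModelTest(unittest.TestCase):
    def test_best_save_strategy_requires_metric(self):
        # save_strategy="best" without metric_for_best_model should raise a
        # ValueError at Trainer initialization time.
        with tempfile.TemporaryDirectory() as tmpdir:
            with self.assertRaises(ValueError) as context:
                get_regression_trainer(
                    a=1.5,
                    b=2.5,
                    output_dir=tmpdir,
                    learning_rate=0.1,
                    eval_strategy="steps",
                    save_strategy="best",
                    compute_metrics=AlmostAccuracy(),
                )
            self.assertIn("`args.metric_for_best_model` must be provided", str(context.exception))

    def test_load_best_model_at_end_defaults_metric_to_loss(self):
        # load_best_model_at_end=True with no metric name: TrainingArguments.__post_init__
        # should default metric_for_best_model to "loss".
        with tempfile.TemporaryDirectory() as tmpdir:
            trainer = get_regression_trainer(
                a=1.5,
                b=2.5,
                output_dir=tmpdir,
                learning_rate=0.1,
                eval_strategy="steps",
                save_strategy="steps",
                load_best_model_at_end=True,
            )
            self.assertEqual(trainer.args.metric_for_best_model, "loss")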