
switch from training_args.bin to training_args.json #35010

Open
wants to merge 23 commits into main
Conversation

not-lain
Contributor

What does this PR do?

Switch from training_args.bin to training_args.json and only capture the parameters that the user passed.
I'm using the same approach we use in huggingface_hub's PyTorchModelHubMixin to store as few parameters as possible.
A minimal way to test this PR:

from transformers import TrainingArguments
args = TrainingArguments(output_dir="folder", eval_strategy="no")  # or any other parameters
print(args.to_json_string())
# outputs
"""
{
  "output_dir": "folder",
  "eval_strategy": "no",
  "logging_dir": "folder\\runs\\Nov29_02-44-45_iphone-laptop"
}
"""
# logging_dir is a special parameter that is always captured and added to the training_args because we want to ensure consistency

# stores the parameters into a file
args.to_json_file("training_args.json")

# loads an instance using the class directly
args2 = TrainingArguments.from_json_file("training_args.json")

Using this approach, we ensure that we only store the parameters that the user defined manually, and not the ones that have default values or the ones inferred from the system (i.e. cpu, cuda, tpu, ...), leaving some room for flexibility.

In a sense, the parameters are mutable, meaning the user can edit the saved file by hand, as in the sketch below.
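
For example, a minimal sketch of what that mutability looks like in practice, continuing from the snippet above (the edited key is just illustrative):

import json
from transformers import TrainingArguments

# edit the stored file by hand or programmatically
with open("training_args.json") as f:
    config = json.load(f)
config["eval_strategy"] = "epoch"  # tweak a stored parameter
with open("training_args.json", "w") as f:
    json.dump(config, f, indent=2)

# reload it; only the keys present in the file override the defaults
args3 = TrainingArguments.from_json_file("training_args.json")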

Fixes #34612

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr @SunMarc

@not-lain
Contributor Author

cc @muellerzr & @SunMarc for review.
The failed test is unrelated.

@not-lain
Contributor Author

not-lain commented Dec 9, 2024

friendly tagging @muellerzr & @SunMarc
The failed tests are unrelated.

Contributor

@muellerzr muellerzr left a comment

Very nice! I'm a fan of this change. However, we should be careful about resuming training from checkpoints produced by older versions, and we should likely keep a deprecation path until 5.0.0 (warn if a .bin is found, but do the shift to .json now).

How does that sound?
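
For illustration, a rough sketch of the kind of backward-compatible loading being suggested (this is not code from the PR; the helper name, messages, and paths are hypothetical):

import os
import warnings

import torch
from transformers import TrainingArguments

def load_training_args(checkpoint_dir):
    json_path = os.path.join(checkpoint_dir, "training_args.json")
    bin_path = os.path.join(checkpoint_dir, "training_args.bin")
    if os.path.isfile(json_path):
        return TrainingArguments.from_json_file(json_path)
    if os.path.isfile(bin_path):
        # legacy checkpoints store a pickled TrainingArguments object
        warnings.warn(
            "training_args.bin is deprecated and support will be removed in v5.0.0; "
            "new checkpoints save training_args.json instead.",
            FutureWarning,
        )
        return torch.load(bin_path, weights_only=False)
    raise FileNotFoundError(f"No training arguments found in {checkpoint_dir}")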

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@not-lain
Contributor Author

[don't merge yet, I need to fix something fundamental that was not captured by the tests]

@not-lain
Contributor Author

not-lain commented Dec 12, 2024

cc @muellerzr for feedback
I implemented a rollback: the default is now training_args.bin, and people can switch to training_args.json by setting the TRAINER_SAFE_SERIALIZE environment variable.

Originally there was no mechanism inside the Trainer API to load any training_args, and people used to load the training arguments manually. I added something useful, TrainingArguments.from_json_file(file_name), but I think I will integrate this loading mechanism with the Trainer API in another PR (optional new feature). A quick usage sketch follows.
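
A quick sketch of the opt-in flow described above (the accepted value of the environment variable and the checkpoint path are assumptions):

import os

# opt in to JSON serialization before the Trainer writes a checkpoint
os.environ["TRAINER_SAFE_SERIALIZE"] = "true"

from transformers import TrainingArguments

# ... run training as usual; checkpoints now contain training_args.json

# the saved arguments can then be reloaded manually with the new helper
args = TrainingArguments.from_json_file("folder/checkpoint-500/training_args.json")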

@not-lain
Contributor Author

friendly tagging @muellerzr here for a review.

@not-lain
Contributor Author

cc @muellerzr for review

Member

@SunMarc SunMarc left a comment

Thanks for the PR! Left a few comments!

src/transformers/trainer.py (outdated review thread, resolved)
src/transformers/trainer.py (outdated review thread, resolved)
# Handle the accelerator_config if passed
if is_accelerate_available() and isinstance(v, AcceleratorConfig):
d[k] = v.to_dict()
d[k] = serialize_parameter(k, v)
self._dict_torch_dtype_to_str(d)
Member

We can remove this and the function since you took care of that in serialize_parameter. Can you add a comment in serialize_parameter to explain what is happening to torch_dtype?
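
For context, a hypothetical sketch of the torch_dtype handling such a comment would describe (this is not the PR's actual serialize_parameter; only the dtype branch is illustrated):

import torch

def serialize_parameter(name, value):
    # torch_dtype is stored as its short string form ("float16" rather than
    # torch.float16) so the JSON file stays plain text and round-trips cleanly
    if isinstance(value, torch.dtype):
        return str(value).split(".")[-1]
    return value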

Contributor Author

I hope I did not misunderstand you; let me know if this has been resolved after the new changes.

Member

Looks fine, thanks! Since we are not using the _dict_torch_dtype_to_str method, you can also remove it in this PR.

tests/trainer/test_trainer.py (outdated review thread, resolved)
@not-lain
Contributor Author

cc @SunMarc for review

I've switched to dynamic serialization, which is something I came up with to tackle #25903.
This will serialize the parameters needed to make the training arguments lightweight, mutable, and reproducible.

Here's a thorough example of what to expect:

from transformers import TrainingArguments
args = TrainingArguments(output_dir="folder", eval_strategy="no")  # or any other parameters
print(args.to_json_string())
# outputs (logging_dir needs to stay consistent in case we want tensorboard to continue the plots)
"""
{
  "output_dir": "folder",
  "eval_strategy": "no",
  "logging_dir": "folder\\runs\\Dec27_21-48-40_iphone-laptop"
}
"""
# set_dataloader has some default parameters of its own, which get captured and serialized
# it also captures manually set parameters passed as either args or kwargs
args = args.set_dataloader() 
print(args.to_json_string())
# outputs
"""
{
  "output_dir": "folder",
  "eval_strategy": "no",
  "logging_dir": "folder\\runs\\Dec27_21-48-40_iphone-laptop",
  "auto_find_batch_size": false,
  "ignore_data_skip": false
}
"""
# save and reload
args.to_json_file("training_args.json")
args2 = TrainingArguments.from_json_file("training_args.json")
print(args2.to_json_string())
# output
"""
{
  "output_dir": "folder",
  "eval_strategy": "no",
  "logging_dir": "folder\\runs\\Dec27_21-48-40_iphone-laptop",
  "auto_find_batch_size": false,
  "ignore_data_skip": false
}
"""

Let me know if you have any feedback.

Member

@SunMarc SunMarc left a comment

LGTM! Thanks for iterating! Can you have a second look, @muellerzr?

@SunMarc SunMarc requested a review from muellerzr December 30, 2024 14:03
Development

Successfully merging this pull request may close these issues.

save training_args in another file format