
feat: Upgrade Weights & Biases callback #29125

Closed

Conversation

parambharat
Contributor

What does this PR do?

This PR adds a few new functionalities to the Weights & Biases callback:

  • Logs the PEFT and LoRA config to wandb if present

  • Adds model parameter counts to the wandb config and artifact metadata

  • Adds an on_predict method to log prediction metrics

  • Prints the model architecture to a file logged alongside the wandb artifact

  • Logs the initial and final models to the wandb artifact for full reproducibility

  • Adds step and epoch aliases to checkpoint artifacts

  • Here's a link to what the logged artifacts look like

  • Here's a run overview page with the added config and metadata for a run with PEFT configs logged
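The PEFT-config logging in the first bullet amounts to reading the model's peft_config attribute when it exists. A minimal sketch of the idea (collect_peft_config is a hypothetical helper name, and the PEFT-wrapped model is duck-typed here; the actual callback code lives in transformers' integration_utils):

```python
def collect_peft_config(model) -> dict:
    """Return the PEFT/LoRA adapter configs to log, or {} for non-PEFT models."""
    if not hasattr(model, "peft_config"):
        return {}
    # peft_config maps adapter name -> PeftConfig; plain dicts also work here
    return {
        name: cfg.to_dict() if hasattr(cfg, "to_dict") else cfg
        for name, cfg in model.peft_config.items()
    }
```

In the callback, the returned dict would be merged into the wandb run config alongside the trainer arguments.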

Before submitting

Who can review?

Contributor

@muellerzr muellerzr left a comment

Thanks! Overall looks good to me

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks, I have a few small questions but it's good otherwise!

logger.info("Logging model artifacts. ...")
model_name = (
    f"model-{self._wandb.run.id}"
    if (args.run_name is None or args.run_name == args.output_dir)
    else f"model-{self._wandb.run.name}"
)
# add the model architecture to a separate text file
Collaborator

I am not 100% sure I understand why the architecture would change over time during training?

Contributor Author

I am not 100% sure I understand why the architecture would change over time during training.

@ArthurZucker : Here, we log the model architecture to a text file in the model artifact. Some of our users have requested this. This helps to audit and reproduce the experiment easily.

Comment on lines -824 to +898
-    f"checkpoint-{self._wandb.run.id}"
+    f"model-{self._wandb.run.id}"
     if (args.run_name is None or args.run_name == args.output_dir)
-    else f"checkpoint-{self._wandb.run.name}"
+    else f"model-{self._wandb.run.name}"
Collaborator

is that not breaking? could we keep model and aliases=["model"] ?

Contributor Author

is that not breaking? could we keep model and aliases=["model"] ?

@ArthurZucker : This doesn't break existing functionality. Currently, we create separate Weights & Biases Artifacts for models and checkpoints. However, this results in multiple artifacts for the same experiment. This instead creates a single artifact with model and checkpoint aliases, allowing the users to log models and checkpoints to the same artifact.
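The single-artifact scheme described here can be sketched as a tiny helper (checkpoint_aliases is a hypothetical name; the exact alias strings in the PR evolved during review):

```python
def checkpoint_aliases(epoch: float, global_step: int) -> list:
    """Aliases attached when a checkpoint version is logged to the model artifact."""
    return ["checkpoint", f"epoch_{round(epoch, 2)}", f"global_step_{global_step}"]
```

Each checkpoint then becomes a new version of the same model-{run_id} artifact, addressable by epoch or step, instead of creating a second checkpoint-{run_id} artifact.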

Comment on lines +906 to +913
def on_predict(self, args, state, control, metrics, **kwargs):
    if self._wandb is None:
        return
    if not self._initialized:
        self.setup(args, state, **kwargs)
    if state.is_world_process_zero:
        metrics = rewrite_logs(metrics)
        self._wandb.log(metrics)
Collaborator

could you explain why we need this?

Contributor Author

could you explain why we need this?

@ArthurZucker : This allows users to log model predictions and metrics to the W&B dashboard when they use trainer.predict. This is quite useful in evaluation runs.
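For context, rewrite_logs namespaces metric keys so the W&B dashboard groups them into panels; a simplified stand-in (not the exact transformers implementation):

```python
def rewrite_logs_sketch(logs: dict) -> dict:
    """Prefix metric keys so W&B groups them into train/eval/test panels."""
    out = {}
    for key, value in logs.items():
        if key.startswith("eval_"):
            out["eval/" + key[len("eval_"):]] = value
        elif key.startswith("test_"):
            out["test/" + key[len("test_"):]] = value
        else:
            out["train/" + key] = value
    return out
```

With this, metrics returned by a prediction run land under a test/ prefix in the dashboard rather than mixing with training metrics.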

@parambharat
Contributor Author

@muellerzr and @ArthurZucker : Thank you for the review and comments. I have responded to the review comments with some explanations. Let me know if it looks good.

@muellerzr
Contributor

@parambharat can you resolve the conflict? Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

# Conflicts:
#	src/transformers/integrations/integration_utils.py
@parambharat
Contributor Author

@muellerzr : Resolved 👍

@muellerzr muellerzr requested a review from amyeroberts March 13, 2024 20:13
@muellerzr
Contributor

cc @amyeroberts for a final review as she's on watch this week 🤗

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for working on this!

Just a few small-ish questions and comments


     ckpt_dir = f"checkpoint-{state.global_step}"
     artifact_path = os.path.join(args.output_dir, ckpt_dir)
     logger.info(f"Logging checkpoint artifacts in {ckpt_dir}. ...")
     checkpoint_name = (
-        f"checkpoint-{self._wandb.run.id}"
+        f"model-{self._wandb.run.id}"
Collaborator

This doesn't match the logging message above in L895

Contributor Author

While we name the artifact as model, for checkpoints we add an additional alias checkpoint. This ensures that an artifact for a single run includes both model and checkpoints. Here's an example with different versions

[screenshot: artifact versions showing the model and checkpoint aliases]

     )
     artifact = self._wandb.Artifact(name=checkpoint_name, type="model", metadata=checkpoint_metadata)
     artifact.add_dir(artifact_path)
-    self._wandb.log_artifact(artifact, aliases=[f"checkpoint-{state.global_step}"])
+    self._wandb.log_artifact(
+        artifact, aliases=["checkpoint", f"epoch_{round(state.epoch, 2)}", f"global_step_{state.global_step}"]
Collaborator

Underscores and hyphens are used inconsistently for the aliases, e.g. final-model versus here. We should pick one or the other.

Contributor Author

Fixed. Using underscores consistently across names and aliases now.

Comment on lines 865 to 866
elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)
Collaborator

I think this means that some modules won't be printed out - this would only work for PEFT models

Contributor Author

Currently, we intend to support logging architectures for PreTrainedModel, TFPreTrainedModel, and PEFT models. The current logic handles this. We can add Flax and other model types based on usage and user feature requests.

type="model",
metadata={
    "model_config": model.config.to_dict() if hasattr(model, "config") else None,
    "num_parameters": model.num_parameters(),
Collaborator

As per the logic on L769, we aren't guaranteed the model has this property

Contributor Author

@parambharat parambharat Mar 21, 2024

Good catch. Handled this and changed it to use config.get instead.
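The guard being discussed can be sketched like this (a hedged illustration of the idea rather than the exact diff; artifact_metadata is a hypothetical name):

```python
def artifact_metadata(model) -> dict:
    """Build artifact metadata without assuming optional attributes exist."""
    metadata = {
        "model_config": model.config.to_dict() if hasattr(model, "config") else None,
    }
    # only include the parameter count when the model actually exposes it
    if hasattr(model, "num_parameters"):
        metadata["num_parameters"] = model.num_parameters()
    return metadata
```

The point is that plain torch modules without a transformers-style config or num_parameters() no longer raise an AttributeError when the artifact is built.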

Comment on lines 798 to 799
elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)
Collaborator

Same comment here about PEFT models

Contributor Author

Same as above. Currently, we intend to support logging architectures for PreTrainedModel, TFPreTrainedModel, and PEFT models; the current logic handles this. We can add Flax and other model types based on usage and user feature requests.

)
model.save_pretrained(temp_dir)
# add the architecture to a separate text file
with open(f"{temp_dir}/model_architecture.txt", "w+") as f:
Collaborator

This logic is a repeat of L856-L866; it should be abstracted out to a utility function.

Contributor Author

Thanks. Have abstracted it out to a separate function save_model_architecture_to_file

elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)

for f in Path(temp_dir).glob("*"):
Collaborator

This seems like it could be really slow: taking all the files in the current directory and saving them out as artefacts.

Contributor Author

This should not be an issue; it is how the current artifact logging works. While the files are added to the artifact, the upload happens asynchronously via the wandb service, which allows the experiment to continue while the artifact files are uploaded to W&B.

Also, having all the files in a model/checkpoint allows users to make reproducible runs.
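For reference, Path.glob("*") is a single non-recursive scan of the checkpoint directory's top-level entries; a self-contained demo using a temporary directory:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # stand-ins for a saved checkpoint's contents
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"\x00")

    # each entry would be added to the wandb artifact; the actual upload
    # happens asynchronously in the wandb service process
    files = sorted(p.name for p in Path(tmp).glob("*"))
```

So the synchronous cost in the training process is just the directory listing, not the network transfer.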

@parambharat
Contributor Author

@amyeroberts : Thanks for the detailed comments and review. I have addressed all the comments and responded to them. Please take a look and let me know if there is anything pending.

@parambharat
Contributor Author

Hi @amyeroberts, just checking in on this PR. Let me know if you have any further comments, and whether my responses to the review comments address them so they can be resolved.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for iterating on this!

I'm still not 100% sure about some aspects, e.g. saving out the model architecture at the end of training. However, as there have been demonstrated successful runs with outputs, and integrations are maintained by contributors rather than the transformers team, I'm happy for this to be merged, with any other maintenance / follow-up handled by the W&B team.

     )
     artifact = self._wandb.Artifact(name=checkpoint_name, type="model", metadata=checkpoint_metadata)
     artifact.add_dir(artifact_path)
-    self._wandb.log_artifact(artifact, aliases=[f"checkpoint-{state.global_step}"])
+    self._wandb.log_artifact(
+        artifact, aliases=[f"epoch_{round(state.epoch, 2)}", f"checkpoint_global_step_{state.global_step}"]
Collaborator

Still inconsistent here with - vs _

@amyeroberts
Collaborator

@parambharat Can you rebase on main to include any upstream changes?

jla524 and others added 4 commits April 3, 2024 20:54
feat: enable mult-idevice for efficientnet
* implement convert_mamba_ssm_checkpoint_to_pytorch

* Add test test_model_from_mamba_ssm_conversion

* moved convert_ssm_config_to_hf_config to inside mamba_ssm_available check

* fix skipif clause

* moved skips to inside test since skipif decorator isn't working for some reason

* Added validation

* removed test

* fixup

* only compare logits

* remove weight rename

* Update src/transformers/models/mamba/convert_mamba_ssm_checkpoint_to_pytorch.py

Co-authored-by: amyeroberts <[email protected]>

* nits

---------

Co-authored-by: amyeroberts <[email protected]>
)

* Defaulted IdeficsProcessor padding to 'longest', removed manual padding

* make fixup

* Defaulted processor call to padding=False

* Add padding to processor call in IdeficsModelIntegrationTest as well

* Defaulted IdeficsProcessor padding to 'longest', removed manual padding

* make fixup

* Defaulted processor call to padding=False

* Add padding to processor call in IdeficsModelIntegrationTest as well

* redefaulted padding=longest again

* fixup/doc
* changes

* addressing comments

* smol fix
@amyeroberts
Collaborator

@parambharat Could you try rebasing again? We can't merge until all the tests are 🟢. At the moment, the failures don't seem related to this PR, but I haven't observed them failing on other open PRs.

ydshieh and others added 19 commits April 5, 2024 09:06
Add whisper

Co-authored-by: ydshieh <[email protected]>
…0044)

skip test_encode_decode_fast_slow_all_tokens for now

Co-authored-by: ydshieh <[email protected]>
huggingface#29722)

* if output is tuple like facebook/hf-seamless-m4t-medium, waveform is the first element

Signed-off-by: Wang, Yi <[email protected]>

* add test and fix batch issue

Signed-off-by: Wang, Yi <[email protected]>

* add dict output support for seamless_m4t

Signed-off-by: Wang, Yi <[email protected]>

---------

Signed-off-by: Wang, Yi <[email protected]>
* fix mixtral onnx export

* fix qwen model
* Add image processor to trainer

* Replace tokenizer=image_processor everywhere
# Conflicts:
#	src/transformers/integrations/integration_utils.py
# Conflicts:
#	src/transformers/integrations/integration_utils.py
# Conflicts:
#	src/transformers/integrations/integration_utils.py
@parambharat
Contributor Author

parambharat commented Apr 9, 2024

@amyeroberts, @muellerzr : It looks like rebasing with main pulled a bunch of commits and changes into the branch. This was unintentional. I'm closing this PR and raising a new one with just the changes relevant to the integration.

@parambharat
Contributor Author

@amyeroberts and @muellerzr : Please find the new PR here

@parambharat parambharat deleted the wandb/callback-upgrade branch April 9, 2024 05:02
@amyeroberts
Collaborator

@parambharat For future PRs, when many commits are added like this, it normally indicates that a force push wasn't used when pushing the rebased branch to the remote. As rebasing effectively rewrites the history, it's necessary to do git push -f for the rebase changes to be reflected.
