feat: Upgrade Weights & Biases callback #29125
Conversation
Thanks! Overall looks good to me
Thanks, have a few small questions but good otherwise!
```python
logger.info("Logging model artifacts. ...")
model_name = (
    f"model-{self._wandb.run.id}"
    if (args.run_name is None or args.run_name == args.output_dir)
    else f"model-{self._wandb.run.name}"
)
# add the model architecture to a separate text file
```
I am not 100% sure I understand why the architecture would change over time during training?
@ArthurZucker : Here, we log the model architecture to a text file inside the model artifact. Some of our users have requested this; it makes the experiment easier to audit and reproduce.
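For illustration, a minimal sketch of the idea being described (the helper name and file handling here are assumptions, not the PR's exact code):

```python
# Minimal sketch, assuming an already-created wandb Artifact for the model.
# `Artifact.new_file` streams the file straight into the artifact.
import wandb

def add_architecture_file(artifact: wandb.Artifact, model) -> None:
    with artifact.new_file("model_architecture.txt", mode="w") as f:
        # str(model) prints the full module tree for a PyTorch model
        print(model, file=f)
```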
f"checkpoint-{self._wandb.run.id}" | ||
f"model-{self._wandb.run.id}" | ||
if (args.run_name is None or args.run_name == args.output_dir) | ||
else f"checkpoint-{self._wandb.run.name}" | ||
else f"model-{self._wandb.run.name}" |
Is that not breaking? Could we keep `model` and `aliases=["model"]`?
@ArthurZucker : This doesn't break existing functionality. Currently, we create separate Weights & Biases Artifacts for models and checkpoints, which results in multiple artifacts for the same experiment. This change instead creates a single artifact with `model` and `checkpoint` aliases, allowing users to log models and checkpoints to the same artifact.
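A rough sketch of the aliasing scheme being described (the project and local paths below are made up; the real callback derives the artifact name from the run as shown in the diff above):

```python
# Hypothetical sketch: log checkpoints and the final model into one artifact
# lineage, distinguished only by aliases, instead of two separate artifacts.
import wandb

run = wandb.init(project="demo")

checkpoint = wandb.Artifact(name=f"model-{run.id}", type="model")
checkpoint.add_dir("output/checkpoint-500")                  # assumed local path
run.log_artifact(checkpoint, aliases=["checkpoint", "global_step_500"])

final = wandb.Artifact(name=f"model-{run.id}", type="model")
final.add_dir("output")                                      # assumed local path
run.log_artifact(final, aliases=["model", "final_model"])    # new version, same lineage

run.finish()
```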
```python
def on_predict(self, args, state, control, metrics, **kwargs):
    if self._wandb is None:
        return
    if not self._initialized:
        self.setup(args, state, **kwargs)
    if state.is_world_process_zero:
        metrics = rewrite_logs(metrics)
        self._wandb.log(metrics)
```
could you explain why we need this?
@ArthurZucker : This allows users to log model predictions and metrics to the W&B dashboard when they use `Trainer.predict`. This is quite useful in evaluation runs.
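For context, a hedged usage sketch (it assumes a `Trainer` already configured with `report_to="wandb"` and a `test_dataset`; this is not code from the PR):

```python
# With the W&B callback attached, Trainer.predict fires on_predict, and the
# returned metrics are rewritten and sent to the dashboard via wandb.log.
results = trainer.predict(test_dataset, metric_key_prefix="test")
print(results.metrics)  # e.g. {"test_loss": ..., "test_runtime": ...}
```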
@muellerzr and @ArthurZucker : Thank you for the review and comments. I have responded to the review comments with some explanations. Let me know if it looks good.
@parambharat can you resolve the conflict? Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@muellerzr : Resolved 👍
cc @amyeroberts for a final review as she's on watch this week 🤗
Thanks for working on this!
Just a few small-ish questions and comments
```diff
 ckpt_dir = f"checkpoint-{state.global_step}"
 artifact_path = os.path.join(args.output_dir, ckpt_dir)
 logger.info(f"Logging checkpoint artifacts in {ckpt_dir}. ...")
 checkpoint_name = (
-    f"checkpoint-{self._wandb.run.id}"
+    f"model-{self._wandb.run.id}"
```
This doesn't match the logging message above in L895
```diff
 )
 artifact = self._wandb.Artifact(name=checkpoint_name, type="model", metadata=checkpoint_metadata)
 artifact.add_dir(artifact_path)
-self._wandb.log_artifact(artifact, aliases=[f"checkpoint-{state.global_step}"])
+self._wandb.log_artifact(
+    artifact, aliases=["checkpoint", f"epoch_{round(state.epoch, 2)}", f"global_step_{state.global_step}"]
```
Underscores and hyphens are used inconsistently for the aliases - e.g. `final-model` versus here. We should pick one or the other.
Fixed. Using underscores consistently across names and aliases now.
```python
elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)
```
I think this means that some modules won't be printed out - this would only work for PEFT models.
Currently, we intend to support logging architectures for `PreTrainedModel`, `TFPreTrainedModel`, and PEFT models, and the current logic handles this. We can add Flax and other model types based on usage and user feature requests.
type="model", | ||
metadata={ | ||
"model_config": model.config.to_dict() if hasattr(model, "config") else None, | ||
"num_parameters": model.num_parameters(), |
As per the logic on L769 - we aren't guaranteed the model has this property
Good catch - handled this and changed it to use `config.get` instead.
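Something along these lines (a sketch of the defensive lookup, not the exact diff):

```python
# Build artifact metadata without assuming every model exposes `config` or
# `num_parameters`; missing fields simply stay out of the metadata dict.
def build_model_metadata(model) -> dict:
    metadata = {}
    if hasattr(model, "config") and model.config is not None:
        metadata["model_config"] = model.config.to_dict()
    if hasattr(model, "num_parameters"):
        metadata["num_parameters"] = model.num_parameters()
    return metadata
```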
```python
elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)
```
Same comment here about PEFT models
Same as above - currently, we intend to support logging architectures for `PreTrainedModel`, `TFPreTrainedModel`, and PEFT models, and the current logic handles this. We can add Flax and other model types based on usage and user feature requests.
```python
)
model.save_pretrained(temp_dir)
# add the architecture to a separate text file
with open(f"{temp_dir}/model_architecture.txt", "w+") as f:
```
This logic is a repeat of L856-L866 - it should be abstracted out to a utility function.
Thanks. Have abstracted it out into a separate function, `save_model_architecture_to_file`.
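Roughly along these lines (a sketch of what such a helper could look like; the signature and type checks in the merged code may differ):

```python
# Assumed helper: write a human-readable architecture dump next to the model
# files so it is picked up into the artifact along with everything else.
import os

def save_model_architecture_to_file(model, output_dir: str) -> None:
    with open(os.path.join(output_dir, "model_architecture.txt"), "w+") as f:
        if hasattr(model, "summary"):
            # TF/Keras-style models expose summary(); redirect each line to the file
            model.summary(print_fn=lambda line, *args, **kwargs: print(line, file=f))
        else:
            # PyTorch / PEFT-wrapped models: str(model) prints the module tree
            print(model, file=f)
```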
```python
elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)

for f in Path(temp_dir).glob("*"):
```
This seems like it could be really slow - taking all the files in the current directory and saving them out as artefacts
This should not be an issue - this is how current artifact logging works. While the files are added to the artifact, the upload happens asynchronously via the wandb service, which allows the experiment to continue while the artifact files are uploaded to W&B.
Also, having all the files in a model/checkpoint artifact allows users to make reproducible runs.
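For reference, a minimal sketch of the pattern under discussion (the directory and artifact names are assumptions):

```python
# Files are attached locally, then log_artifact hands the upload off to the
# wandb service process, so training can continue while the upload runs.
from pathlib import Path
import wandb

def log_dir_as_model_artifact(run, directory: str, name: str, aliases: list) -> None:
    artifact = wandb.Artifact(name=name, type="model")
    for path in Path(directory).glob("*"):
        if path.is_file():
            artifact.add_file(str(path), name=path.name)
    run.log_artifact(artifact, aliases=aliases)  # returns without waiting for the upload
```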
@amyeroberts : Thanks for the detailed comments and review. I have addressed all the comments and responded to them. Please take a look and let me know if there is anything pending.
Hi @amyeroberts, just checking in on this PR. Let me know if you have any comments, and if my responses to the review comments are acceptable, please resolve them.
Thanks for iterating on this!
I'm still not 100% sure about some aspects, e.g. saving out the model arch at the end of training. But as there have been demonstrated successful runs with outputs, and integrations are maintained by contributors rather than the transformers team, I'm happy for this to be merged, with any other maintenance / follow-up handled by the W&B team.
```diff
 )
 artifact = self._wandb.Artifact(name=checkpoint_name, type="model", metadata=checkpoint_metadata)
 artifact.add_dir(artifact_path)
-self._wandb.log_artifact(artifact, aliases=[f"checkpoint-{state.global_step}"])
+self._wandb.log_artifact(
+    artifact, aliases=[f"epoch_{round(state.epoch, 2)}", f"checkpoint_global_step_{state.global_step}"]
```
Still inconsistent here with `-` vs `_`.
@parambharat Can you rebase on main to include any upstream changes?
@parambharat Could you try rebasing again? We can't merge until all the tests are 🟢. At the moment, the failures don't seem related to this PR, but I haven't observed them failing on other open PRs.
@amyeroberts, @muellerzr : It looks like rebasing with main pulled a bunch of commits and changes into the branch. This was unintentional. I'm closing this PR and raising a new one with just the changes relevant to the integration.
@amyeroberts and @muellerzr : Please find the new PR here.
@parambharat For future PRs, when many commits are added like this, it normally indicates that a force push wasn't used when pushing the rebased branch to the remote. As rebasing effectively rewrites the history, it's necessary to force push the rebased branch.
What does this PR do?
This PR adds a few new functionalities to the Weights & Biases callback:
- Logs PEFT and LoRA configs to wandb if present
- Adds model parameter counts to the wandb config and artifact metadata
- Adds an on_predict method to log prediction metrics
- Prints the model architecture to a file alongside the wandb artifact
- Logs initial and final models to the wandb artifact for full reproducibility
- Adds step and epoch aliases to checkpoint artifacts
Here's a link to what the logged artifacts look like.
Here's a run overview page with the added config and metadata for a run with PEFT configs logged.
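For readers trying this out, a hedged sketch of how the callback is enabled (the environment variable values follow the existing W&B integration docs; the exact set of behaviours added by this PR may differ):

```python
# Opt into W&B reporting and checkpoint artifact logging for a Trainer run.
import os
from transformers import TrainingArguments

os.environ["WANDB_PROJECT"] = "my-project"        # assumed project name
os.environ["WANDB_LOG_MODEL"] = "checkpoint"      # log checkpoints as artifacts
args = TrainingArguments(
    output_dir="outputs",
    report_to="wandb",
    run_name="my-experiment",                     # feeds the artifact naming above
    save_strategy="epoch",
)
# Passing `args` to a Trainer then activates the WandbCallback automatically.
```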