
feat: Upgrade Weights & Biases callback #29125

Closed

Conversation

parambharat
Contributor

What does this PR do?

This PR adds a few new functionalities to the Weights & Biases callback:

  • Logs the PEFT and LoRA config to wandb if present

  • Adds model parameter counts to the wandb config and artifact metadata

  • Adds an on_predict method to log prediction metrics

  • Prints the model architecture to a file logged alongside the wandb artifact

  • Logs the initial and final models to the wandb artifact for full reproducibility

  • Adds step and epoch aliases to checkpoint artifacts

  • Here's a link to what the logged artifacts look like

  • Here's a run overview page with the added config and metadata for a run with PEFT configs logged
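The PEFT-config logging in the first bullet amounts to reading the model's peft_config attribute when it exists. A minimal sketch of the idea (collect_peft_config is a hypothetical helper name, and the PEFT-wrapped model is duck-typed here; the actual callback code lives in transformers' integration_utils):

```python
def collect_peft_config(model) -> dict:
    """Return the PEFT/LoRA adapter configs to log, or {} for non-PEFT models."""
    if not hasattr(model, "peft_config"):
        return {}
    # peft_config maps adapter name -> PeftConfig; plain dicts also work here
    return {
        name: cfg.to_dict() if hasattr(cfg, "to_dict") else cfg
        for name, cfg in model.peft_config.items()
    }
```

In the callback, the returned dict would be merged into the wandb run config alongside the trainer arguments.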

Before submitting

Who can review?

Contributor

@muellerzr muellerzr left a comment

Thanks! Overall looks good to me

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks, I have a few small questions but it's good otherwise!

logger.info("Logging model artifacts. ...")
model_name = (
    f"model-{self._wandb.run.id}"
    if (args.run_name is None or args.run_name == args.output_dir)
    else f"model-{self._wandb.run.name}"
)
# add the model architecture to a separate text file
Collaborator

I am not 100% sure I understand why the architecture would change over time during training?

Contributor Author

I am not 100% sure I understand why the architecture would change over time during training.

@ArthurZucker : Here, we log the model architecture to a text file in the model artifact. Some of our users have requested this. This helps to audit and reproduce the experiment easily.

Comment on lines -824 to +898
-    f"checkpoint-{self._wandb.run.id}"
+    f"model-{self._wandb.run.id}"
     if (args.run_name is None or args.run_name == args.output_dir)
-    else f"checkpoint-{self._wandb.run.name}"
+    else f"model-{self._wandb.run.name}"
Collaborator

is that not breaking? could we keep model and aliases=["model"] ?

Contributor Author

is that not breaking? could we keep model and aliases=["model"] ?

@ArthurZucker : This doesn't break existing functionality. Currently, we create separate Weights & Biases Artifacts for models and checkpoints. However, this results in multiple artifacts for the same experiment. This instead creates a single artifact with model and checkpoint aliases, allowing the users to log models and checkpoints to the same artifact.
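The single-artifact scheme described here can be sketched as a tiny helper (checkpoint_aliases is a hypothetical name; the exact alias strings in the PR evolved during review):

```python
def checkpoint_aliases(epoch: float, global_step: int) -> list:
    """Aliases attached when a checkpoint version is logged to the model artifact."""
    return ["checkpoint", f"epoch_{round(epoch, 2)}", f"global_step_{global_step}"]
```

Each checkpoint then becomes a new version of the same model-{run_id} artifact, addressable by epoch or step, instead of creating a second checkpoint-{run_id} artifact.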

Comment on lines +906 to +913
def on_predict(self, args, state, control, metrics, **kwargs):
    if self._wandb is None:
        return
    if not self._initialized:
        self.setup(args, state, **kwargs)
    if state.is_world_process_zero:
        metrics = rewrite_logs(metrics)
        self._wandb.log(metrics)
Collaborator

could you explain why we need this?

Contributor Author

could you explain why we need this?

@ArthurZucker : This allows users to log model predictions and metrics to the W&B dashboard when they use trainer.predict. This is quite useful in evaluation runs.
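For context, rewrite_logs namespaces metric keys so the W&B dashboard groups them into panels; a simplified stand-in (not the exact transformers implementation):

```python
def rewrite_logs_sketch(logs: dict) -> dict:
    """Prefix metric keys so W&B groups them into train/eval/test panels."""
    out = {}
    for key, value in logs.items():
        if key.startswith("eval_"):
            out["eval/" + key[len("eval_"):]] = value
        elif key.startswith("test_"):
            out["test/" + key[len("test_"):]] = value
        else:
            out["train/" + key] = value
    return out
```

With this, metrics returned by a prediction run land under a test/ prefix in the dashboard rather than mixing with training metrics.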

@parambharat
Contributor Author

@muellerzr and @ArthurZucker : Thank you for the review and comments. I have responded to the review comments with some explanations. Let me know if it looks good.

@muellerzr
Contributor

@parambharat can you resolve the conflict? Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

# Conflicts:
#	src/transformers/integrations/integration_utils.py
@parambharat
Contributor Author

@muellerzr : Resolved 👍

@muellerzr muellerzr requested a review from amyeroberts March 13, 2024 20:13
@muellerzr
Contributor

cc @amyeroberts for a final review as she's on watch this week 🤗

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for working on this!

Just a few small-ish questions and comments


     ckpt_dir = f"checkpoint-{state.global_step}"
     artifact_path = os.path.join(args.output_dir, ckpt_dir)
     logger.info(f"Logging checkpoint artifacts in {ckpt_dir}. ...")
     checkpoint_name = (
-        f"checkpoint-{self._wandb.run.id}"
+        f"model-{self._wandb.run.id}"
Collaborator

This doesn't match the logging message above in L895

Contributor Author

While we name the artifact as model, for checkpoints we add an additional alias checkpoint. This ensures that an artifact for a single run includes both model and checkpoints. Here's an example with different versions

[screenshot: artifact versions showing the model and checkpoint aliases]

     )
     artifact = self._wandb.Artifact(name=checkpoint_name, type="model", metadata=checkpoint_metadata)
     artifact.add_dir(artifact_path)
-    self._wandb.log_artifact(artifact, aliases=[f"checkpoint-{state.global_step}"])
+    self._wandb.log_artifact(
+        artifact, aliases=["checkpoint", f"epoch_{round(state.epoch, 2)}", f"global_step_{state.global_step}"]
Collaborator

Underscores and hyphens are used inconsistently for the aliases, e.g. final-model versus here. We should pick one or the other.

Contributor Author

Fixed. Using underscores consistently across names and aliases now.

Comment on lines 865 to 866
elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)
Collaborator

I think this means that some modules won't be printed out - this would only work for PEFT models

Contributor Author

Currently, we intend to support logging architectures for PreTrainedModel, TFPreTrainedModel, and PEFT models. The current logic handles this. We can add Flax and other model types based on usage and user feature requests.

type="model",
metadata={
    "model_config": model.config.to_dict() if hasattr(model, "config") else None,
    "num_parameters": model.num_parameters(),
Collaborator

As per the logic on L769, we aren't guaranteed the model has this property

Contributor Author

@parambharat parambharat Mar 21, 2024

Good catch. Handled this and changed it to use config.get instead.
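The guard being discussed can be sketched like this (a hedged illustration of the idea rather than the exact diff; artifact_metadata is a hypothetical name):

```python
def artifact_metadata(model) -> dict:
    """Build artifact metadata without assuming optional attributes exist."""
    metadata = {
        "model_config": model.config.to_dict() if hasattr(model, "config") else None,
    }
    # only include the parameter count when the model actually exposes it
    if hasattr(model, "num_parameters"):
        metadata["num_parameters"] = model.num_parameters()
    return metadata
```

The point is that plain torch modules without a transformers-style config or num_parameters() no longer raise an AttributeError when the artifact is built.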

Comment on lines 798 to 799
elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)
Collaborator

Same comment here about PEFT models

Contributor Author

Same as above. Currently, we intend to support logging architectures for PreTrainedModel, TFPreTrainedModel, and PEFT models; the current logic handles this. We can add Flax and other model types based on usage and user feature requests.

)
model.save_pretrained(temp_dir)
# add the architecture to a separate text file
with open(f"{temp_dir}/model_architecture.txt", "w+") as f:
Collaborator

This logic is a repeat of L856-L866; it should be abstracted out to a utility function.

Contributor Author

Thanks. Have abstracted it out to a separate function save_model_architecture_to_file

elif isinstance(model, (torch.nn.Module, PushToHubMixin)) and hasattr(model, "base_model"):
    print(model, file=f)

for f in Path(temp_dir).glob("*"):
Collaborator

This seems like it could be really slow: taking all the files in the current directory and saving them out as artefacts.

Contributor Author

This should not be an issue; it is how the current artifact logging works. While the files are added to the artifact, the upload happens asynchronously via the wandb service, which allows the experiment to continue while the artifact files are uploaded to W&B.

Also, having all the files in a model/checkpoint allows users to make reproducible runs.
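For reference, Path.glob("*") is a single non-recursive scan of the checkpoint directory's top-level entries; a self-contained demo using a temporary directory:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # stand-ins for a saved checkpoint's contents
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"\x00")

    # each entry would be added to the wandb artifact; the actual upload
    # happens asynchronously in the wandb service process
    files = sorted(p.name for p in Path(tmp).glob("*"))
```

So the synchronous cost in the training process is just the directory listing, not the network transfer.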

@parambharat
Contributor Author

@amyeroberts : Thanks for the detailed comments and review. I have addressed all the comments and responded to them. Please take a look and let me know if there is anything pending.

@parambharat
Contributor Author

Hi @amyeroberts, just checking in on this PR. Let me know if you have any further comments, and whether my responses to the review comments address them so they can be resolved.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for iterating on this!

I'm still not 100% sure about some aspects, e.g. saving out the model architecture at the end of training. However, as there have been demonstrated successful runs with outputs, and integrations are maintained by contributors rather than the transformers team, I'm happy for this to be merged, with any other maintenance / follow-up handled by the W&B team.

     )
     artifact = self._wandb.Artifact(name=checkpoint_name, type="model", metadata=checkpoint_metadata)
     artifact.add_dir(artifact_path)
-    self._wandb.log_artifact(artifact, aliases=[f"checkpoint-{state.global_step}"])
+    self._wandb.log_artifact(
+        artifact, aliases=[f"epoch_{round(state.epoch, 2)}", f"checkpoint_global_step_{state.global_step}"]
Collaborator

Still inconsistent here with - vs _

@amyeroberts
Collaborator

@parambharat Can you rebase on main to include any upstream changes?

jla524 and others added 4 commits April 3, 2024 20:54
feat: enable mult-idevice for efficientnet
* implement convert_mamba_ssm_checkpoint_to_pytorch

* Add test test_model_from_mamba_ssm_conversion

* moved convert_ssm_config_to_hf_config to inside mamba_ssm_available check

* fix skipif clause

* moved skips to inside test since skipif decorator isn't working for some reason

* Added validation

* removed test

* fixup

* only compare logits

* remove weight rename

* Update src/transformers/models/mamba/convert_mamba_ssm_checkpoint_to_pytorch.py

Co-authored-by: amyeroberts <[email protected]>

* nits

---------

Co-authored-by: amyeroberts <[email protected]>
)

* Defaulted IdeficsProcessor padding to 'longest', removed manual padding

* make fixup

* Defaulted processor call to padding=False

* Add padding to processor call in IdeficsModelIntegrationTest as well

* Defaulted IdeficsProcessor padding to 'longest', removed manual padding

* make fixup

* Defaulted processor call to padding=False

* Add padding to processor call in IdeficsModelIntegrationTest as well

* redefaulted padding=longest again

* fixup/doc
* changes

* addressing comments

* smol fix
@amyeroberts
Collaborator

@parambharat Could you try rebasing again? We can't merge until all the tests are 🟢. At the moment, the failures don't seem related to this PR, but I haven't observed them failing on other open PRs.

ydshieh and others added 19 commits April 5, 2024 09:06
Add whisper

Co-authored-by: ydshieh <[email protected]>
…0044)

skip test_encode_decode_fast_slow_all_tokens for now

Co-authored-by: ydshieh <[email protected]>
huggingface#29722)

* if output is tuple like facebook/hf-seamless-m4t-medium, waveform is the first element

Signed-off-by: Wang, Yi <[email protected]>

* add test and fix batch issue

Signed-off-by: Wang, Yi <[email protected]>

* add dict output support for seamless_m4t

Signed-off-by: Wang, Yi <[email protected]>

---------

Signed-off-by: Wang, Yi <[email protected]>
* fix mixtral onnx export

* fix qwen model
* Add image processor to trainer

* Replace tokenizer=image_processor everywhere
# Conflicts:
#	src/transformers/integrations/integration_utils.py
# Conflicts:
#	src/transformers/integrations/integration_utils.py
# Conflicts:
#	src/transformers/integrations/integration_utils.py
@parambharat
Contributor Author

parambharat commented Apr 9, 2024

@amyeroberts, @muellerzr : It looks like rebasing with main pulled a bunch of commits and changes into the branch. This was unintentional. I'm closing this PR and raising a new one with just the changes relevant to the integration.

@parambharat
Contributor Author

@amyeroberts and @muellerzr : Please find the new PR here

@parambharat parambharat deleted the wandb/callback-upgrade branch April 9, 2024 05:02
@amyeroberts
Collaborator

@parambharat For future PRs, when many commits are added like this, it normally indicates that a force push wasn't used when pushing the rebased branch to the remote. As rebasing effectively rewrites the history, it's necessary to do git push -f for the rebase changes to be reflected.
