Fix MPS with sequence loss #2834

JAEarly · 2024-01-10T16:12:28Z

What does this PR do?

TLDR
Fixes the trainer eval loop for MPS devices when the model output is a list of tensors rather than a torch.Tensor or Mapping.

Related
Similar issue but with dict: #2632
PR for above issue: #2706

Details
In the trainer evaluation loop, model outputs (self.state.outputs) are moved to the CPU if they are on an MPS device (to avoid torchvision numerical errors on MPS devices). Currently this works fine if self.state.outputs is a Mapping or a torch.Tensor. However, it fails for Sequence types (e.g. a list of torch.Tensors).

According to the State class, outputs is expected to be of type torch.Tensor | Sequence[torch.Tensor], so the eval loop should support Sequence types here. Strictly, outputs should not be a Mapping, even though the eval operation supports one. I have left the Mapping support in place for now but am happy to revisit.

Because this issue was a little difficult to debug, I have added an error message that should make it clearer when an invalid output type is used (one that cannot correctly be moved from MPS to CPU).
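The behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in composer/trainer/trainer.py: the helper names outputs_to_cpu and _is_tensor_like are hypothetical, and FakeTensor is a stand-in for torch.Tensor so that the sketch runs without PyTorch installed. The ValueError branch follows the description in this comment; a later commit in this PR removes that raise.

```python
from collections.abc import Mapping, Sequence
from typing import Any


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, device: str) -> None:
        self.device = device

    def cpu(self) -> "FakeTensor":
        # Like torch.Tensor.cpu(), return a copy that lives on the CPU.
        return FakeTensor("cpu")


def _is_tensor_like(value: Any) -> bool:
    # Assumption: anything with a callable .cpu() counts as a tensor here.
    # Real code would more likely check isinstance(value, torch.Tensor).
    return callable(getattr(value, "cpu", None))


def outputs_to_cpu(outputs: Any) -> Any:
    """Return `outputs` with every tensor-like value moved to the CPU."""
    if _is_tensor_like(outputs):
        return outputs.cpu()
    if isinstance(outputs, Mapping):
        # Non-tensor values are passed through unchanged.
        return {k: v.cpu() if _is_tensor_like(v) else v for k, v in outputs.items()}
    # str and bytes are Sequences too, but they are not containers of tensors.
    if isinstance(outputs, Sequence) and not isinstance(outputs, (str, bytes)):
        return [v.cpu() if _is_tensor_like(v) else v for v in outputs]
    raise ValueError(
        f"Cannot move outputs of type {type(outputs).__name__} to the CPU; "
        "expected a tensor, a Mapping, or a Sequence."
    )


# Usage: a list of tensors on an MPS device, the case this PR addresses.
moved = outputs_to_cpu([FakeTensor("mps"), FakeTensor("mps")])
print([t.device for t in moved])
```

Note that this sketch always returns a list for Sequence inputs, so a tuple of tensors comes back as a list; whether that matters depends on what consumes the outputs downstream.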

Before submitting

Have you read the contributor guidelines?
Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
Did you update any related docs and document your change?
Did you update any related tests and add any new tests related to your change? (see testing)
Did you run the tests locally to make sure they pass?
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

JAEarly · 2024-01-10T17:21:24Z

Upon further inspection, eval_forward can, strictly speaking, return a value of any type (Any), so keeping support for Mapping makes sense. Adding support for self.state.outputs being None in this PR may also be useful.

JAEarly · 2024-01-19T11:36:54Z

@mvpatel2000 this is a relatively small change but is fairly annoying for running stuff on MPS, could it be merged in please? Thank you!

dakinggg

Thanks for the fix! I left one small comment, and happy to approve if that makes sense to you!

composer/trainer/trainer.py

dakinggg

Thanks!

mvpatel2000 · 2024-01-20T02:31:16Z

@JAEarly apologies for the delay in review! Next time please feel free to tag me as a reviewer. I will see if we can set up auto-tagging as well. We didn't see this earlier :(

* Add MPS support for list outputs in training eval loop
* Add error for invalid state outputs type in trainer
* Remove raise ValueError in trainer eval loop

---------

Co-authored-by: Daniel King <[email protected]>

JAEarly force-pushed the eval_loop_mps branch 2 times, most recently from 31d92e8 to b945c5a · January 18, 2024 10:18

JAEarly marked this pull request as draft January 19, 2024 11:30

JAEarly marked this pull request as ready for review January 19, 2024 11:30

JAEarly added 2 commits January 19, 2024 11:31

Add MPS support for list outputs in training eval loop

4a54f98

Add error for invalid state outputs type in trainer

1dd8bc0

JAEarly force-pushed the eval_loop_mps branch from b945c5a to 1dd8bc0 · January 19, 2024 11:31

Merge branch 'dev' into eval_loop_mps

b7ab6ca

dakinggg reviewed Jan 19, 2024

composer/trainer/trainer.py (outdated, resolved)

Remove raise ValueError in trainer eval loop

6a4d1f3

dakinggg approved these changes Jan 19, 2024

mvpatel2000 approved these changes Jan 20, 2024

Merge branch 'dev' into eval_loop_mps

d475e28

JAEarly requested review from dakinggg and mvpatel2000 January 22, 2024 10:00

mvpatel2000 merged commit 1df5557 into mosaicml:dev Jan 22, 2024
17 checks passed
