Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 1.9.latest] Microbatch first last batch serial #11107

Merged
merged 1 commit into from
Dec 9, 2024

Conversation

github-actions[bot]
Copy link
Contributor

@github-actions github-actions bot commented Dec 7, 2024

Backport 03fdb4c from #11072.

* microbatch: split out first and last batch to run in serial

* only run pre_hook on first batch, post_hook on last batch

* refactor: internalize parallel to RunTask._submit_batch

* Add optional `force_sequential` to `_submit_batch` to allow for skipping parallelism check

* Force last batch to run sequentially

* Force first batch to run sequentially

* Remove batch_idx check in `should_run_in_parallel`

`should_run_in_parallel` shouldn't, and no longer needs to, take into
consideration where in batch exists in a larger context. The first and
last batch for a microbatch model are now forced to run sequentially
by `handle_microbatch_model`

* Begin skipping batches if first batch fails

* Write custom `on_skip` for `MicrobatchModelRunner` to better handle when batches are skipped

This was necessary specifically because the default on skip set the `X of Y` part
of the skipped log using the `node_index` and the `num_nodes`. If there was 2
nodes and we are on the 4th batch of the second node, we'd get a message like
`SKIPPED 4 of 2...` which didn't make much sense. We're likely in a future commit
going to add a custom event for logging the start, result, and skipping of batches
for better readability of the logs.

* Add microbatch pre-hook, post-hook, and sequential first/last batch tests

* Fix/Add tests around first batch failure vs latter batch failure

* Correct MicrobatchModelRunner.on_skip to handle skipping the entire node

Previously `MicrobatchModelRunner.on_skip` only handled when a _batch_ of
the model was being skipped. However, that method is also used when the
entire microbatch model is being skipped due to an upstream node error. Because
we previously _weren't_ handling this second case, it'd cause an unhandled
runtime exception. Thus, we now need to check whether we're running a batch or not,
and there is no batch, then use the super's on_skip method.

* Correct conditional logic for setting pre- and post-hooks for batches

Previously we were doing an if+elif for setting pre- and post-hooks
for batches, where in the `if` matched if the batch wasn't the first
batch, and the `elif` matched if the batch wasn't the last batch. The
issue with this is that if the `if` was hit, the `elif` _wouldn't_ be hit.
This caused the first batch to appropriately not run the `post-hook` but
then every hook after would run the `post-hook`.

* Add two new event types `LogStartBatch` and `LogBatchResult`

* Update MicrobatchModelRunner to use new batch specific log events

* Fix event testing

* Update microbatch integration tests to catch batch specific event types

---------

Co-authored-by: Quigley Malcolm <[email protected]>
(cherry picked from commit 03fdb4c)
@github-actions github-actions bot requested a review from a team as a code owner December 7, 2024 18:43
@cla-bot cla-bot bot added the cla:yes label Dec 7, 2024
@QMalcolm
Copy link
Contributor

QMalcolm commented Dec 7, 2024

Closing and re-opening to get github actions to behave

@QMalcolm QMalcolm closed this Dec 7, 2024
@QMalcolm QMalcolm reopened this Dec 7, 2024
Copy link

codecov bot commented Dec 7, 2024

Codecov Report

Attention: Patch coverage is 98.38710% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.12%. Comparing base (4e74e69) to head (5a42696).
Report is 1 commits behind head on 1.9.latest.

Additional details and impacted files
@@              Coverage Diff               @@
##           1.9.latest   #11107      +/-   ##
==============================================
- Coverage       89.19%   89.12%   -0.08%     
==============================================
  Files             183      183              
  Lines           23798    23853      +55     
==============================================
+ Hits            21226    21258      +32     
- Misses           2572     2595      +23     
Flag Coverage Δ
integration 86.43% <96.77%> (-0.15%) ⬇️
unit 62.08% <35.48%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Unit Tests 62.08% <35.48%> (-0.06%) ⬇️
Integration Tests 86.43% <96.77%> (-0.15%) ⬇️

Copy link
Contributor

@QMalcolm QMalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reading the code, nothing worrying has jumped out at me. Send it! 🚀

@QMalcolm
Copy link
Contributor

QMalcolm commented Dec 9, 2024

Moving forward with merge of this backport to assure it makes it into 1.9.0 GA release

@QMalcolm QMalcolm merged commit 9f5f002 into 1.9.latest Dec 9, 2024
59 of 61 checks passed
@QMalcolm QMalcolm deleted the backport-11072-to-1.9.latest branch December 9, 2024 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants