Add --megascale_abort_on_hangs flag for multi-slice TPU jobs #731

Open · mugithi wants to merge 1 commit into main

Conversation

@mugithi commented Oct 3, 2024

@Ethanlm @markblee PTAL

* Introduce a flag to terminate jobs on MegaScale runtime errors (see the sketch after this list)
    * Enable auto-restart of jax process when errors occur
    * Prevent silent hangs in multi-slice TPU configurations
    * Reduce time to recovery for failed jobs
    * ref: apple#716
    * co-authored by Nick Stogner <[email protected]>
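
For context on the mechanism this flag relies on, here is a minimal sketch, assuming the flag is ultimately forwarded to libtpu through the LIBTPU_INIT_ARGS environment variable (megascale_abort_on_hangs is a runtime flag, not an xla_-prefixed compiler option). The helper and option dict below are illustrative, not the actual axlearn code in this PR.

```python
import os

def build_libtpu_init_args(options: dict[str, str]) -> str:
    """Formats flag/value pairs as the space-separated string libtpu parses at init."""
    return " ".join(f"--{name}={value}" for name, value in options.items())

# Illustrative option set; megascale_abort_on_hangs is what this PR adds.
runtime_options = {
    # Abort (and let the process be restarted) instead of hanging silently
    # when a MegaScale runtime error occurs on a multi-slice job.
    "megascale_abort_on_hangs": "true",
}

# libtpu reads these once at initialization, so this must run before JAX
# creates the TPU backend (e.g. before the first jax.devices() call).
os.environ["LIBTPU_INIT_ARGS"] = build_libtpu_init_args(runtime_options)
```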
@Ethanlm (Contributor) commented Oct 3, 2024

Please don't merge yet. Kyle is helping us test this.

@kyle-google commented

> Please don't merge yet. Kyle is helping us test this.

Tested by scheduling a multi-slice v5p job in the internal environment test area. The job was able to make progress, and the flag was set for the job using Isaack's branch of axlearn.

# enabling this flag will allow for termination of the job, triggering
# the process to exit. This is set to true to prevent the job from
# silently hanging and to reduce time to recovery.
megascale_abort_on_hangs="true",

Is it an XLA flag? Curious, since other XLA flags have an xla_ prefix.

This is not an XLA compiler flag, but rather a libtpu runtime flag. As long as it is eventually passed into LIBTPU_INIT_ARGS, it should work.

@apghml (Contributor) commented Oct 9, 2024

IIRC, this won't work with AOT compilation. Could you test the AOT compilation script run_aot_compilation.py to confirm?
The reason I ask is that the other megascale flags I have used don't work with AOT compilation.
If it doesn't work with AOT compilation, we can move the megascale flag to launch.py.
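
For reference, a minimal sketch of the launch.py alternative suggested here, assuming the launcher starts the training command as a subprocess; launch_with_megascale_flag and its arguments are hypothetical names, not axlearn's actual launch.py API.

```python
import os
import subprocess

def launch_with_megascale_flag(cmd: list[str], abort_on_hangs: bool = True) -> int:
    """Injects the libtpu runtime flag via LIBTPU_INIT_ARGS before launching training."""
    env = dict(os.environ)
    flag = f"--megascale_abort_on_hangs={'true' if abort_on_hangs else 'false'}"
    # Append to any flags the user already set rather than overwriting them.
    env["LIBTPU_INIT_ARGS"] = f"{env.get('LIBTPU_INIT_ARGS', '')} {flag}".strip()
    return subprocess.run(cmd, env=env, check=False).returncode
```

Injecting the flag at launch time would keep it out of the compile-options path entirely, so AOT compilation would see the same options as before.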

axlearn/common/compiler_options.py (review thread resolved)
@apghml (Contributor) commented Oct 9, 2024

Also, do we know if there is a list of libtpu-only (non-XLA) flags, maybe with a brief description of what they do?

@apghml (Contributor) commented Oct 9, 2024

BTW, thanks a lot for working on this! Getting the hanging situation improved is super valuable.

@Ethanlm (Contributor) commented Oct 18, 2024

Based on recent discussion,
https://chat.google.com/room/AAAAE7IGW88/3qZf4tP48RU/m5MskM43z4o?cls=10
https://chat.google.com/room/AAAAE7IGW88/3qZf4tP48RU/AXtt8F5CztM?cls=10
it looks like we should not enable this flag.

With jax 0.4.33 we built some error aggregation into the coordinator to help identify a bad TPU host, for example, but it only works with megascale_abort_on_hangs=false, because we need all the workers to report their errors to the coordinator.
