GPU CI should fail on hanging tests. #6385

Closed
vanbasten23 opened this issue Jan 25, 2024 · 4 comments
@vanbasten23

🐛 Bug

In #6247, we found that when tests hang the GPU CI still showed green. This is unexpected.

To Reproduce

Apply the change in #6247. The GPU CI passes, but the GPU CI log contains an error message and the tests hang when run locally.

Expected behavior

When tests hang, GPU CI should fail.

Environment

  • Reproducible on XLA backend [CPU/TPU]: GPU
  • torch_xla version: nightly.

Additional context

@vanbasten23 vanbasten23 self-assigned this Jan 25, 2024
@vanbasten23

vanbasten23 commented Feb 3, 2024

I was able to reproduce the issue in PR #6383, as well as by running the following bash script locally:

#!/bin/bash

PJRT_DEVICE=CUDA GPU_NUM_DEVICES=2 python pytorch/xla/test/pjrt/test_runtime.py
PJRT_DEVICE=CUDA GPU_NUM_DEVICES=2 python pytorch/xla/test/pjrt/test_runtime_gpu.py
PJRT_DEVICE=CUDA GPU_NUM_DEVICES=2 python pytorch/xla/test/pjrt/test_runtime_multi_cpu.py

Both runs produced output (linked in the original comment). The output shows that the exit code is 0 (success).
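A hang can be turned into a failure by the harness itself by running each test command under a timeout. The following is a minimal sketch of that idea, not the actual CI change: `run_with_timeout`, `overall_exit_code`, and the exit code 124 are hypothetical names and choices made for illustration.

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # hypothetical choice; any non-zero value would do


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or TIMEOUT_EXIT_CODE if it hangs."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return TIMEOUT_EXIT_CODE


def overall_exit_code(codes):
    """Return 1 if any command failed or timed out, else 0."""
    return 1 if any(code != 0 for code in codes) else 0


# Stand-in commands: one finishes at once, one simulates a hanging test.
quick = [sys.executable, "-c", "pass"]
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
codes = [run_with_timeout(quick, 10), run_with_timeout(hanging, 1)]
print(codes, overall_exit_code(codes))
```

A CI wrapper could pass `overall_exit_code(codes)` to `sys.exit`. Note that a timeout only catches the hang; it does not help when a process finishes with exit code 0 despite an error during shutdown.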

@vanbasten23

The issue happens in

def _prepare_to_exit():
  device = _XLAC._xla_get_default_device()
  _XLAC._set_all_reduce_token(device, None)
  _XLAC._prepare_to_exit()
  if int(os.environ.get('PT_XLA_DEBUG', '0')):
    _summarize_fn_tracker()

where everything runs in the atexit callback.
A runtime exception is raised in the callback, but the exit code remains 0, which is why the CI couldn't catch the error.

According to https://bugs.python.org/issue27035, Python 2 sets the expected non-zero exit code when an exception is raised in an atexit callback, but Python 3 does not. The bug still appears to be open.
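The behavior can be checked without torch_xla. The sketch below starts a fresh Python 3 interpreter whose only atexit callback raises a `RuntimeError`, then inspects the child's exit code and stderr. `CHILD_SOURCE`, `failing_callback`, and `run_child` are hypothetical names used only for this illustration.

```python
import subprocess
import sys
import textwrap

# Child program: its only atexit callback raises, like the failing cleanup.
CHILD_SOURCE = textwrap.dedent("""
    import atexit

    def failing_callback():
        raise RuntimeError("stand-in for the failure in the atexit callback")

    atexit.register(failing_callback)
""")


def run_child(source):
    """Run source in a fresh interpreter and return (exit code, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", source], capture_output=True, text=True)
    return proc.returncode, proc.stderr


code, err = run_child(CHILD_SOURCE)
print("exit code:", code)
print("traceback printed:", "RuntimeError" in err)
```

The traceback reaches stderr, but the interpreter still reports success, which matches what the CI saw.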

@vanbasten23
Copy link
Collaborator Author

Even if I make the atexit callback raise a SystemExit exception, such as

def _prepare_to_exit():
  try:
    device = _XLAC._xla_get_default_device()
    _XLAC._set_all_reduce_token(device, None)
    _XLAC._prepare_to_exit()
    if int(os.environ.get('PT_XLA_DEBUG', '0')):
      _summarize_fn_tracker()
  except Exception as e: # type(e)= <class 'RuntimeError'>
    raise SystemExit(1)

the exit status is still 0. See output.
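For comparison, the sketch below runs the same kind of callback twice in a child interpreter: once re-raising as `SystemExit(1)` and once calling `os._exit(1)`. Using `os._exit` is an assumption here, shown as one possible workaround; this thread does not confirm that it is what PR #6383 does. `TEMPLATE` and `exit_code_for` are hypothetical names.

```python
import subprocess
import sys
import textwrap

# Child program; {action} is replaced by the statement run in the except branch.
TEMPLATE = textwrap.dedent("""
    import atexit, os, sys

    def callback():
        try:
            raise RuntimeError("stand-in for the failing cleanup")
        except Exception:
            sys.stderr.flush()
            {action}

    atexit.register(callback)
""")


def exit_code_for(action):
    """Run the child with the given except-branch action; return its exit code."""
    source = TEMPLATE.format(action=action)
    return subprocess.run(
        [sys.executable, "-c", source], capture_output=True).returncode


codes = {a: exit_code_for(a) for a in ("raise SystemExit(1)", "os._exit(1)")}
print(codes)
```

`os._exit` terminates the process immediately, so any remaining atexit handlers are skipped and buffered output is not flushed. That is why the sketch flushes stderr first.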

@vanbasten23

With the change in PR #6383, the GPU CI fails as expected after we apply the offending change #6247 (image), and the CI log shows the exception detail: https://gist.github.com/vanbasten23/d8650b07df4af347447a43d8362f0ef5
