Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bad error message when PjRtComputationClient throws exception #5946

Merged
merged 5 commits into from
Dec 12, 2023

Conversation

will-cromar
Copy link
Collaborator

@will-cromar will-cromar commented Nov 29, 2023

The g_computation_client_initialized doesn't get reset in this case, leading to incorrect behavior in GetComputationClientIfInitialized. Make CreateClient anonymous so we can't call it twice by accident.

Example from @jonb377:

ptxla@t1v-n-f494979e-w-0:/workspaces/work/pytorch/xla$ python -c 'import torch_xla; torch_xla._XLAC._xla_get_runtime_devices()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: torch_xla/csrc/runtime/runtime.cc:27 : $PJRT_DEVICE is not set.

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/workspaces/work/pytorch/xla/torch_xla/__init__.py", line 127, in _prepare_to_exit
    _XLAC._prepare_to_exit()
RuntimeError: torch_xla/csrc/runtime/runtime.cc:17 : Check failed: !was_initialized 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()

        torch_xla::runtime::GetComputationClient()


        PyCFunction_Call
        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall
        PyVectorcall_Call

        Py_FinalizeEx
        Py_RunMain
        Py_BytesMain
        __libc_start_main
        _start
*** End stack trace ***
ComputationClient already initialized

ComputationClient was never initialized, so it doesn't make sense to print ComputationClient already initialized.

@JackCaoG
Copy link
Collaborator

what would the error message be after your change?

Copy link
Collaborator

@jonb377 jonb377 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Will!

torch_xla/csrc/runtime/runtime.cc Show resolved Hide resolved
@will-cromar will-cromar force-pushed the wcromar/fix-bad-init-error-message branch from 0e06652 to fde5b87 Compare November 30, 2023 17:51
@will-cromar
Copy link
Collaborator Author

what would the error message be after your change?

It would just be the first error, $PJRT_DEVICE is not set.

@will-cromar will-cromar force-pushed the wcromar/fix-bad-init-error-message branch from fde5b87 to 4e3e9cb Compare December 11, 2023 18:40
@will-cromar
Copy link
Collaborator Author

There seems to be a consistent error during teardown during the GPU tests. Creating a GPU dev environment to see what's going on

@will-cromar
Copy link
Collaborator Author

Setting up a GPU VM is taking too long, so I came up with a better idea. We don't have to carefully prevent CreateClient from being called twice if we just make it anonymous and thus impossible to call twice.

Let's see if this works on GPU...

@will-cromar
Copy link
Collaborator Author

Some of this GPU tests are starting to pass, so it looks like this tweak works.


ComputationClient* client;
ComputationClient* GetComputationClient() {
static std::unique_ptr<ComputationClient> client = []() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this and the original approach, since both rely on the static initializer? It seems like the logic has just been moved out of CreateClient and into the lambda.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's right. This lets me skip checking the case where this function gets called twice, the handling of which causes the bad error message. C++11 statics will ensure that it only completes once, and the function is anonymous now, which will prevent future code from erroneously calling CreateClient somewhere else in the future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, the lambda is guaranteed to only ever be called once. Do we know how CreateClient was called twice in the first place?

Copy link
Collaborator Author

@will-cromar will-cromar Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that check to defend against future errors (likely by future me). There was never a case in the original code where CreateClient actually completed twice or was called concurrently.

The bad error messaging occurred when the constructor of PjRtComputationClient threw an exception. g_computation_client_initialized was never reset to false, GetComputationClientIfInitialized calls GetComputationClient during teardown, which re-runs CreateClient, which complains because g_computation_client_initialized never got reset.

In this PR, I only set g_computation_client_initialized after the actual runtime init completed without breaking, since there are guaranteed to be no other concurrent callers of CreateClient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants