Fix bad error message when `PjRtComputationClient` throws exception #5946

will-cromar · 2023-11-29T23:41:42Z

When the PjRtComputationClient constructor throws, g_computation_client_initialized doesn't get reset, which leads to incorrect behavior in GetComputationClientIfInitialized. Make CreateClient anonymous so we can't call it twice by accident.

Example from @jonb377:

ptxla@t1v-n-f494979e-w-0:/workspaces/work/pytorch/xla$ python -c 'import torch_xla; torch_xla._XLAC._xla_get_runtime_devices()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: torch_xla/csrc/runtime/runtime.cc:27 : $PJRT_DEVICE is not set.

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/workspaces/work/pytorch/xla/torch_xla/__init__.py", line 127, in _prepare_to_exit
    _XLAC._prepare_to_exit()
RuntimeError: torch_xla/csrc/runtime/runtime.cc:17 : Check failed: !was_initialized 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()

        torch_xla::runtime::GetComputationClient()


        PyCFunction_Call
        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall
        PyVectorcall_Call

        Py_FinalizeEx
        Py_RunMain
        Py_BytesMain
        __libc_start_main
        _start
*** End stack trace ***
ComputationClient already initialized

ComputationClient was never initialized, so it doesn't make sense to print ComputationClient already initialized.
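A minimal sketch of how a stale flag can produce that second, misleading error. Every name here (FakeClient, CreateClientBuggy, g_initialized, ReproduceMisleadingError) is a hypothetical stand-in, and the use of exchange is an assumption inferred from the `!was_initialized` check in the trace; this is not the actual runtime.cc code.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PjRtComputationClient: the constructor throws
// when the (pretend) device configuration is missing.
struct FakeClient {
  explicit FakeClient(bool device_set) {
    if (!device_set) throw std::runtime_error("$PJRT_DEVICE is not set.");
  }
};

// Hypothetical stand-in for g_computation_client_initialized.
std::atomic<bool> g_initialized{false};

// Assumed shape of the old logic: the flag flips before construction,
// so a throwing constructor leaves it stuck at true.
std::unique_ptr<FakeClient> CreateClientBuggy(bool device_set) {
  bool was_initialized = g_initialized.exchange(true);
  if (was_initialized) {
    throw std::logic_error("ComputationClient already initialized");
  }
  return std::make_unique<FakeClient>(device_set);  // may throw
}

// Returns the message of the exception raised by one attempt, or "" on success.
std::string AttemptMessage(bool device_set) {
  try {
    CreateClientBuggy(device_set);
    return "";
  } catch (const std::exception& e) {
    return e.what();
  }
}

// Walks through the two-step failure: real error first, misleading one second.
void ReproduceMisleadingError() {
  std::string first = AttemptMessage(false);   // e.g. the user-facing call
  std::string second = AttemptMessage(false);  // e.g. the teardown path
  assert(first == "$PJRT_DEVICE is not set.");
  assert(second == "ComputationClient already initialized");
}
```

Calling ReproduceMisleadingError() from a test or from main exercises both attempts. The second attempt never reaches the constructor, which is why the real cause is hidden behind the "already initialized" message.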

torch_xla/csrc/runtime/runtime.cc

JackCaoG · 2023-11-30T00:30:43Z

what would the error message be after your change?

jonb377

Thanks Will!

torch_xla/csrc/runtime/runtime.cc

will-cromar · 2023-11-30T17:52:33Z

> what would the error message be after your change?

It would just be the first error, $PJRT_DEVICE is not set.

will-cromar · 2023-12-11T19:39:36Z

There seems to be a consistent teardown error in the GPU tests. Creating a GPU dev environment to see what's going on.

will-cromar · 2023-12-11T21:44:11Z

Setting up a GPU VM is taking too long, so I came up with a better idea. We don't have to carefully prevent CreateClient from being called twice if we just make it anonymous and thus impossible to call twice.

Let's see if this works on GPU...

will-cromar · 2023-12-11T23:06:32Z

Some of the GPU tests are starting to pass, so it looks like this tweak works.

jonb377 · 2023-12-11T23:51:00Z

torch_xla/csrc/runtime/runtime.cc

-
-  ComputationClient* client;
+ComputationClient* GetComputationClient() {
+  static std::unique_ptr<ComputationClient> client = []() {


What's the difference between this and the original approach, since both rely on the static initializer? It seems like the logic has just been moved out of CreateClient and into the lambda.

will-cromar · 2023-12-12

Yeah, that's right. This lets me skip handling the case where this function gets called twice, which is what caused the bad error message. C++11 static initialization ensures that the lambda completes only once, and because the function is now anonymous, future code can't erroneously call CreateClient somewhere else.

jonb377 · 2023-12-12

Ah I see, the lambda is guaranteed to only ever be called once. Do we know how CreateClient was called twice in the first place?

will-cromar · 2023-12-12

I added that check to defend against future errors (likely by future me). There was never a case in the original code where CreateClient actually completed twice or was called concurrently.

The bad error message occurred when the constructor of PjRtComputationClient threw an exception and g_computation_client_initialized was never reset to false. During teardown, GetComputationClientIfInitialized calls GetComputationClient, which re-runs CreateClient, which complains because the flag is still set.

In this PR, I only set g_computation_client_initialized after the actual runtime init completes without throwing, since there are guaranteed to be no other concurrent callers of CreateClient.
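A rough sketch of the pattern discussed in this thread: a function-local static initialized by an immediately invoked lambda, with the flag set only after construction succeeds. All names (FakeClient, GetClient, GetClientIfInitialized, g_device_set, g_init_attempts, CheckFailedInitCanBeRetried) are hypothetical and only loosely mirror the diff above; the real runtime.cc may differ. One standard C++ detail it relies on: if the initializer of a local static throws, the variable is not considered initialized, and the next call attempts initialization again.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for PjRtComputationClient.
struct FakeClient {
  explicit FakeClient(bool device_set) {
    if (!device_set) throw std::runtime_error("$PJRT_DEVICE is not set.");
  }
};

std::atomic<bool> g_initialized{false};  // stand-in for the real flag
bool g_device_set = false;               // stand-in for the environment
int g_init_attempts = 0;                 // counts initializer runs (demo only)

FakeClient* GetClient() {
  // The lambda has no name, so no other code can call it. C++11 guarantees
  // thread-safe initialization of local statics.
  static std::unique_ptr<FakeClient> client = []() {
    ++g_init_attempts;
    auto created = std::make_unique<FakeClient>(g_device_set);  // may throw
    g_initialized = true;  // reached only if construction succeeded
    return created;
  }();
  return client.get();
}

// Teardown-style accessor: never triggers initialization on its own.
FakeClient* GetClientIfInitialized() {
  return g_initialized.load() ? GetClient() : nullptr;
}

// Demonstrates that a failed init leaves the flag false and can be retried.
void CheckFailedInitCanBeRetried() {
  g_device_set = false;
  bool threw = false;
  try {
    GetClient();
  } catch (const std::runtime_error&) {
    threw = true;
  }
  assert(threw);
  assert(GetClientIfInitialized() == nullptr);  // no misleading second error

  g_device_set = true;
  FakeClient* client = GetClient();  // initializer runs again and succeeds
  assert(client != nullptr);
  assert(GetClientIfInitialized() == client);
}
```

In this sketch the pretend environment changes between attempts only to show the retry. If it stayed unset, a later GetClient() call would throw the same "$PJRT_DEVICE is not set." error again rather than an "already initialized" one.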

will-cromar added the runtime label Nov 29, 2023

will-cromar requested review from jonb377 and JackCaoG November 29, 2023 23:41

JackCaoG reviewed Nov 30, 2023

jonb377 reviewed Nov 30, 2023

will-cromar force-pushed the wcromar/fix-bad-init-error-message branch from 0e06652 to fde5b87 Compare November 30, 2023 17:51

JackCaoG added the backport_2.2 label Dec 1, 2023

will-cromar added 3 commits December 11, 2023 18:40

ea7c75c Fix confusing error message when PjRtComputationClient throws exception
56470fa format
4e3e9cb remove extra default arg

will-cromar force-pushed the wcromar/fix-bad-init-error-message branch from fde5b87 to 4e3e9cb Compare December 11, 2023 18:40

jonb377 approved these changes Dec 11, 2023

27d02ee Make CreateClient anonymous

will-cromar requested review from jonb377 and JackCaoG December 11, 2023 23:06

jonb377 reviewed Dec 11, 2023

ae4a86a exchange -> =

will-cromar merged commit 280ca1d into master Dec 12, 2023
19 checks passed

will-cromar added a commit that referenced this pull request Dec 13, 2023

7ef93c3 Fix bad error message when PjRtComputationClient throws exception (#5946)

will-cromar mentioned this pull request Dec 13, 2023

[Backport] Fix bad error message when PjRtComputationClient throws exception #6144

Merged

chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023

3e4af02 Fix bad error message when PjRtComputationClient throws exception (pytorch#5946)

golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024

f433dd5 Fix bad error message when PjRtComputationClient throws exception (#5946)

bhavya01 pushed a commit that referenced this pull request Apr 22, 2024

4723d96 Fix bad error message when PjRtComputationClient throws exception (#5946)
