
Refactor Dynamo (custom op) integration code #5805

Merged: 3 commits from refactor_spmd_dynamo into master on Nov 27, 2023
Conversation

@yeounoh (Contributor) commented Nov 15, 2023

Refactor after #5712 cc @wonjoolee95 @JackCaoG

@yeounoh marked this pull request as draft November 15, 2023 07:22
@yeounoh (Contributor, Author) commented Nov 15, 2023

@JackCaoG I see this test failing, but the other dynamo tests are still passing:

======================================================================
FAIL: test_dynamo_input_sharding_threashold (__main__.DynamoSpmdInferenceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/spmd/test_dynamo_spmd.py", line 174, in test_dynamo_input_sharding_threashold
    self.assertTrue(torch_xla._XLAC._is_placecholder(dynamo_res))
AssertionError: False is not true

----------------------------------------------------------------------

Not sure what's happening; this fails consistently with and without the dynamo custom op. cc @wonjoolee95

@yeounoh self-assigned this Nov 15, 2023
@yeounoh changed the title from "Refactor and clean SPMD+Dynamo integration code" to "Refactor and clean Dynamo (custom op) integration code" Nov 15, 2023
@yeounoh changed the title from "Refactor and clean Dynamo (custom op) integration code" to "Refactor Dynamo (custom op) integration code" Nov 15, 2023
@JackCaoG (Collaborator)

Let me take a look later today

@yeounoh force-pushed the refactor_spmd_dynamo branch from 105e4fe to b4e7a5a on November 15, 2023 19:33
@yeounoh (Contributor, Author) commented Nov 15, 2023

> Let me take a look later today

The test requires multiple devices and a non-GPU setup, so it won't run on the CI -- it needs to be tested locally.

@yeounoh force-pushed the refactor_spmd_dynamo branch from d3a391f to 46a7eb8 on November 15, 2023 19:56
@JackCaoG (Collaborator)

OK, I can tell you what's supposed to happen. XLA_DYNAMO_INPUT_SHARDING_CHECK_THRESHOLD controls how many times dynamo will check whether the input sharding has changed.

```python
# if the input sharding was the same for skip_checking_input_sharding_threashold times
# we will skip checking the input sharding since it can be expensive.
if skip_checking_input_sharding_threashold > 0:
  if torch_xla._XLAC._get_xla_sharding_specs(
      args) != xla_args_sharding_spec:
    # update the xla_args with the input with new sharding and retrace
    xla_model.xla_args = args
    (xla_args_sharding_spec, args_and_ou_copy, graph_hash,
     arg_index_to_need_update_index, none_remover, graph_input_matcher,
     dumb_return_handler,
     xla_args_need_update) = extract_graph_helper(xla_model)
    skip_checking_input_sharding_threashold = xu.getenv_as(
        'XLA_DYNAMO_INPUT_SHARDING_CHECK_THRESHOLD', int, 5)
  else:
    skip_checking_input_sharding_threashold -= 1
```

After the threshold is reached, dynamo won't check the input sharding anymore. If we then change the input sharding, we will try to execute a compiled program with inputs that have a different sharding, so it will crash. The check in the test was to make sure the crash actually happened. I think you can just step through the test; you should see some C++ exception log during the try/catch.
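
In other words, a test exercising this path would roughly do the following. The sketch below is a hypothetical illustration, not the actual test_dynamo_input_sharding_threashold body; the 'openxla' backend string, the SPMD/mesh setup, and the loop structure are assumptions layered on top of the APIs already referenced in this thread.

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.utils.utils as xu
import torch_xla.distributed.spmd as xs

xr.use_spmd()  # assumed: the real test enables SPMD in its setup
device = xm.xla_device()
linear = torch.nn.Linear(128, 128).to(device)
dynamo_linear = torch.compile(linear, backend='openxla')  # backend name assumed

x = torch.randn(128, 128, device=device)
threshold = xu.getenv_as('XLA_DYNAMO_INPUT_SHARDING_CHECK_THRESHOLD', int, 5)

# Run with an unchanged input sharding until dynamo stops re-checking it.
for _ in range(threshold + 1):
  dynamo_res = dynamo_linear(x)
  xm.mark_step()

# Now change the input sharding. The cached executable no longer matches, so
# execution is expected to crash (caught in C++) and leave the output tensor
# as a placeholder instead of silently retracing.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(list(range(num_devices)), (num_devices, 1))
xs.mark_sharding(x, mesh, (0, 1))
dynamo_res = dynamo_linear(x)
assert torch_xla._XLAC._is_placecholder(dynamo_res)
```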

@yeounoh force-pushed the refactor_spmd_dynamo branch 2 times, most recently from 1dfd000 to 4e506b7 on November 22, 2023 01:11
@yeounoh force-pushed the refactor_spmd_dynamo branch from 4e506b7 to a6dac44 on November 22, 2023 01:16
@yeounoh marked this pull request as ready for review November 22, 2023 01:16
@yeounoh (Contributor, Author) commented Nov 22, 2023

> OK, I can tell you what's supposed to happen. XLA_DYNAMO_INPUT_SHARDING_CHECK_THRESHOLD controls how many times dynamo will check whether the input sharding has changed. [...]

Synced offline with @JackCaoG; he will help follow up. TL;DR: it's recompiling, and we need to check whether the threshold is being enforced.

```python
# TODO(yeounoh) - this actually returns False, which means that the program was
# recompiled with the new sharding change. We expect it to be True after a crash
# without recompilation. Disabling the test until we debug.
# self.assertTrue(torch_xla._XLAC._is_placecholder(dynamo_res))
```

@yeounoh (Contributor, Author) commented Nov 22, 2023

@JackCaoG @wonjoolee95 we can review and land this refactoring PR; I won't address the test issue here (also, the test is not run in the CPU/GPU CI).

@yeounoh force-pushed the refactor_spmd_dynamo branch 3 times, most recently from c998142 to 0664a30 on November 22, 2023 21:42
@yeounoh (Contributor, Author) commented Nov 22, 2023

Ok, found another test regression -- test_mark_sharding_inside_compile works with the torch nightly from 11/14/2023 but started failing with the latest (11/22/2023):

======================================================================
FAIL: test_mark_sharding_inside_compile (__main__.DynamoSpmdInferenceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/spmd/test_dynamo_spmd.py", line 232, in test_mark_sharding_inside_compile
    dynamo_res = dynamo_linear(xla_x)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "test/spmd/test_dynamo_spmd.py", line 32, in forward
    xs.mark_sharding(
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/runtime.py", line 78, in wrapper
    if not using_pjrt():
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/runtime.py", line 82, in resume_in_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/distributed/spmd/xla_sharding.py", line 499, in mark_sharding
    num_devices = xr.global_runtime_device_count()
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/distributed/spmd/xla_sharding.py", line 499, in resume_in_mark_sharding
    num_devices = xr.global_runtime_device_count()
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/distributed/spmd/xla_sharding.py", line 510, in resume_in_mark_sharding
    tile_assignment, group_assignment, replication_groups, sharding_type = _extract_op_sharding_specs(
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/distributed/spmd/xla_sharding.py", line 413, in _extract_op_sharding_specs
    def _extract_op_sharding_specs(mesh: Mesh, partition_spec: Tuple):
  File "/usr/local/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 4960, in forward
    return compiled_fn(full_args)
  File "/usr/local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 2017, in g
    return f(*args)
  File "/usr/local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 3164, in runtime_wrapper
    all_outs = call_func_with_args(
  File "/usr/local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 2041, in call_func_with_args
    out = normalize_as_list(f(args))
  File "/usr/local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 2145, in rng_functionalization_wrapper
    return compiled_fw(args)
  File "/usr/local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 2017, in g
    return f(*args)
  File "/usr/local/lib/python3.8/site-packages/torch/_dynamo/backends/torchxla.py", line 49, in fwd
    compiled_graph = bridge.extract_compiled_graph(model, args)
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/core/dynamo_bridge.py", line 540, in extract_compiled_graph
    extract_internal(fused_module), node.args, None)
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/core/dynamo_bridge.py", line 338, in extract_internal
    dumb_return_handler, xla_args_need_update) = extract_graph_helper(xla_model)
  File "/usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git4e506b7-py3.8-linux-x86_64.egg/torch_xla/core/dynamo_bridge.py", line 212, in extract_graph_helper
    assert all(
AssertionError: All tensors should be on xla

cc @JackCaoG @wonjoolee95
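
The assertion that trips here is the device check in dynamo_bridge.extract_graph_helper. Below is a minimal sketch of that kind of check; only the error message comes from the traceback, and the helper name and exact condition are assumptions for illustration, not the actual torch_xla code.

```python
import torch

def _check_all_args_on_xla(xla_args):
  # Every input handed to the XLA dynamo bridge must already live on an XLA
  # device; a CPU tensor here usually means something upstream produced a
  # non-XLA tensor before the bridge tried to trace the graph.
  assert all(
      arg.device.type == 'xla'
      for arg in xla_args
      if isinstance(arg, torch.Tensor)), 'All tensors should be on xla'
```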

@yeounoh force-pushed the refactor_spmd_dynamo branch 3 times, most recently from 7cbefd4 to acdd21b on November 22, 2023 23:29
@yeounoh force-pushed the refactor_spmd_dynamo branch from acdd21b to 1009476 on November 23, 2023 00:39
@yeounoh (Contributor, Author) commented Nov 27, 2023

> Ok, found another test regression -- test_mark_sharding_inside_compile works with the torch nightly from 11/14/2023 but started failing with the latest (11/22/2023): [...]

This works now.

@wonjoolee95 (Collaborator) left a comment

LGTM

@yeounoh merged commit 3385bd6 into master Nov 27, 2023
18 checks passed
lsy323 pushed a commit to lsy323/xla that referenced this pull request Nov 28, 2023
* Refactor and clean SPMD+Dynamo integration code
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
* Refactor and clean SPMD+Dynamo integration code
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
* Refactor and clean SPMD+Dynamo integration code
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
* Refactor and clean SPMD+Dynamo integration code