xm.save() should not set sync_xla_data=True when sync'ing. #8484

Merged: 1 commit into pytorch:master from fix_8422 on Dec 17, 2024

Conversation

@mcuiaws (Contributor) commented Dec 11, 2024

Setting sync_xla_data=True performs a tensor graph sync as if it were a mark step, which triggers buffer aliasing. However, according to the comments in https://github.com/pytorch/xla/blame/v2.5.1/torch_xla/csrc/xla_graph_executor.cpp#L1336, this is not safe unless all live tensors are being sync'd.

This fixes #8422
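
For context, a minimal sketch of the kind of change described here, assuming the CPU-transfer path used by xm.save() syncs through torch_xla._XLAC._xla_sync_multi; the helper name and surrounding code below are assumptions for illustration, not the verbatim diff:

```python
# Sketch only: the helper name and surrounding logic are assumed, not taken
# from the actual diff. The relevant part is the sync_xla_data flag.
import torch_xla


def _sync_tensors_for_save(tensors):
  # Sync the graphs backing `tensors` so their values can be copied to CPU.
  # sync_xla_data=False avoids treating this partial sync like a mark step,
  # so no input/output buffer aliasing is set up while other live tensors
  # are not part of the sync.
  torch_xla._XLAC._xla_sync_multi(
      tensors, devices=[], wait=True, sync_xla_data=False)
```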

@mcuiaws force-pushed the fix_8422 branch 2 times, most recently from c2af6fa to 3d48f2a on December 11, 2024 at 22:18
@tengyifei (Collaborator) left a comment:

Need to resolve conflicts before merging.

@@ -143,6 +144,28 @@ def try_grad_accum(model, device, train_x, train_label, accum_steps):
alias_count == 1.0
), f"Expect 1 input-output alias pair for gradient accumulation, got {alias_count}"

def test_xm_save_no_aliasing(self):
Collaborator:
I think this test is added in https://github.com/pytorch/xla/pull/8467/files already

Contributor Author (@mcuiaws):

rebased.
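
For reference, a hedged sketch of what such a no-aliasing check could look like, modeled on the InputOutputAliasCount metric used by the gradient-accumulation test in the diff above; the test that actually landed via #8467 may differ:

```python
import os
import tempfile

import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
t = torch.randn(4, 4, device=device)
unrelated = t + 1  # another live tensor that is *not* part of the save

with tempfile.TemporaryDirectory() as tmp:
  xm.save(t, os.path.join(tmp, 't.pt'))

# With sync_xla_data=False, saving must not have set up any buffer aliasing.
alias_data = met.metric_data('InputOutputAliasCount')
alias_count = 0.0 if alias_data is None else alias_data[1]
assert alias_count == 0.0, (
    f'Expected no input-output aliasing from xm.save(), got {alias_count}')
```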

Commit message:

Setting sync_xla_data=True performs tensor graph sync as if
it's a mark step, which triggers buffer aliasing to be performed.
However, it's not safe to do so unless all live tensors are
being sync'd.

Also fix torch_xla.utils.serialization.save() which has the same
issue.

This fixes pytorch#8422
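
A short usage sketch of the two save paths the commit message mentions; both sync XLA tensors to CPU before writing, and with this change neither should request a mark-step-style sync (file names here are arbitrary):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.utils.serialization as xser

t = torch.randn(2, 2, device=xm.xla_device())

xm.save(t, '/tmp/t_xm.pt')      # moves data to CPU, then torch.save()
xser.save(t, '/tmp/t_xser.pt')  # serialization-utility variant, same sync path
```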
@tengyifei merged commit d3ed982 into pytorch:master on Dec 17, 2024
12 checks passed
Successfully merging this pull request may close these issues:

xm.save() should not trigger buffer aliasing (#8422)