From f5c4d49325293f9841a08e80abd286eeec6a52d1 Mon Sep 17 00:00:00 2001 From: JackCaoG <59073027+JackCaoG@users.noreply.github.com> Date: Mon, 11 Dec 2023 10:14:55 -0800 Subject: [PATCH] Update Troubleshotting doc with new PT_XLA_DEBUG (#6039) --- TROUBLESHOOTING.md | 67 +++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 63 insertions(+), 4 deletions(-) diff --git a/TROUBLESHOOTING.md b/TROUBLESHOOTING.md index e2425cbc8280..03f478c730f7 100644 --- a/TROUBLESHOOTING.md +++ b/TROUBLESHOOTING.md @@ -54,9 +54,13 @@ The **first thing** to check when model is slow is to generate a metrics report. Metrics report is extremely helpful in diagnosing issues. Please try to include it in your bug report sent to us if you have it. -## Perform A Auto-Metrics Analysis +## PyTorch/XLA Debugging Tool -We provide ways to automatically analyze the metrics report and provide a summary. Simply run your workload with `PT_XLA_DEBUG=1`. Some example output would be +You can enable the PyTorch/XLA debugging tool by setting `PT_XLA_DEBUG=1`, which provides a couple useful debugging features. + +### Perform A Auto-Metrics Analysis + +The debugging tool will analyze the metrics report and provide a summary. Some example output would be ``` pt-xla-profiler: CompileTime too frequent: 21 counts during 11 steps @@ -66,6 +70,61 @@ pt-xla-profiler: CompileTime too frequent: 23 counts during 12 steps pt-xla-profiler: TransferFromServerTime too frequent: 12 counts during 12 steps ``` +### Compilation & Execution Analysis +The debugging tool will analyze every compilation and execution for your model. Some example output would be +``` +Compilation Analysis: ================================================================================ +Compilation Analysis: Compilation Cause +Compilation Analysis: user mark_step +Compilation Analysis: Graph Info: +Compilation Analysis: Graph Hash: 537d4b0264b029688281412214d252e9 +Compilation Analysis: Number of Graph Inputs: 588 +Compilation Analysis: Number of Graph Outputs: 320 +Compilation Analysis: Python Frame Triggered Execution: +Compilation Analysis: mark_step (/workspaces/dk2/pytorch/xla/torch_xla/core/xla_model.py:840) +Compilation Analysis: broadcast_master_param (/workspaces/dk2/pytorch/xla/torch_xla/core/xla_model.py:1230) +Compilation Analysis: train_imagenet (/workspaces/dk2/pytorch/xla/test/test_train_mp_imagenet.py:261) +Compilation Analysis: _mp_fn (/workspaces/dk2/pytorch/xla/test/test_train_mp_imagenet.py:365) +Compilation Analysis: __call__ (/workspaces/dk2/pytorch/xla/torch_xla/_internal/pjrt.py:176) +Compilation Analysis: _thread_fn (/workspaces/dk2/pytorch/xla/torch_xla/_internal/pjrt.py:70) +Compilation Analysis: run (/usr/local/lib/python3.8/concurrent/futures/thread.py:57) +Compilation Analysis: _worker (/usr/local/lib/python3.8/concurrent/futures/thread.py:80) +Compilation Analysis: .......... +Compilation Analysis: -------------------------------------------------------------------------------- +Compilation Analysis: ================================================================================ + +Execution Analysis: ================================================================================ +Execution Analysis: Execution Cause +Execution Analysis: user mark_step +Execution Analysis: Graph Info: +Execution Analysis: Graph Hash: 537d4b0264b029688281412214d252e9 +Execution Analysis: Number of Graph Inputs: 588 +Execution Analysis: Number of Graph Outputs: 320 +Execution Analysis: Python Frame Triggered Execution: +Execution Analysis: mark_step (/workspaces/dk2/pytorch/xla/torch_xla/core/xla_model.py:840) +Execution Analysis: broadcast_master_param (/workspaces/dk2/pytorch/xla/torch_xla/core/xla_model.py:1230) +Execution Analysis: train_imagenet (/workspaces/dk2/pytorch/xla/test/test_train_mp_imagenet.py:261) +Execution Analysis: _mp_fn (/workspaces/dk2/pytorch/xla/test/test_train_mp_imagenet.py:365) +Execution Analysis: __call__ (/workspaces/dk2/pytorch/xla/torch_xla/_internal/pjrt.py:176) +Execution Analysis: _thread_fn (/workspaces/dk2/pytorch/xla/torch_xla/_internal/pjrt.py:70) +Execution Analysis: run (/usr/local/lib/python3.8/concurrent/futures/thread.py:57) +Execution Analysis: _worker (/usr/local/lib/python3.8/concurrent/futures/thread.py:80) +Execution Analysis: .......... +Execution Analysis: -------------------------------------------------------------------------------- +Execution Analysis: ================================================================================ +``` + +Some common causes of Compilation/Executation are +1. User manually call `mark_step`. +2. [Parallel loader](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/distributed/parallel_loader.py#L49-L51) call `mark_step` for every x (configurable) batch. +3. Exiting a [profiler StepTrace region](https://github.com/pytorch/xla/blob/fe4af0080af07f78ca2b614dd91b71885a3bbbb8/torch_xla/debug/profiler.py#L165-L171). +4. Dynamo decide to compile/execute the graph. +5. User trying to access(often due to logging) the value of a tensor before the `mark_step`. + +The executation caused by 1-4 are expected, and we want to avoid 5 by either reduce the frequency of accessing tensor values or manually add a `mark_step` before accessing. + +Users should expect to see this `Compilation Cause` + `Executation Cause` pairs for first couple steps. After the model stabilize users should expect to only see `Execution Cause`. To use PyTorch/XLA efficiently, we expect the same models code to be run for every step and compilation only happen once for every graph. If you keep seeing `Compilation Cause`, you should try to dump the IR/HLO following [this section](#common-debugging-environment-variables-combinations) and compare the graphs for each step and understand the source of the differences. + Following section will explain how to get and understand a more detail metrics report. ## Get A Metrics Report @@ -300,12 +359,12 @@ only be enabled for debugging. * Record the graph execution in the IR format ``` - XLA_SAVE_TENSORS_FMT="hlo" XLA_SAVE_TENSORS_FILE="/tmp/save1.hlo" + XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_SAVE_TENSORS_FMT="hlo" XLA_SAVE_TENSORS_FILE="/tmp/save1.hlo" ``` * Record the graph execution in the HLO format ``` - XLA_SAVE_TENSORS_FMT="text" XLA_SAVE_TENSORS_FILE="/tmp/save1.ir" + XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_SAVE_TENSORS_FMT="text" XLA_SAVE_TENSORS_FILE="/tmp/save1.ir" ``` * Show debugging VLOG for runtime and graph compilation/execution