
Commit 9c47a8e

Update doc from commit 5ac722a
torchxlabot2 committed Dec 19, 2023
1 parent 7305298 commit 9c47a8e
Showing 14 changed files with 18 additions and 18 deletions.
2 changes: 1 addition & 1 deletion master/_modules/index.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
2 changes: 1 addition & 1 deletion master/_modules/torch_xla/core/functions.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
2 changes: 1 addition & 1 deletion master/_modules/torch_xla/core/xla_model.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
2 changes: 1 addition & 1 deletion master/_modules/torch_xla/distributed/parallel_loader.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
2 changes: 1 addition & 1 deletion master/_modules/torch_xla/utils/serialization.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
2 changes: 1 addition & 1 deletion master/_modules/torch_xla/utils/utils.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
2 changes: 1 addition & 1 deletion master/genindex.html
@@ -226,7 +226,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
12 changes: 6 additions & 6 deletions master/index.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
@@ -1636,7 +1636,7 @@ Basic high-level understanding of some XLA details
<a class="reference external image-reference" href="assets/pytorchXLA_flow.svg"><img alt="img" src="assets/pytorchXLA_flow.svg" /></a>
For more details and examples, please refer to the LazyTensor guide (https://pytorch.org/blog/understanding-lazytensor-system-performance-with-pytorch-xla-on-cloud-tpu/).

The operations in the IR graph are executed only when values of tensors are needed. This is referred to as evaluation or materialization of tensors. Sometimes it is also called lazy evaluation, and it can lead to significant performance improvements (https://arxiv.org/pdf/2102.13267.pdf).
-The *synchronous* operations in PyTorch XLA, like printing, logging, checkpointing or callbacks, block tracing and result in slower execution. When an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromServer`, which results in slower performance.
+The *synchronous* operations in PyTorch XLA, like printing, logging, checkpointing or callbacks, block tracing and result in slower execution. When an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromDevice`, which results in slower performance.
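As a minimal sketch of this behavior (illustrative names, assuming the torch_xla 2.2-era API; not code from the changed pages):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# These lines only record operations into the IR graph; nothing executes yet.
xla_tensor_x = torch.randn(4, 4, device=device)
xla_tensor_y = torch.randn(4, 4, device=device)
xla_tensor_z = xla_tensor_x + xla_tensor_y

# print() needs the concrete value on the host: tracing blocks here, the
# subgraph computing xla_tensor_z runs, and the result is transferred back.
print(xla_tensor_z)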
A *barrier* is a special instruction that tells XLA to execute the IR graph and materialize the tensors. This means that the PyTorch XLA tensors will be evaluated, and the results will be available to the host. The user-exposed barrier in PyTorch XLA is xm.mark_step() (https://github.com/pytorch/xla/blob/bdceee54eca1269ee954f6cdd1868c584d0e88a4/torch_xla/core/xla_model.py#L808), which breaks the IR graph and results in code execution on the XLA devices. One of the key properties of `xm.mark_step` is that, unlike synchronous operations, it does not block further tracing while the device is executing the graph. However, it does block access to the values of the tensors being materialized.
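A sketch of the barrier in a loop (again illustrative, not code from the changed pages):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.zeros(2, 2, device=device)

for step in range(10):
    t = t + 1  # traced lazily into the IR graph

    # Cut the graph here and launch it on the device. Unlike print(t), this
    # does not block further tracing; only reading t's value would block.
    xm.mark_step()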
The example in the LazyTensor guide illustrates what happens in a simple case of adding two tensors. Now, suppose we have a for loop that adds XLA tensors and uses the value later:

for x, y in tensors_on_device:
@@ -1796,8 +1796,8 @@ Profiling and performance analysis
Now, let's examine the XL version of the model and do the same thing. We will add traces to the pipeline file (https://github.com/pytorch-tpu/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py) in the same way that we did for the 2.1 version and capture a profile.
<a class="reference external image-reference" href="assets/image-4.png"><img alt="Alt text" src="assets/image-4.png" /></a>
This time, in addition to the large gap in the middle, which is caused by the `pipe_watermark` tracing, there are many small gaps between the inference steps within this loop (https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L814-L830).
-First, look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded by `TransferFromServer`, which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into the watermark code (https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to the CPU and converted to numpy arrays in order to be processed with the `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave it as is.
-Now, if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because of the `TransferFromServer` operation.
+First, look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded by `TransferFromDevice`, which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into the watermark code (https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to the CPU and converted to numpy arrays in order to be processed with the `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave it as is.
+Now, if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because of the `TransferFromDevice` operation.
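The pattern behind these transfers is ordinary device-to-host conversion, roughly (a sketch with illustrative shapes; see the linked watermark code for the real version):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
images = torch.rand(1, 3, 64, 64, device=device)  # stand-in for generated images

# .cpu() materializes the graph producing `images` and triggers the
# device-to-host transfer (TransferFromDevice); .numpy() then hands the data
# to host-side libraries such as cv2 and pywt.
images_np = images.cpu().permute(0, 2, 3, 1).numpy()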
<a class="reference external image-reference" href="assets/image-3.png"><img alt="Alt text" src="assets/image-3.png" /></a>
If we investigate the U-Net function and the scheduler, we can see that the U-Net code does not contain any optimization targets for PyTorch/XLA. However, there are `.item()` and `.nonzero()` calls inside scheduler.step (https://github.com/huggingface/diffusers/blob/15782fd506e8c4a7c2b288fc2e558bd77fdfa51a/src/diffusers/schedulers/scheduling_euler_discrete.py#L371). We can rewrite the function (https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/schedulers/scheduling_euler_discrete.py#L310) to avoid those calls. If we fix this issue and rerun the profile, we will not see much difference. However, since we have reduced the device-host communication that was introducing smaller graphs, we have allowed the compiler to optimize the code better. The function scale_model_input (https://github.com/huggingface/diffusers/blob/15782fd506e8c4a7c2b288fc2e558bd77fdfa51a/src/diffusers/schedulers/scheduling_euler_discrete.py#L205) has similar issues, and we can fix them by making the same changes we made to the `step` function. Overall, since many of the gaps are caused by Python-level code tracing and graph building, they are not possible to optimize with the current version of PyTorch XLA, but we may see improvements in the future when dynamo is enabled in PyTorch XLA.
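The shape of such a rewrite can be sketched as follows (a hypothetical simplification, not the actual diffusers patch): instead of locating the current timestep with `.nonzero()`/`.item()`, which forces a device-to-host read on every step, the scheduler keeps a host-side counter.

import torch

class SchedulerSketch:
    def __init__(self, sigmas):
        self.sigmas = sigmas   # per-step coefficients, may live on the XLA device
        self._step_index = 0   # plain Python int, lives on the host

    def step(self, model_output, sample):
        # Indexing with a host-side int is traced lazily; no .item() or
        # .nonzero() call, so no TransferFromDevice is triggered here.
        sigma = self.sigmas[self._step_index]
        prev_sample = sample + sigma * model_output  # placeholder update rule
        self._step_index += 1
        return prev_sample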
@@ -1893,10 +1893,10 @@ PyTorch/XLA Debugging Tool
Perform An Auto-Metrics Analysis

The debugging tool will analyze the metrics report and provide a summary. Some example output would be:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">pt</span><span class="o">-</span><span class="n">xla</span><span class="o">-</span><span class="n">profiler</span><span class="p">:</span> <span class="n">CompileTime</span> <span class="n">too</span> <span class="n">frequent</span><span class="p">:</span> <span class="mi">21</span> <span class="n">counts</span> <span class="n">during</span> <span class="mi">11</span> <span class="n">steps</span>
<span class="n">pt</span><span class="o">-</span><span class="n">xla</span><span class="o">-</span><span class="n">profiler</span><span class="p">:</span> <span class="n">TransferFromServerTime</span> <span class="n">too</span> <span class="n">frequent</span><span class="p">:</span> <span class="mi">11</span> <span class="n">counts</span> <span class="n">during</span> <span class="mi">11</span> <span class="n">steps</span>
<span class="n">pt</span><span class="o">-</span><span class="n">xla</span><span class="o">-</span><span class="n">profiler</span><span class="p">:</span> <span class="n">TransferFromDeviceTime</span> <span class="n">too</span> <span class="n">frequent</span><span class="p">:</span> <span class="mi">11</span> <span class="n">counts</span> <span class="n">during</span> <span class="mi">11</span> <span class="n">steps</span>
<span class="n">pt</span><span class="o">-</span><span class="n">xla</span><span class="o">-</span><span class="n">profiler</span><span class="p">:</span> <span class="n">Op</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="ow">not</span> <span class="n">lowered</span><span class="p">:</span> <span class="n">aten</span><span class="p">::</span><span class="n">_ctc_loss</span><span class="p">,</span> <span class="n">aten</span><span class="p">::</span><span class="n">_ctc_loss_backward</span><span class="p">,</span> <span class="n">Please</span> <span class="nb">open</span> <span class="n">a</span> <span class="n">GitHub</span> <span class="n">issue</span> <span class="k">with</span> <span class="n">the</span> <span class="n">above</span> <span class="n">op</span> <span class="n">lowering</span> <span class="n">requests</span><span class="o">.</span>
<span class="n">pt</span><span class="o">-</span><span class="n">xla</span><span class="o">-</span><span class="n">profiler</span><span class="p">:</span> <span class="n">CompileTime</span> <span class="n">too</span> <span class="n">frequent</span><span class="p">:</span> <span class="mi">23</span> <span class="n">counts</span> <span class="n">during</span> <span class="mi">12</span> <span class="n">steps</span>
<span class="n">pt</span><span class="o">-</span><span class="n">xla</span><span class="o">-</span><span class="n">profiler</span><span class="p">:</span> <span class="n">TransferFromServerTime</span> <span class="n">too</span> <span class="n">frequent</span><span class="p">:</span> <span class="mi">12</span> <span class="n">counts</span> <span class="n">during</span> <span class="mi">12</span> <span class="n">steps</span>
<span class="n">pt</span><span class="o">-</span><span class="n">xla</span><span class="o">-</span><span class="n">profiler</span><span class="p">:</span> <span class="n">TransferFromDeviceTime</span> <span class="n">too</span> <span class="n">frequent</span><span class="p">:</span> <span class="mi">12</span> <span class="n">counts</span> <span class="n">during</span> <span class="mi">12</span> <span class="n">steps</span>
</pre></div>
</div>
</div>
Expand Down
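For context, summaries like the above come from the PyTorch/XLA debugging tool, which is switched on with the PT_XLA_DEBUG environment variable. A usage sketch (setting it at launch, e.g. PT_XLA_DEBUG=1 python train.py, is the usual route; the in-process setting below is assumed to work only if done before torch_xla is imported):

import os
os.environ["PT_XLA_DEBUG"] = "1"  # enable the pt-xla-profiler summaries

import torch_xla.core.xla_model as xm  # import after the flag is set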
2 changes: 1 addition & 1 deletion master/notes/source_of_recompilation.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
Binary file modified master/objects.inv
Binary file not shown.
2 changes: 1 addition & 1 deletion master/py-modindex.html
@@ -228,7 +228,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
2 changes: 1 addition & 1 deletion master/search.html
@@ -225,7 +225,7 @@
<div class="version">
-master (2.2.0+gitad14582 )
+master (2.2.0+git5ac722a )
</div>
2 changes: 1 addition & 1 deletion master/searchindex.js

Large diffs are not rendered by default.
