Update doc from commit b0dca12

torchxlabot2 committed Oct 27, 2023
1 parent d8e179d commit f147579
Showing 14 changed files with 59 additions and 15 deletions.
2 changes: 1 addition & 1 deletion master/_modules/index.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


2 changes: 1 addition & 1 deletion master/_modules/torch_xla/core/functions.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


2 changes: 1 addition & 1 deletion master/_modules/torch_xla/core/xla_model.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


2 changes: 1 addition & 1 deletion master/_modules/torch_xla/distributed/parallel_loader.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


2 changes: 1 addition & 1 deletion master/_modules/torch_xla/utils/serialization.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


2 changes: 1 addition & 1 deletion master/_modules/torch_xla/utils/utils.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


2 changes: 1 addition & 1 deletion master/genindex.html
@@ -226,7 +226,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


50 changes: 47 additions & 3 deletions master/index.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


@@ -326,6 +326,8 @@
</ul>
</li>
<li><a class="reference internal" href="#gpu">GPU</a></li>
<li><a class="reference internal" href="#single-node-gpu-training">Single-node GPU training</a></li>
<li><a class="reference internal" href="#multi-node-gpu-training">Multi-node GPU training</a></li>
</ul>
</li>
<li><a class="reference internal" href="#differences-from-xrt">Differences from XRT</a><ul>
@@ -2289,13 +2291,53 @@ <h4>Docker<a class="headerlink" href="#docker" title="Permalink to this headline
<div class="section" id="gpu">
<h3>GPU<a class="headerlink" href="#gpu" title="Permalink to this headline"></a></h3>
<p><em>Warning: GPU support is still highly experimental!</em></p>
</div>
<div class="section" id="single-node-gpu-training">
<h3>Single-node GPU training<a class="headerlink" href="#single-node-gpu-training" title="Permalink to this headline"></a></h3>
<p>To use GPUs with PJRT, simply set <code class="docutils literal notranslate"><span class="pre">PJRT_DEVICE=GPU</span></code> and configure
<code class="docutils literal notranslate"><span class="pre">GPU_NUM_DEVICES</span></code> to the number of devices on the host. For example:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">PJRT_DEVICE</span><span class="o">=</span><span class="n">GPU</span> <span class="n">GPU_NUM_DEVICES</span><span class="o">=</span><span class="mi">4</span> <span class="n">python3</span> <span class="n">xla</span><span class="o">/</span><span class="n">test</span><span class="o">/</span><span class="n">test_train_mp_imagenet</span><span class="o">.</span><span class="n">py</span> <span class="o">--</span><span class="n">fake_data</span> <span class="o">--</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span> <span class="o">--</span><span class="n">num_epochs</span><span class="o">=</span><span class="mi">1</span>
</pre></div>
</div>
<p>Currently, only a single host is supported; multi-host GPU cluster support
will be added in a future release.</p>
<p>You can also use <code class="docutils literal notranslate"><span class="pre">torchrun</span></code> to launch single-node multi-GPU training. For example:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node ${NUM_GPU_DEVICES} xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
</pre></div>
</div>
<p>In the above example, <code class="docutils literal notranslate"><span class="pre">--nnodes</span></code> specifies how many machines (physical machines or VMs) to use (1 here, since this is single-node training), and <code class="docutils literal notranslate"><span class="pre">--nproc-per-node</span></code> specifies how many GPU devices to use.</p>
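<p>As a minimal, hypothetical sketch (not part of the test script above, and assuming <code class="docutils literal notranslate"><span class="pre">torch_xla</span></code> is installed), a script launched this way can report which XLA device, rank, and world size each process received:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span># Minimal sketch: report the XLA device each torchrun-launched process received.
# Assumes the script is started with torchrun and PJRT_DEVICE=GPU as shown above.
import torch_xla.core.xla_model as xm

def report_topology():
    device = xm.xla_device()          # XLA device assigned to this process
    rank = xm.get_ordinal()           # global rank of this process
    world_size = xm.xrt_world_size()  # total number of participating devices
    print(f'rank {rank}/{world_size} is using {device}')

if __name__ == '__main__':
    report_topology()
</pre></div>
</div>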
</div>
<div class="section" id="multi-node-gpu-training">
<h3>Multi-node GPU training<a class="headerlink" href="#multi-node-gpu-training" title="Permalink to this headline"></a></h3>
<p><strong>Note that this feature only works with CUDA 12+.</strong> As with multi-node training in PyTorch, you can run the command as follows:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>PJRT_DEVICE=GPU torchrun \
--nnodes=${NUMBER_GPU_VM} \
--node_rank=${CURRENT_NODE_RANK} \
--nproc_per_node=${NUMBER_LOCAL_GPU_DEVICES} \
--rdzv_endpoint=&lt;internal_ip_address:port&gt; multinode_training.py
</pre></div>
</div>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">--nnodes</span></code>: the number of GPU machines to use.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">--node_rank</span></code>: the index of the current GPU machine; valid values are 0, 1, …, ${NUMBER_GPU_VM}-1.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">--nproc_per_node</span></code>: the number of GPU devices to use on the current machine.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">--rdzv_endpoint</span></code>: the endpoint of the GPU machine with node_rank==0, in the form <code class="docutils literal notranslate"><span class="pre">&lt;host&gt;:&lt;port&gt;</span></code>. The host is the internal IP address of that machine, and the port can be any available port on it.</p></li>
</ul>
<p>For example, to train on two GPU machines, machine_0 and machine_1, run the following on the first machine (machine_0):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># PJRT_DEVICE=GPU torchrun \</span>
<span class="o">--</span><span class="n">nnodes</span><span class="o">=</span><span class="mi">2</span> \
<span class="o">--</span><span class="n">node_rank</span><span class="o">=</span><span class="mi">0</span> \
<span class="o">--</span><span class="n">nproc_per_node</span><span class="o">=</span><span class="mi">4</span> \
<span class="o">--</span><span class="n">rdzv_endpoint</span><span class="o">=</span><span class="s2">&quot;&lt;MACHINE_0_IP_ADDRESS&gt;:12355&quot;</span> <span class="n">pytorch</span><span class="o">/</span><span class="n">xla</span><span class="o">/</span><span class="n">test</span><span class="o">/</span><span class="n">test_train_mp_imagenet</span><span class="o">.</span><span class="n">py</span> <span class="o">--</span><span class="n">fake_data</span> <span class="o">--</span><span class="n">pjrt_distributed</span> <span class="o">--</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span> <span class="o">--</span><span class="n">num_epochs</span><span class="o">=</span><span class="mi">1</span>
</pre></div>
</div>
<p>On the second GPU machine, run</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># PJRT_DEVICE=GPU torchrun \</span>
<span class="o">--</span><span class="n">nnodes</span><span class="o">=</span><span class="mi">2</span> \
<span class="o">--</span><span class="n">node_rank</span><span class="o">=</span><span class="mi">1</span> \
<span class="o">--</span><span class="n">nproc_per_node</span><span class="o">=</span><span class="mi">4</span> \
<span class="o">--</span><span class="n">rdzv_endpoint</span><span class="o">=</span><span class="s2">&quot;&lt;MACHINE_0_IP_ADDRESS&gt;:12355&quot;</span> <span class="n">pytorch</span><span class="o">/</span><span class="n">xla</span><span class="o">/</span><span class="n">test</span><span class="o">/</span><span class="n">test_train_mp_imagenet_torchrun</span><span class="o">.</span><span class="n">py</span> <span class="o">--</span><span class="n">fake_data</span> <span class="o">--</span><span class="n">pjrt_distributed</span> <span class="o">--</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span> <span class="o">--</span><span class="n">num_epochs</span><span class="o">=</span><span class="mi">1</span>
</pre></div>
</div>
<p>The differences between the two commands above are <code class="docutils literal notranslate"><span class="pre">--node_rank</span></code> and, potentially, <code class="docutils literal notranslate"><span class="pre">--nproc_per_node</span></code> if you want to use a different number of GPU devices on each machine. Everything else is identical.</p>
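<p>For a custom training script, collectives across the two machines require a <code class="docutils literal notranslate"><span class="pre">torch.distributed</span></code> process group using the <code class="docutils literal notranslate"><span class="pre">xla</span></code> backend. The following is a minimal, hypothetical sketch (assuming the <code class="docutils literal notranslate"><span class="pre">xla://</span></code> init method provided by <code class="docutils literal notranslate"><span class="pre">torch_xla</span></code>); it is not the test script used above:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span># Minimal sketch: initialize torch.distributed with the XLA backend under torchrun.
# Hypothetical stand-alone example, not the ImageNet test script shown above.
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' backend

def main():
    # torchrun supplies RANK/WORLD_SIZE; the xla:// init method reads them.
    dist.init_process_group('xla', init_method='xla://')
    device = xm.xla_device()
    tensor = torch.ones(2, 2, device=device) * dist.get_rank()
    dist.all_reduce(tensor)  # sums the tensor across all processes on both machines
    xm.mark_step()
    print(f'rank {dist.get_rank()}: {tensor}')

if __name__ == '__main__':
    main()
</pre></div>
</div>
<p>Such a script could be launched with the same <code class="docutils literal notranslate"><span class="pre">torchrun</span></code> invocations shown above, substituting its path for the test script.</p>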
</div>
</div>
<div class="section" id="differences-from-xrt">
@@ -3460,6 +3502,8 @@ <h3>Running Resnet50 example with SPMD<a class="headerlink" href="#running-resne
</ul>
</li>
<li><a class="reference internal" href="#gpu">GPU</a></li>
<li><a class="reference internal" href="#single-node-gpu-training">Single-node GPU training</a></li>
<li><a class="reference internal" href="#multi-node-gpu-training">Multi-node GPU training</a></li>
</ul>
</li>
<li><a class="reference internal" href="#differences-from-xrt">Differences from XRT</a><ul>
2 changes: 1 addition & 1 deletion master/notes/source_of_recompilation.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


Binary file modified master/objects.inv
2 changes: 1 addition & 1 deletion master/py-modindex.html
@@ -228,7 +228,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


2 changes: 1 addition & 1 deletion master/search.html
@@ -225,7 +225,7 @@


<div class="version">
master (2.2.0+git421ba3e )
master (2.2.0+gitb0dca12 )
</div>


2 changes: 1 addition & 1 deletion master/searchindex.js

Large diffs are not rendered by default.
