From d1c6799b8870e513bf4f2305cbf6cda9fc3d773b Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Mon, 11 Nov 2024 15:21:12 -0800
Subject: [PATCH] [doc] update debugging guide (#10236)

Signed-off-by: youkaichao
---
 docs/source/getting_started/debugging.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index d40222bfd4da8..060599680be25 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -122,6 +122,8 @@ If you are testing with multi-nodes, adjust ``--nproc-per-node`` and ``--nnodes`
 
 If the script runs successfully, you should see the message ``sanity check is successful!``.
 
+If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as ``export NCCL_P2P_DISABLE=1`` to see if it helps. Please check `their documentation `__ for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
+
 .. note::
 
     A multi-node environment is more complicated than a single-node one. If you see errors such as ``torch.distributed.DistNetworkError``, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments: