Merge pull request #317 from aws-samples/sean-smith-patch-2
Add NCCL_DEBUG to efa-cheatsheet.md
KeitaW authored May 7, 2024
2 parents 687bd79 + 61807f5 commit 6405b1b
Showing 1 changed file with 1 addition and 0 deletions.
1.architectures/efa-cheatsheet.md
@@ -7,6 +7,7 @@ versions of your libfabric.

| Setting | Explanation |
| ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `NCCL_DEBUG=info` | Set this to print NCCL debug information, which lets you verify that NCCL is using EFA and which versions are in use. It produces verbose output, so we advise turning it off unless you suspect NCCL issues. See [NCCL_DEBUG](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) for more info. |
| `FI_EFA_USE_HUGE_PAGE=0` | Set to 0 when `os.fork()` raises `OSError: Cannot allocate memory`, which typically happens with a multi-process PyTorch data loader. Disabling huge pages incurs a minor performance hit, but it is needed to prevent fork failures caused by the operating system running out of huge pages. |
| `FI_EFA_FORK_SAFE=1` | Not needed for kernel>=5.15. Setting it anyway is harmless but has no effect. See [ref](https://github.com/ofiwg/libfabric/pull/9112). |
| `FI_EFA_USE_DEVICE_RDMA=1` | Do not set for libfabric>=1.18.0 and aws-ofi-nccl>=1.7.0. Setting it on p4/p5 with the newer software is not harmful, but it is unnecessary. |
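As a minimal sketch of where these settings go, the variables can be exported before launching a distributed job. The `torchrun` invocation and `train.py` script below are placeholders, not part of the cheatsheet:

```shell
# Verbose NCCL logging to confirm EFA is in use; turn off for production runs.
export NCCL_DEBUG=info

# Work around fork failures from multi-process data loaders (see table above).
export FI_EFA_USE_HUGE_PAGE=0

# Placeholder launch command; substitute your actual training entrypoint.
# torchrun --nproc_per_node=8 train.py
echo "NCCL_DEBUG=$NCCL_DEBUG"
```

With `NCCL_DEBUG=info` set, the job's stdout should include lines naming the selected network provider (e.g. whether the `aws-ofi-nccl` plugin picked EFA), which is the quickest way to confirm the transport in use.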
