diff --git a/1.architectures/efa-cheatsheet.md b/1.architectures/efa-cheatsheet.md
index 83362089..74f235b9 100644
--- a/1.architectures/efa-cheatsheet.md
+++ b/1.architectures/efa-cheatsheet.md
@@ -7,6 +7,7 @@ versions of your libfabric.
 | Setting                         | Explanation                                                                                                                                                                                                                                                                                                                                            |
 | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `NCCL_DEBUG=info`               | Set this to print NCCL debug information, which lets you verify that NCCL is using EFA and which library versions it loads. The output is verbose, so we advise enabling it only when you suspect NCCL issues; see [NCCL_DEBUG](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) for more info.                            |
 | `FI_EFA_USE_HUGE_PAGE=0`        | Set to 0 when you see `os.fork()` causes `OSError: Cannot allocate memory`. Typically happen by multi-process PyTorch data loader. Disabling huge page causes minor performance hit, but it's needed to prevent fork fails due to the operating system running out of huge pages. |
 | `FI_EFA_FORK_SAFE=1`            | Not needed for kernel>=5.15. Still fine to set it though no effect. See [ref](https://github.com/ofiwg/libfabric/pull/9112). |
 | `FI_EFA_USE_DEVICE_RDMA=1`      | Do not set for libfabric>=1.18.0 and aws-ofi-nccl>=1.7.0. It's not harmful to set it on p4/p5 on the newer software, but you just don't have to set it. |
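
These settings are all plain environment variables, so a minimal sketch of how they might be exported before a training launch is shown below. The variable values and the commented conditions are taken from the table above; the idea of grouping them in a launch script is an assumption, not part of the cheatsheet itself.

```shell
# Sketch of a launch environment using the variables from the cheatsheet.
# Enable only what your situation calls for; these are examples, not defaults.

export NCCL_DEBUG=info         # verbose NCCL logs; confirms EFA is in use
export FI_EFA_USE_HUGE_PAGE=0  # only if os.fork() fails with "Cannot allocate memory"
# export FI_EFA_FORK_SAFE=1    # not needed on kernel>=5.15; harmless to set
# export FI_EFA_USE_DEVICE_RDMA=1  # not needed for libfabric>=1.18.0 + aws-ofi-nccl>=1.7.0

echo "NCCL_DEBUG=${NCCL_DEBUG}"
```

In practice you would source such a script (or inline the exports) before invoking your distributed training command, so the variables propagate to every rank on the node.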