Merge pull request #317 from aws-samples/sean-smith-patch-2
Add NCCL_DEBUG to efa-cheatsheet.md
KeitaW authored May 7, 2024
2 parents 687bd79 + 61807f5 commit 6405b1b
Showing 1 changed file with 1 addition and 0 deletions.
1.architectures/efa-cheatsheet.md
@@ -7,6 +7,7 @@ versions of your libfabric.

| Setting | Explanation |
| ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `NCCL_DEBUG=info` | Set this to print NCCL debug information, which lets you verify that NCCL is using EFA and which versions are in use. It produces verbose output, so we advise turning it off unless you suspect NCCL issues. See [NCCL_DEBUG](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) for more info. |
| `FI_EFA_USE_HUGE_PAGE=0` | Set to 0 when `os.fork()` raises `OSError: Cannot allocate memory`, which typically happens with a multi-process PyTorch data loader. Disabling huge pages incurs a minor performance hit, but it is needed to prevent fork failures caused by the operating system running out of huge pages. |
| `FI_EFA_FORK_SAFE=1` | Not needed for kernel>=5.15. Setting it anyway is harmless but has no effect. See [ref](https://github.com/ofiwg/libfabric/pull/9112). |
| `FI_EFA_USE_DEVICE_RDMA=1` | Do not set for libfabric>=1.18.0 and aws-ofi-nccl>=1.7.0. Setting it on p4/p5 with the newer software is not harmful, but it is unnecessary. |
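As a minimal sketch of where these settings go, the variables can be exported before launching a distributed job. The `torchrun` invocation and `train.py` script below are placeholders, not part of the cheatsheet:

```shell
# Verbose NCCL logging to confirm EFA is in use; turn off for production runs.
export NCCL_DEBUG=info

# Work around fork failures from multi-process data loaders (see table above).
export FI_EFA_USE_HUGE_PAGE=0

# Placeholder launch command; substitute your actual training entrypoint.
# torchrun --nproc_per_node=8 train.py
echo "NCCL_DEBUG=$NCCL_DEBUG"
```

With `NCCL_DEBUG=info` set, the job's stdout should include lines naming the selected network provider (e.g. whether the `aws-ofi-nccl` plugin picked EFA), which is the quickest way to confirm the transport in use.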
