- Amazon Linux
- Amazon Linux 2
- Red Hat Enterprise Linux 7.0 and 8.0
- Ubuntu 18.04 and 20.04 LTS
- CentOS 7 and 8
For releases before v1.6.0, there were generally two slightly different releases for any version, an AWS-specific release and a general release. With v1.6.0, we have unified the code and made the AWS-specific parts a compile-time option. When a feature (or entire release) was only available in one of the two variants, we note that in the release notes.
This release requires Libfabric v1.11.0 or later and supports NCCL v2.17.1-1 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.17.1.
New Features:
- Add AWS platform specific code to `master` branch to support single-branch development and release model.
- Follow Automake conventions for Makefiles.
- Remove Travis support, as the plugin is tested using internal AWS CI infrastructure.
Bug Fixes:
- Avoid topology update if NCCL_TOPO_FILE is already set
- Inline allocate_stack(..) and free_stack(..) in include/stack.h
- Shortcut parameter lookup to avoid locks in fast-path.
- Free self connecting request after network transfer completes.
- Fix TCP provider on AWS p3dn by filtering the provider list before duplicating info entries.
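The `NCCL_TOPO_FILE` fix above amounts to a "do not override the user" check. The following is a minimal sketch of that idea, not the plugin's actual code: the helper name `maybe_export_topo_file` and the path passed to it are hypothetical, and only the standard `getenv`/`setenv` calls are assumed.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/*
 * Hypothetical helper (not taken from the plugin source).
 * Exports NCCL_TOPO_FILE only when the user has not already set it.
 * Returns 1 if the variable was exported, 0 if an existing value was
 * left untouched, and -1 if setenv() failed.
 */
static int maybe_export_topo_file(const char *generated_path)
{
    /* A value set by the user always wins; skip the topology update. */
    if (getenv("NCCL_TOPO_FILE") != NULL) {
        return 0;
    }

    /* overwrite = 0 is a second guard against clobbering a value. */
    if (setenv("NCCL_TOPO_FILE", generated_path, 0) != 0) {
        return -1;
    }

    return 1;
}
```

A caller would invoke `maybe_export_topo_file("/path/to/generated-topo.xml")` once during initialization and treat a return value of 0 as "the user supplied a topology, nothing to do".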
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code and the nccl-tests test suite:
- efa
- tcp;ofi_rxm
There was no general 1.5.0 release; it was limited to an AWS release. This release requires Libfabric v1.11.0 or later and supports NCCL v2.16.2 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.16.1.
New Features:
- A single plugin build can now be used with multiple NCCL versions simultaneously (from NCCL v2.4.8 forward). As a result, the `--with-nccl` argument is no longer necessary when building the plugin.
- Support for Trainium-based instance types. Most users should continue to use the plugin that is included with the Neuron software stack, rather than building this plugin from scratch.
- Add support for flushing using CUDA's `cudaDeviceFlushGPUDirectRDMAWrites()` call rather than a read from the NIC. We find the default read from the NIC to perform better for most situations.
Bug Fixes:
- Improve performance of small messages by removing redundant initialization of internal structures and redundant correctness checks throughout the codebase.
- Improve performance of applications with multiple active proxy threads.
- Improve pacing of Libfabric request completion polling, which will reduce stack memory utilization in many cases.
- Fix some compiler warnings.
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- efa
This release requires Libfabric v1.11.0 or later and supports NCCL v2.12.12 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.15.1.
New Features:
- Allow users to disable building the unit tests.
- Allow enable_debug flag to configure
Bug Fixes:
- Fix compilation on CentOS 7.
- Update tag generation for control messages.
- Check for required MPI headers to build unit tests.
- Fix the active connection issue for non-blocking accepts (impacts NCCL versions 2.12 and above).
- Fix EFA_NIC_DUP when only a single GPU is visible (AWS release only).
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
This release requires Libfabric v1.11.0 or later and supports NCCL v2.12.10 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.14.0.
New Features:
- Log error-ed request entry.
- Add P4De topology (AWS release only).
Bug Fixes:
- Retry `fi_cq_readerr` until error-ed request entry is available.
- Fix crash for providers supporting multi-rail devices.
- Retry `fi_cq_readerr` until error-ed request entry is available and log it (AWS release only).
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
- psm3
This release requires Libfabric v1.11.0 or later and supports NCCL v2.12.7 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.14.0.
New Features:
- Add support for NCCL v2.12 with backwards compatibility to previous NCCL versions.
Bug Fixes:
- Prevent deadlock in connection establishment when using rendezvous providers.
- Enable flush operations for providers that do not require memory registration.
- Enable successful runs of unit-tests with flush disabled.
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
- psm3
This release requires Libfabric v1.11.0 or later and supports NCCL v2.11.4 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.14.0.
New Features:
- Make use of FI_EFA_FORK_SAFE environment variable to allow Libfabric to detect when `MADV_DONTFORK` is not needed (#82). This feature requires Libfabric v1.13.0 or higher. When used with an older version of Libfabric, the plugin will continue to set the RDMAV_FORK_SAFE environment variable.
- Do not request FI_PROGRESS_AUTO feature when listing OFI providers; this feature is unnecessary for the plugin and not requesting it improves interoperability.
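The version-dependent choice between the two fork-safety variables can be sketched as follows. This is an illustration under stated assumptions rather than the plugin's code: the helper names are hypothetical, and the Libfabric version is passed in as plain integers so the example builds without Libfabric headers (a real implementation would presumably derive them from `fi_version()`).

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/*
 * Hypothetical helper: pick the fork-safety variable for a given
 * Libfabric version. FI_EFA_FORK_SAFE is used from v1.13.0 onward;
 * older versions fall back to RDMAV_FORK_SAFE.
 */
static const char *select_fork_safe_var(unsigned major, unsigned minor)
{
    if (major > 1 || (major == 1 && minor >= 13)) {
        return "FI_EFA_FORK_SAFE";
    }
    return "RDMAV_FORK_SAFE";
}

/*
 * Hypothetical helper: export the selected variable with value "1".
 * overwrite = 0 keeps any value the user has already set.
 * Returns 0 on success, -1 if setenv() failed.
 */
static int enable_fork_safety(unsigned major, unsigned minor)
{
    return setenv(select_fork_safe_var(major, minor), "1", 0);
}
```

The variable has to be exported before the provider is initialized, so a call such as `enable_fork_safety(1, 13)` would belong at the very start of plugin initialization.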
Bug Fixes:
- Ensure that the buffer used for flush is page aligned and allocated with `mmap` instead of `malloc`. This change is needed to correctly support `fork()` with `MADV_DONTFORK` (#77).
- Fix crash when used with a GDR-capable provider that does not require memory registration (#81).
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
This release requires Libfabric v1.11.0 or later and supports NCCL v2.11.4 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.13.2.
New Features:
- Print version during plugin initialization
Bug Fixes:
- Print correct error code when failing to register a memory region
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
This release requires Libfabric v1.11.0 or later and supports NCCL v2.9.9 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8). It was tested with Libfabric versions up to Libfabric v1.12.1.
Ubuntu 16.04 has reached end-of-life and is no longer supported starting with this release.
Bug Fixes:
- Fix bootstrap crash with NCCL 2.9.6 on P4D instances (AWS release only).
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
This release requires Libfabric v1.11.0 and supports NCCL v2.8.4 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8).
It introduces the following new features and bug fixes.
New Features:
- Add support for NCCL Net v4 API
Bug Fixes:
- Handle `flush` disable configuration
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
There was no general 1.1.1 release; it was limited to an AWS release. This release requires Libfabric v1.11.0 and supports NCCL v2.7.8 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8).
It introduces the following new features and bug fixes.
New Features:
- Injects a static topology into NCCL for P4d hardware
- Use EFA provider supplied speed for EFA hardware.
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
This release requires Libfabric v1.11.0 and supports NCCL v2.7.8 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8).
It introduces the following new features and bug fixes.
New Features:
- Detect and support multi-NIC environment
- Support GPUDirect RDMA when libfabric providers support it
- Add `flush` API support for transfers using CUDA buffers
Bug Fixes:
- Enable `RDMAV_FORK_SAFE` environment variable
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
This release supports NCCL v2.6.4 while maintaining backward compatibility with older NCCL versions (up to NCCL v2.4.8).
It also includes bug fixes and testing enhancements.
New Features:
- Support NCCL v2.6.4
- Add validation of memory registration APIs and getProperties API in tests.
Bug Fixes:
- Use fid_mr for memory handle
- Support disabling trace messages
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
This release requires Libfabric v1.9.x and supports NCCL v2.5.6. It removes the FI_AV_TABLE requirement from libfabric providers and provides several bug fixes, including fixes for overflow issues and memory leaks and the addition of completion checks for connection establishment APIs.
New Features:
- Support NCCL v2.5.6 and require Libfabric v1.9.x
Bug Fixes:
- Remove FI_AV_TABLE requirement.
- Fix missing completion check for connect API.
- Fix resource and memory leaks.
Testing: The plugin has been tested with the following libfabric providers using the unit tests bundled in the source code:
- tcp;ofi_rxm
- sockets
- efa
This release introduces changes required to support NCCL v2.4 and fixes a race condition during connection establishment by removing the FI_SOURCE requirement.
New Features:
- Support NCCL provided MR register/deregister APIs.
Bug Fixes:
- Remove FI_SOURCE requirement for providers.
- Fix Travis CI to build with NCCL v2.4.
Testing: The plugin has been tested with the following libfabric providers:
- tcp;ofi_rxm
- sockets
- verbs;ofi_rxm
This release makes improvements to the building and CI infrastructure. It also includes several bug fixes. Details below:
New Features:
- Change build system to use autoconf, automake and libtool
- Add support for continuous integration using Travis CI
- Add official support for libfabric v1.7.x
Bug Fixes:
- Remove hard-coded CUDA path when linking test binaries.
- Provide request contexts to all libfabric send/recv calls
- Readme updates and other minor fixes
Testing: The plugin has been tested with the following libfabric providers:
- tcp;ofi_rxm
- sockets
- verbs;ofi_rxm
- psm2
- efa;ofi_rxr
First public commit as part of preview announcement
AWS OFI NCCL supports NCCL v2.3.7+ and requires libfabric v1.6.x+. Please note that the current master branch of libfabric is broken for rxm providers and would require PR-4641.
The plugin has been tested with the following libfabric providers:
- tcp;ofi_rxm
- sockets
- verbs;ofi_rxm