Update template pytorch dockerfile with cuda sync memops stanza
Verdi March committed Apr 12, 2024
1 parent 228c59c commit 5086512
Showing 1 changed file with 5 additions and 1 deletion.
@@ -23,11 +23,15 @@ FROM nvcr.io/nvidia/pytorch:23.12-py3
 ENV DEBIAN_FRONTEND=noninteractive

 # The three must-be-built packages.
-# Efa-installer>=1.29.0 required for nccl>=2.19.0 to avoid libfabric NCCL error.
+# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
 ENV EFA_INSTALLER_VERSION=1.30.0
 ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
 ENV NCCL_TESTS_VERSION=master

+## Uncomment below when this Dockerfile builds a container image with efa-installer<1.29.1 and
+# nccl>=2.19.0. See https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/efa-cheatsheet.md
+#ENV FI_EFA_SET_CUDA_SYNC_MEMOPS=0
+
 RUN apt-get update -y
 RUN apt-get remove -y --allow-change-held-packages \
     libmlx5-1 ibverbs-utils libibverbs-dev libibverbs1
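
Note: the condition in the commented-out stanza (efa-installer<1.29.1 combined with nccl>=2.19.0) can be checked mechanically before deciding whether to uncomment `FI_EFA_SET_CUDA_SYNC_MEMOPS=0`. The following is a minimal sketch, not part of this commit. The helper names `parse_version` and `needs_sync_memops_override` are hypothetical, and it assumes plain dotted numeric version strings such as `1.30.0` (suffixed versions like `1.8.1-aws` are not handled).

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.29.1' into a tuple (1, 29, 1).

    Assumption: the string contains only integers separated by dots.
    """
    return tuple(int(part) for part in text.split("."))


def needs_sync_memops_override(efa_installer_version, nccl_version):
    """Return True when the Dockerfile comment says the ENV line should be uncommented.

    Mirrors the stanza's condition: efa-installer < 1.29.1 and nccl >= 2.19.0.
    """
    efa_too_old = parse_version(efa_installer_version) < (1, 29, 1)
    nccl_affected = parse_version(nccl_version) >= (2, 19, 0)
    return efa_too_old and nccl_affected


if __name__ == "__main__":
    # The template pins EFA_INSTALLER_VERSION=1.30.0, so no override is needed
    # regardless of the NCCL version; 2.19.3 here is only an illustrative value.
    print(needs_sync_memops_override("1.30.0", "2.19.3"))
```

With the version pinned in this template (1.30.0) the check returns False, which is consistent with the stanza being left commented out by default.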