Add instructions to build a docker for GraphStorm-wholegraph on AWS #475
…nstances with efa support
RUN mkdir -p ${SSHDIR}
RUN ssh-keygen -t rsa -f ${SSHDIR}/id_rsa -N ''
RUN cp ${SSHDIR}/id_rsa.pub ${SSHDIR}/authorized_keys
We need to modify /root/.ssh/config too.
RUN touch /root/.ssh/config;echo -e "Host *\n StrictHostKeyChecking no\n UserKnownHostsFile=/dev/null\n Port ${SSH_PORT}" > /root/.ssh/config
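One caveat with the line above: if the image's `/bin/sh` is dash (the Ubuntu default), `echo -e` may write a literal `-e` into the file. A minimal sketch of the same config written with `printf`, which handles `\n` consistently across POSIX shells, is below. The fallback port `2022` and the temporary-directory fallback for `SSHDIR` are assumptions added so the snippet can be tried outside the image; in the Dockerfile both would come from its own `ARG`/`ENV` values.

```shell
# Hypothetical defaults, for trying the snippet outside the image only.
SSH_PORT="${SSH_PORT:-2022}"
SSHDIR="${SSHDIR:-$(mktemp -d)}"

mkdir -p "${SSHDIR}"

# printf expands \n in its format string in every POSIX shell,
# so the result does not depend on how the shell implements echo.
printf 'Host *\n  StrictHostKeyChecking no\n  UserKnownHostsFile=/dev/null\n  Port %s\n' \
    "${SSH_PORT}" > "${SSHDIR}/config"

# Keep the config readable and writable by the owner only.
chmod 600 "${SSHDIR}/config"

cat "${SSHDIR}/config"
```

In the Dockerfile this would collapse into a single `RUN printf ... > ${SSHDIR}/config` line; the `touch` is not needed because the redirect creates the file.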
&& make && make install

ENV PATH "/opt/amazon/efa/bin:$PATH"
We need to install the NCCL tests (Step 8) to verify the EFA+NCCL setup.
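A rough sketch of what building the NCCL tests could look like, assuming the upstream `NVIDIA/nccl-tests` repository and its `MPI=1`, `MPI_HOME`, and `CUDA_HOME` make variables. The install paths are assumptions (Open MPI from the EFA installer under `/opt/amazon/openmpi`, CUDA under `/usr/local/cuda`), and `NCCL_TESTS_DIR`, `DRY_RUN`, `run`, and `build_nccl_tests` are hypothetical names used only here. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Sketch only: prints the build commands unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Assumed install locations; verify them inside the image.
MPI_HOME="${MPI_HOME:-/opt/amazon/openmpi}"
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
NCCL_TESTS_DIR="${NCCL_TESTS_DIR:-/opt/nccl-tests}"

run() {
    # Echo the command in dry-run mode, execute it otherwise.
    if [ "${DRY_RUN}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_nccl_tests() {
    # Fetch the test sources, then build them with MPI support.
    run git clone https://github.com/NVIDIA/nccl-tests.git "${NCCL_TESTS_DIR}"
    run make -C "${NCCL_TESTS_DIR}" MPI=1 "MPI_HOME=${MPI_HOME}" "CUDA_HOME=${CUDA_HOME}"
}

build_nccl_tests
```

No `NCCL_HOME` is passed because the `.deb` packages in this Dockerfile install NCCL into the system include and library paths; a source build of NCCL would likely need `NCCL_HOME` pointed at its build directory.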
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libnccl2_2.15.1-1+cuda11.8_amd64.deb
RUN dpkg -i libnccl2_2.15.1-1+cuda11.8_amd64.deb
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libnccl-dev_2.15.1-1+cuda11.8_amd64.deb
RUN dpkg -i libnccl-dev_2.15.1-1+cuda11.8_amd64.deb
We can also follow step 6 from the EC2 doc to install NCCL.
ENV dev_type=GPU
# Install DGL GPU version
RUN pip3 install dgl==1.0.4+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html && rm -rf /root/.cache
We need to fix the Python installation. By default, the Docker image uses conda.
Not sure if this is related to the Python installation, but I had to add:
pip install --no-cache-dir boto3 'h5py>=2.10.0' scipy tqdm 'pyarrow>=3' 'transformers==4.28.1' pandas pylint scikit-learn ogb psutil
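To see whether `pip` and `pip3` in the image resolve to conda or to the system Python, a quick check along these lines may help. It is only a diagnostic sketch; `report_python_tools` is a hypothetical helper name, and the output depends entirely on the image's `PATH`.

```shell
report_python_tools() {
    # Print the first match on PATH for each tool, or "not found".
    for tool in python3 pip3 pip conda; do
        path="$(command -v "$tool" 2>/dev/null || true)"
        echo "$tool -> ${path:-not found}"
    done
}

report_python_tools

# If python3 exists, show which interpreter its pip module belongs to.
if command -v python3 >/dev/null 2>&1; then
    python3 -m pip --version || true
fi
```

If `pip3` and `pip` point at different interpreters, packages installed with one will not be visible to the other, which could explain why the extra `pip install` was needed.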
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.