Issue for the new NGC images #40

Open
PhdShi opened this issue Jan 5, 2023 · 4 comments

PhdShi commented Jan 5, 2023

Hi! I was recently looking at the NGC images site and noticed this note:

Starting with the 22.11 PyTorch NGC container, miniforge is removed and all Python packages are installed 
in the default Python environment. In case you depend on Conda-specific packages, which might not be 
available on PyPI, we recommend building these packages from source. A workaround is to manually install 
a Conda package manager, and add the conda path to your PYTHONPATH for example, using export 
PYTHONPATH="/opt/conda/lib/python3.8/site-packages" if your Conda package manager was installed in 
/opt/conda.

It seems that the NGC images will no longer provide a Conda environment, and that the PyTorch-related files have moved to the default Python environment. When I `docker run` one of the new images, such as nvcr.io/nvidia/pytorch:22.11-py3, I find no c10d-related header files in the Python environment under /usr/local/lib/python3.8/dist-packages/torch/include. However, ProcessCCL.hpp needs the header <torch/csrc/distributed/c10d/Utils.hpp>.
How can we solve this so that torch-ccl can be used in the latest NGC image?
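
For anyone who wants to confirm the same symptom in their own container, here is a minimal sketch that checks whether the header named above is present under the include directory. The helper name `missing_headers` is made up for this illustration; the two paths are the ones quoted in this report and may differ in other images.

```python
from pathlib import Path


def missing_headers(include_dir, headers):
    """Return the relative header paths that do not exist under include_dir."""
    root = Path(include_dir)
    return [h for h in headers if not (root / h).is_file()]


if __name__ == "__main__":
    # Both values are taken from the report above; adjust them for your image.
    include_dir = "/usr/local/lib/python3.8/dist-packages/torch/include"
    required = ["torch/csrc/distributed/c10d/Utils.hpp"]
    for header in missing_headers(include_dir, required):
        print(f"missing: {header}")
```

If the script prints a `missing:` line, the torch-ccl build will most likely fail at the corresponding `#include`.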

Contributor

liangan1 commented Jan 5, 2023

Which PyTorch and torch-ccl versions are you using?

Author

PhdShi commented Jan 5, 2023

> Which PyTorch and torch-ccl versions are you using?

NGC image: nvcr.io/nvidia/pytorch:22.11-py3
PyTorch version: 1.13.0a0+936e930
torch-ccl: 1.13

Contributor

liangan1 commented Jan 9, 2023

It seems that your codebase is older than the 1.13.0 tag. PyTorch changed the c10d distributed path in pytorch/pytorch#85780, so you have two options to fix this issue:

  1. Use the 1.13.0 release code.
  2. Try the torch-ccl-1.12.100 release.
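
One way to read the "older than the 1.13.0 tag" remark: the reported version string `1.13.0a0+936e930` carries an `a0` pre-release suffix, and under PEP 440 a pre-release sorts before the final release with the same number. The sketch below illustrates that ordering with a deliberately simplified, stdlib-only parser. The helpers `split_version` and `predates_release` are hypothetical names for this example, not part of PyTorch or torch-ccl, and they handle only the `a`/`b`/`rc` and `+local` forms seen here.

```python
import re

# Simplified pattern: release digits, optional a/b/rc pre-release, optional +local tag.
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)((?:a|b|rc)\d+)?(?:\+.*)?$")


def split_version(version):
    """Split a version string into (release tuple, pre-release tag or None)."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    return release, match.group(2)


def predates_release(version, release):
    """Return True if `version` sorts before the final `release`."""
    ver, pre = split_version(version)
    target, _ = split_version(release)
    # Pad with zeros so that 1.13 and 1.13.0 compare as equal.
    width = max(len(ver), len(target))
    ver += (0,) * (width - len(ver))
    target += (0,) * (width - len(target))
    if ver != target:
        return ver < target
    # Same release number: only a pre-release comes before the final release.
    return pre is not None


print(predates_release("1.13.0a0+936e930", "1.13.0"))  # True
```

For real tooling, a full PEP 440 implementation is a safer choice than this sketch.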

Author

PhdShi commented Jan 9, 2023

> It seems that your codebase is older than the 1.13.0 tag. PyTorch changed the c10d distributed path in pytorch/pytorch#85780, so you have two options to fix this issue:
>
>   1. Use the 1.13.0 release code.
>   2. Try the torch-ccl-1.12.100 release.

Thanks for your reply! The first option solved my problem. The second option didn't work, but that is not the fault of torch-ccl or PyTorch. What I mean is that the prebuilt PyTorch shipped in the NGC image no longer contains the C++ header files, so I had to recompile PyTorch before torch-ccl would compile correctly.
