
Environment build failed on AWS EC2 g4dn instance with OS image (Deep Learning AMI, ami-0184e674549ab8432) #24

Open
zarzen opened this issue Jun 8, 2022 · 4 comments

Comments

zarzen commented Jun 8, 2022

Hi there,

Thanks for sharing the framework! I am trying to experiment with the codebase, but I am not able to create a suitable environment on AWS. I am using the Deep Learning AMI (Ubuntu 18.04) Version 60.4 as the instance image; the corresponding AMI is ami-0184e674549ab8432.

When I run ./create-grace-env-tf1.15.sh from the root folder of the project, the installation of horovod raises the following error. The installation process doesn't stop, however, and horovod-0.21.0 does end up installed in the conda environment, but it is not configured correctly.

 make[1]: Leaving directory '/tmp/pip-install-hopus90d/horovod/build/temp.linux-x86_64-cpython-37'
  Makefile:146: recipe for target 'all' failed
  make: *** [all] Error 2
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 193, in <module>
      'horovodrun = horovod.runner.launch:run_commandline'
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
      return run_commands(dist)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
      dist.run_commands()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
      self.run_command(cmd)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 136, in run
      self.run_command(cmd_name)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 91, in build_extensions
      cwd=self.build_temp)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/subprocess.py", line 363, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'RelWithDebInfo', '--', '-j8', 'VERBOSE=1']' returned non-zero exit status 2.
  ----------------------------------------
  ERROR: Failed building wheel for horovod
  Running setup.py clean for horovod
Failed to build horovod
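One way to turn this silent partial install into a loud failure (a sketch based on Horovod's documented build flags, not this repo's script; the version pin matches the 0.21.0 seen above) is to reinstall with the framework extensions marked as required, so a missing build dependency aborts the install instead of skipping the extension:

```shell
# Remove the partially built wheel first.
pip uninstall -y horovod

# HOROVOD_WITH_* makes the build FAIL if that extension cannot be built,
# instead of quietly producing a wheel without it.
HOROVOD_WITH_TENSORFLOW=1 \
HOROVOD_WITH_PYTORCH=1 \
HOROVOD_GPU_OPERATIONS=NCCL \
pip install --no-cache-dir horovod==0.21.0
```

With these flags, the CMake error that caused the PyTorch extension to be dropped surfaces directly in the pip output, which makes the root cause much easier to find.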

Running horovodrun -cb gives the following output, which indicates the PyTorch extension is not enabled.

(/home/ubuntu/grace/env-tf1.15) ubuntu@ip-172-31-82-84:~/grace$ horovodrun -cb
Horovod v0.21.0:

Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo

Besides, even though the TensorFlow extension for Horovod appears to be enabled, actual training shows it isn't working properly: when running tensorflow_mnist.py with horovodrun, only the first GPU does any computation, so distributed training isn't working.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   45C    P0    27W /  70W |    390MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   37C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   37C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10049      C   python                             97MiB |
|    0   N/A  N/A     10050      C   python                             97MiB |
|    0   N/A  N/A     10051      C   python                             97MiB |
|    0   N/A  N/A     10052      C   python                             97MiB |
+-----------------------------------------------------------------------------+
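For reference, Horovod's TF1 examples pin each worker to the GPU matching its local rank; if that pinning is absent, or the build lacks GPU support, all ranks can pile onto GPU 0 as shown above. A minimal sketch of the rank-to-device mapping (the hvd/tf calls are shown as comments, since they require a working Horovod build):

```python
def gpu_for_rank(local_rank: int) -> str:
    """Map a Horovod local rank to the single GPU that process should see."""
    return str(local_rank)

# In a TF1 + Horovod script, the usual pattern is:
#   import horovod.tensorflow as hvd
#   import tensorflow as tf
#   hvd.init()
#   config = tf.ConfigProto()
#   config.gpu_options.visible_device_list = gpu_for_rank(hvd.local_rank())
# Every session then uses `config`, so rank 1 runs on GPU 1, rank 2 on GPU 2, etc.

print(gpu_for_rank(1))  # each of the 4 local ranks gets its own device string
```

If all four processes still land on GPU 0 despite this pinning, that points back at the Horovod build itself rather than the training script.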

Any suggestions? It would be great if you could share a Docker environment or a more detailed system configuration.

mcanini commented Jun 8, 2022

We just pushed a fix for a similar issue to what you described. Can you try again with 95a9b6c?


zarzen commented Jun 8, 2022

Hi @mcanini,
I just checked the scripts I am using, and they include your recent change. One thing I noticed, though: the protobuf installed by conda install -y protobuf is version 4.x, which conflicts with tensorflow-1.15. I modified the installation script to use conda install -y protobuf=3.20 instead.
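For anyone hitting the same conflict, the pin plus a quick sanity check might look like this (assuming the grace conda environment is already active):

```shell
# Pin protobuf below 4.x, whose API changes break tensorflow-1.15 at import time.
conda install -y protobuf=3.20

# Confirm the installed version and that tensorflow imports cleanly.
python -c "import google.protobuf; print(google.protobuf.__version__)"
python -c "import tensorflow"
```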


mcanini commented Jun 8, 2022

Great, thanks. Did that work in the end?


zarzen commented Jun 8, 2022

No. That change only fixes the crash of tensorflow-1.15 at the import stage. Even so, neither tensorflow nor pytorch works with horovod.
