
Environment build failed on AWS EC2 g4dn instance with OS image (Deep Learning AMI, ami-0184e674549ab8432) #24

Open
zarzen opened this issue Jun 8, 2022 · 4 comments

Comments

zarzen commented Jun 8, 2022

Hi there,

Thanks for sharing the framework! I am trying to experiment with the codebase, but I am not able to create a suitable environment on AWS. I am using the Deep Learning AMI (Ubuntu 18.04) Version 60.4 as the instance image; the corresponding AMI is ami-0184e674549ab8432.

When I run ./create-grace-env-tf1.15.sh from the root folder of the project, the installation of horovod raises the following error. The installation process doesn't stop, however, and horovod-0.21.0 does end up installed in the conda environment, but it is not configured correctly.

 make[1]: Leaving directory '/tmp/pip-install-hopus90d/horovod/build/temp.linux-x86_64-cpython-37'
  Makefile:146: recipe for target 'all' failed
  make: *** [all] Error 2
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 193, in <module>
      'horovodrun = horovod.runner.launch:run_commandline'
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
      return run_commands(dist)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
      dist.run_commands()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
      self.run_command(cmd)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 136, in run
      self.run_command(cmd_name)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 91, in build_extensions
      cwd=self.build_temp)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/subprocess.py", line 363, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'RelWithDebInfo', '--', '-j8', 'VERBOSE=1']' returned non-zero exit status 2.
  ----------------------------------------
  ERROR: Failed building wheel for horovod
  Running setup.py clean for horovod
Failed to build horovod
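One way to turn this silent partial install into a loud failure (a sketch based on Horovod's documented build flags, not this repo's script; the version pin matches the 0.21.0 seen above) is to reinstall with the framework extensions marked as required, so a missing build dependency aborts the install instead of skipping the extension:

```shell
# Remove the partially built wheel first.
pip uninstall -y horovod

# HOROVOD_WITH_* makes the build FAIL if that extension cannot be built,
# instead of quietly producing a wheel without it.
HOROVOD_WITH_TENSORFLOW=1 \
HOROVOD_WITH_PYTORCH=1 \
HOROVOD_GPU_OPERATIONS=NCCL \
pip install --no-cache-dir horovod==0.21.0
```

With these flags, the CMake error that caused the PyTorch extension to be dropped surfaces directly in the pip output, which makes the root cause much easier to find.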

Running horovodrun -cb gives the following output, which indicates the PyTorch extension is not enabled.

(/home/ubuntu/grace/env-tf1.15) ubuntu@ip-172-31-82-84:~/grace$ horovodrun -cb
Horovod v0.21.0:

Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo

Besides, even though the TensorFlow extension for Horovod appears to be enabled, actual training shows it isn't working properly: when running tensorflow_mnist.py with horovodrun, only the first GPU does any computation, so distributed training isn't working.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   45C    P0    27W /  70W |    390MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   37C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   37C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10049      C   python                             97MiB |
|    0   N/A  N/A     10050      C   python                             97MiB |
|    0   N/A  N/A     10051      C   python                             97MiB |
|    0   N/A  N/A     10052      C   python                             97MiB |
+-----------------------------------------------------------------------------+
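For reference, Horovod's TF1 examples pin each worker to the GPU matching its local rank; if that pinning is absent, or the build lacks GPU support, all ranks can pile onto GPU 0 as shown above. A minimal sketch of the rank-to-device mapping (the hvd/tf calls are shown as comments, since they require a working Horovod build):

```python
def gpu_for_rank(local_rank: int) -> str:
    """Map a Horovod local rank to the single GPU that process should see."""
    return str(local_rank)

# In a TF1 + Horovod script, the usual pattern is:
#   import horovod.tensorflow as hvd
#   import tensorflow as tf
#   hvd.init()
#   config = tf.ConfigProto()
#   config.gpu_options.visible_device_list = gpu_for_rank(hvd.local_rank())
# Every session then uses `config`, so rank 1 runs on GPU 1, rank 2 on GPU 2, etc.

print(gpu_for_rank(1))  # each of the 4 local ranks gets its own device string
```

If all four processes still land on GPU 0 despite this pinning, that points back at the Horovod build itself rather than the training script.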

Any suggestions? It would be great if you could share a Docker environment or a more detailed system configuration.

mcanini commented Jun 8, 2022

We just pushed a fix for a similar issue to what you described. Can you try again with 95a9b6c?


zarzen commented Jun 8, 2022

Hi @mcanini,
I just checked the scripts I am using, and they include your recent change. One thing I noticed, though: the protobuf installed by conda install -y protobuf is version 4.x, which conflicts with tensorflow-1.15. I modified the installation script to use conda install -y protobuf=3.20 instead.
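For anyone hitting the same conflict, the pin plus a quick sanity check might look like this (assuming the grace conda environment is already active):

```shell
# Pin protobuf below 4.x, whose API changes break tensorflow-1.15 at import time.
conda install -y protobuf=3.20

# Confirm the installed version and that tensorflow imports cleanly.
python -c "import google.protobuf; print(google.protobuf.__version__)"
python -c "import tensorflow"
```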


mcanini commented Jun 8, 2022

Great, thanks. Did that work in the end?


zarzen commented Jun 8, 2022

No. That change only fixes the crash of tensorflow-1.15 at the import stage. Even so, neither tensorflow nor pytorch works with horovod.
