Docker Image OpenADK with CUDA support has an issue with CUDA installation #4765

Closed
3 tasks done
gitoabdelgawad opened this issue May 22, 2024 · 2 comments
Labels
type:bug Software flaws or errors.

gitoabdelgawad commented May 22, 2024

Checklist

  • I've read the contribution guidelines.
  • I've searched other issues and no duplicate issues were found.
  • I'm convinced that this is not my fault but a bug.

Description

Inside the ghcr.io/autowarefoundation/autoware-openadk:latest-devel-cuda container, I'm trying to use the tensorrt_yolox package. The package includes some CUDA kernels, which are not built; the build shows the following warning:

--- stderr: tensorrt_yolox
CMake Warning at CMakeLists.txt:19 (message):
CUDA is not found. preprocess acceleration using CUDA will not be
available.

It seems that the CMake variable CMAKE_CUDA_COMPILER is not set.
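A quick way to rule out the simplest explanation is to check whether a CUDA compiler binary is present in the container at all. The sketch below is only an illustration: the helper name `find_nvcc`, its search order, and the default prefix `/usr/local/cuda` are assumptions, not something taken from the image or from the tensorrt_yolox build files. It checks only that a binary exists; it does not explain why CMake might still reject it.

```shell
# Hypothetical helper: print the first CUDA compiler found, or return 1.
# Search order (an assumption): $CUDACXX, then PATH, then <cuda_root>/bin/nvcc.
find_nvcc() {
  cuda_root="${1:-/usr/local/cuda}"
  if [ -n "${CUDACXX:-}" ] && [ -x "$CUDACXX" ]; then
    # CUDACXX is the environment variable CMake reads for the CUDA compiler.
    printf '%s\n' "$CUDACXX"
  elif command -v nvcc >/dev/null 2>&1; then
    command -v nvcc
  elif [ -x "$cuda_root/bin/nvcc" ]; then
    printf '%s\n' "$cuda_root/bin/nvcc"
  else
    return 1
  fi
}
```

Running `find_nvcc || echo "no CUDA compiler found"` inside the container shows which case applies. If a path is printed, it could be passed explicitly with `--cmake-args -DCMAKE_CUDA_COMPILER=<path>`, although that may not be sufficient if other parts of the toolkit are missing.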

Then, when running tensorrt_yolox for object detection, the node crashes with the following error:

[tensorrt_yolox_node_exe-2] /home/os/elm/autoware/install/tensorrt_yolox/lib/tensorrt_yolox/tensorrt_yolox_node_exe: symbol lookup error: /home/os/elm/autoware/install/tensorrt_yolox/lib/libtensorrt_yolox.so: undefined symbol: _ZN14tensorrt_yolox50resize_bilinear_letterbox_nhwc_to_nchw32_batch_gpuEPfPhiiiiiiifP11CUstream_st
[ERROR] [tensorrt_yolox_node_exe-2]: process has died [pid 977, exit code 127, cmd '/home/os/elm/autoware/install/tensorrt_yolox/lib/tensorrt_yolox/tensorrt_yolox_node_exe --ros-args -r __node:=tensorrt_yolox --params-file /tmp/launch_params_d1ll7q3z --params-file /tmp/launch_params_cq_ya7ic -r ~/in/image:=/fr_camera/image_rect -r ~/out/objects:=roi0'].

The missing symbol is actually a CUDA kernel that failed to build previously.
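The same condition can be detected before launching the node by asking the dynamic loader which symbols it cannot resolve. The sketch below assumes GNU `ldd` with the `-r` option and assumes its messages have the form `undefined symbol: <name>`; the helper name `unresolved_symbols` is hypothetical.

```shell
# Hypothetical helper: read `ldd -r` output on stdin and print each symbol
# that the dynamic loader reported as undefined, one per line.
unresolved_symbols() {
  sed -n 's/^[[:space:]]*undefined symbol: \([^[:space:]]*\).*/\1/p'
}
```

With the library path from the log above, usage would look like `ldd -r /home/os/elm/autoware/install/tensorrt_yolox/lib/libtensorrt_yolox.so 2>&1 | unresolved_symbols | c++filt`, where `c++filt` demangles the C++ names. An empty result suggests that every symbol the library needs is provided by one of its dependencies.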

Expected behavior

  1. The Docker OpenADK image should have CUDA support and be able to build tensorrt_yolox properly, including its CUDA kernels. The missing-symbol runtime error would then no longer occur.

Actual behavior

tensorrt_yolox builds with a warning and skips building the CUDA kernels, which leads to a runtime crash later.

Steps to reproduce

Inside the ghcr.io/autowarefoundation/autoware-openadk:latest-devel-cuda container:

  1. source autoware/install/setup.bash
  2. colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release --packages-select tensorrt_yolox
     You should see the CMake warning mentioned above.
  3. ros2 launch tensorrt_yolox yolox_s_plus_opt.launch.xml input/image:=/img output/objects:=/roi0
     This is an example of launching an object detection model. Once you subscribe to the output topic (ros2 topic echo /roi0), you should get the runtime error mentioned above.

Versions

No response

Possible causes

After some investigation, including an attempt to build the official CUDA Samples to track down the issue, it appeared that some CUDA libraries were missing:
/usr/bin/ld: cannot find -lcudadevrt
/usr/bin/ld: cannot find -lcudart_static
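To confirm which of these archives are absent in a given image, a check along the following lines may help. It is a minimal sketch: the helper name `check_cuda_static_libs` is hypothetical, and the default search root `/usr/local` is an assumption about where the CUDA toolkit is installed in the image.

```shell
# Hypothetical helper: report whether the static CUDA archives named in the
# linker errors exist anywhere under a search root (default: /usr/local).
# Returns non-zero if at least one archive is missing.
check_cuda_static_libs() {
  root="${1:-/usr/local}"
  rc=0
  for lib in libcudadevrt.a libcudart_static.a; do
    if [ -n "$(find "$root" -name "$lib" 2>/dev/null)" ]; then
      echo "found: $lib"
    else
      echo "missing: $lib"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling `check_cuda_static_libs` with no argument inside the container searches `/usr/local`; passing `/` widens the search to the whole filesystem at the cost of a slower scan.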

After applying the following patch and rebuilding the Docker image, the CUDA kernels were built and the object detection model ran well.

From 52d5e470d616118d0089e1ff25e5c8016a95450b Mon Sep 17 00:00:00 2001
From: Osama Abdelgawad <[email protected]>
Date: Wed, 22 May 2024 16:01:59 +0200
Subject: [PATCH] docker change

---
 docker/autoware-openadk/Dockerfile | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docker/autoware-openadk/Dockerfile b/docker/autoware-openadk/Dockerfile
index 23d260f0..320262ff 100644
--- a/docker/autoware-openadk/Dockerfile
+++ b/docker/autoware-openadk/Dockerfile
@@ -88,9 +88,7 @@ ENV CXX="/usr/lib/ccache/g++"
 RUN --mount=type=ssh \
   ./setup-dev-env.sh -y --module all ${SETUP_ARGS} --no-cuda-drivers openadk \
   && pip uninstall -y ansible ansible-core \
-  && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/* "$HOME"/.cache \
-  && find / -name 'libcu*.a' -delete \
-  && find / -name 'libnv*.a' -delete
+  && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/* "$HOME"/.cache
 
 # Install rosdep dependencies
 COPY --from=src-imported /autoware/src /autoware/src
-- 
2.34.1

Additional context

No response

oguzkaganozt added a commit that referenced this issue May 23, 2024
Signed-off-by: Oguz Ozturk <[email protected]>
@idorobotics idorobotics added the type:bug Software flaws or errors. label May 23, 2024
oguzkaganozt added a commit that referenced this issue May 27, 2024
Signed-off-by: Oguz Ozturk <[email protected]>
oguzkaganozt added a commit that referenced this issue May 27, 2024
Signed-off-by: Oguz Ozturk <[email protected]>
oguzkaganozt added a commit that referenced this issue May 28, 2024
Signed-off-by: Oguz Ozturk <[email protected]>
xmfcx (Contributor) commented Jun 19, 2024

@gitoabdelgawad @oguzkaganozt does/did this PR fix this issue?

* [feat(docker): fix CUDA compile on devel image and improve run.sh #4849](https://github.com/autowarefoundation/autoware/pull/4849)

gitoabdelgawad (Author) replied:

Yes, this PR fixes the issue. Thanks, I will close the issue.
