
Update Dockerfile to use devel image for compatibility #2848

Open · wants to merge 5 commits into main
Conversation

@YaserJaradeh commented Dec 16, 2024

What does this PR do?

The TGI server fails to start due to missing Python headers during the compilation of Triton indexing kernels. The solution is to change the base image to nvidia/cuda:12.4.1-devel-ubuntu22.04 to match the builder image, ensuring the necessary headers are included.
This change increases the image size but resolves the startup issue.

Fixes #2838
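
For illustration, here is a quick way to see what the devel tag adds over the slimmer CUDA images (a sketch: only the devel tag is confirmed by this PR; the 12.4.1-base tag as the previous runtime image is an assumption):

$ for img in nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia/cuda:12.4.1-devel-ubuntu22.04; do
>     echo "== $img =="
>     # devel ships gcc and the CUDA headers; base does not
>     docker run --rm --entrypoint sh "$img" -c 'which gcc; ls /usr/local/cuda/include/cuda.h' 2>&1
> done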

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@YaserJaradeh (Author)

@Narsil

@KreshLaDoge

This comment was marked as outdated.

@scriptator

This PR solves my issue described here, thank you!

@scriptator

I just noticed that you changed the implementation between me building the image and my previous message. Should I test again?

@YaserJaradeh (Author)

> I just noticed that you changed the implementation between me building the image and my previous message. Should I test again?

Not yet, it's still broken! I just pushed to try building it on my server, but so far it is not working. Will ping you when it is 💯

@YaserJaradeh (Author) commented Jan 13, 2025

@KreshLaDoge I tried multiple variants of using only python3.11-dev, but that didn't work! I also tried copying the headers and libraries from the PyTorch build stage into the final image, and that didn't work either. Furthermore, I tried a combination of python3.11-dev, the CUDA command-line tools, and build-essential, and I still wasn't able to get it working!

Any ideas about how to get it to work, or reduce the size of the final image?
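
For reference, a sketch of one of the attempted slimmer variants described above, as it might appear in a Dockerfile RUN step (package names are assumptions inferred from the comment; this combination reportedly still failed):

$ apt-get update && apt-get install -y --no-install-recommends \
>     python3.11-dev build-essential cuda-command-line-tools-12-4 \
>     && rm -rf /var/lib/apt/lists/*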

@scriptator

Do you have a way of reproducing the error more quickly than building the entire image from scratch every time? And do you manage to get to the exact compiler error? If you could share that, I might take a look.
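
One possible quicker repro (a sketch; assumptions: the published TGI image name, and that torch and triton are importable in the image's Python environment) is to force a Triton JIT compile directly inside the already-built image instead of rebuilding from scratch:

$ docker run --rm -i --gpus all --entrypoint python3 \
>     ghcr.io/huggingface/text-generation-inference:latest - <<'EOF'
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Minimal vector add; the first launch triggers a JIT compile, which is
    # where the gcc / missing-header failure should surface.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs,
             tl.load(x_ptr + offs, mask=mask) + tl.load(y_ptr + offs, mask=mask),
             mask=mask)

n = 1024
x = torch.ones(n, device="cuda")
y = torch.ones(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 256),)](x, y, out, n, BLOCK=256)
print("triton compile OK:", bool((out == 2).all()))
EOF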

@Narsil (Collaborator) commented Jan 15, 2025

As previously suggested, the fix cannot be accepted as-is. It bloats the image way too much (20GB vs 12GB).

First we need to reproduce locally, then figure out why the hell Triton wants to recompile something (not a kernel, obviously, since it tries gcc).

@danieldk (Member) commented Jan 24, 2025

This would be a workaround, but it does not solve the underlying bug. If you look at #2838, -L/usr/local/nvidia/lib (or another libcuda library path; this probably depends on CDI/the container toolkit) is missing from the gcc invocation, whereas it is passed on a functioning Docker/Podman system.

Triton finds the directory with libcuda.so by inspecting the output of ldconfig -p, so my suspicion is that your version of Podman somehow interferes with the dynamic linker. It would be helpful if you could run the docker image with --entrypoint /bin/sh and provide the output of the following commands inside the container (example output is from a working system, ignore the Nix store paths, I think it's injected by the CDI interface):

$ cat /etc/ld.so.conf.d/*.conf
/usr/local/cuda/targets/x86_64-linux/lib
/usr/local/cuda-12/targets/x86_64-linux/lib
# libc default configuration
/usr/local/lib
/nix/store/m5g35gc5z6gjqgb37jn7v5qk6h00b3fc-nvidia-x11-550.142-6.6.71/lib
/usr/local/nvidia/lib
/usr/local/nvidia/lib64
# Multiarch support
/usr/local/lib/x86_64-linux-gnu
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu

$ ldconfig -p | grep cuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /nix/store/m5g35gc5z6gjqgb37jn7v5qk6h00b3fc-nvidia-x11-550.142-6.6.71/lib/libcudadebugger.so.1
        libcudadebugger.so.1 (libc6,x86-64) => /usr/local/nvidia/lib/libcudadebugger.so.1
        libcudadebugger.so (libc6,x86-64) => /usr/local/nvidia/lib/libcudadebugger.so
        libcuda.so.1 (libc6,x86-64) => /nix/store/m5g35gc5z6gjqgb37jn7v5qk6h00b3fc-nvidia-x11-550.142-6.6.71/lib/libcuda.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/local/nvidia/lib/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/nvidia/lib/libcuda.so
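
A rough shell approximation of that lookup, assuming Triton simply parses the ldconfig -p output and takes the directories of the libcuda.so entries:

$ ldconfig -p | awk '/libcuda\.so/ { print $NF }' | xargs -r -n1 dirname | sort -u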

@danieldk (Member)

Besides the above, it would also be useful to post the contents of the following files on the host system:

/var/run/cdi/nvidia-container-toolkit.json

/etc/cdi/nvidia-container-toolkit.json

(Or if they don't exist, check in these directories for a JSON file for nvidia.)
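
If the exact filenames differ, something like this should surface any nvidia CDI spec on the host (paths taken from the comment above):

$ ls -l /var/run/cdi /etc/cdi 2>/dev/null
$ grep -l nvidia /var/run/cdi/*.json /etc/cdi/*.json 2>/dev/null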
