
Update Dockerfile to use devel image for compatibility #2848

Open · wants to merge 5 commits into main
Conversation

@YaserJaradeh commented Dec 16, 2024

What does this PR do?

The TGI server fails to start due to missing Python headers during the compilation of Triton indexing kernels. The solution is to change the base image to nvidia/cuda:12.4.1-devel-ubuntu22.04 to match the builder image, ensuring the necessary headers are included.
This change increases the image size but resolves the startup issue.

Fixes #2838
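
For illustration, here is a quick way to see what the devel tag adds over the slimmer CUDA images (a sketch: only the devel tag is confirmed by this PR; the 12.4.1-base tag as the previous runtime image is an assumption):

$ for img in nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia/cuda:12.4.1-devel-ubuntu22.04; do
>     echo "== $img =="
>     # devel ships gcc and the CUDA headers; base does not
>     docker run --rm --entrypoint sh "$img" -c 'which gcc; ls /usr/local/cuda/include/cuda.h' 2>&1
> done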

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@YaserJaradeh (Author)

@Narsil

@KreshLaDoge

This comment was marked as outdated.

@scriptator

This PR solves my issue described here, thank you!

@scriptator

I just noticed that you changed the implementation between me building the image and my previous message. Should I test again?

@YaserJaradeh (Author)

> I just noticed that you changed the implementation between me building the image and my previous message. Should I test again?

Not yet, it's still broken! I just pushed to try building it on my server, but so far it is not working. Will ping you when it is 💯

@YaserJaradeh (Author) commented Jan 13, 2025

@KreshLaDoge I tried multiple variants of using only python3.11-dev, but that didn't work! I also tried copying the headers and libraries from the PyTorch build stage into the final image, and that didn't work either. Furthermore, I tried a combination of python3.11-dev, the CUDA command-line tools, and build-essential, and I still wasn't able to get it working!

Any ideas about how to get it to work, or reduce the size of the final image?
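
For reference, a sketch of one of the attempted slimmer variants described above, as it might appear in a Dockerfile RUN step (package names are assumptions inferred from the comment; this combination reportedly still failed):

$ apt-get update && apt-get install -y --no-install-recommends \
>     python3.11-dev build-essential cuda-command-line-tools-12-4 \
>     && rm -rf /var/lib/apt/lists/*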

@scriptator

Do you have a way of reproducing the error more quickly than building the entire image from scratch every time? And do you manage to get to the exact compiler error? If you could share that, I might take a look.
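
One possible quicker repro (a sketch; assumptions: the published TGI image name, and that torch and triton are importable in the image's Python environment) is to force a Triton JIT compile directly inside the already-built image instead of rebuilding from scratch:

$ docker run --rm -i --gpus all --entrypoint python3 \
>     ghcr.io/huggingface/text-generation-inference:latest - <<'EOF'
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Minimal vector add; the first launch triggers a JIT compile, which is
    # where the gcc / missing-header failure should surface.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs,
             tl.load(x_ptr + offs, mask=mask) + tl.load(y_ptr + offs, mask=mask),
             mask=mask)

n = 1024
x = torch.ones(n, device="cuda")
y = torch.ones(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 256),)](x, y, out, n, BLOCK=256)
print("triton compile OK:", bool((out == 2).all()))
EOF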

@Narsil (Collaborator) commented Jan 15, 2025

As previously suggested, the fix cannot be accepted as-is. It bloats the image way too much (20GB vs 12GB).

First we need to reproduce locally, then figure out why the hell Triton wants to recompile something (not a kernel, obviously, since it tries gcc).

@danieldk (Member) commented Jan 24, 2025

This would be a workaround, but it does not solve the underlying bug. If you look at #2838, -L/usr/local/nvidia/lib (or another libcuda library path; this probably depends on CDI/the container toolkit) is missing from the gcc invocation, whereas it is passed on a functioning Docker/Podman system.

Triton finds the directory with libcuda.so by inspecting the output of ldconfig -p, so my suspicion is that your version of Podman somehow interferes with the dynamic linker. It would be helpful if you could run the docker image with --entrypoint /bin/sh and provide the output of the following commands inside the container (example output is from a working system, ignore the Nix store paths, I think it's injected by the CDI interface):

$ cat /etc/ld.so.conf.d/*.conf
/usr/local/cuda/targets/x86_64-linux/lib
/usr/local/cuda-12/targets/x86_64-linux/lib
# libc default configuration
/usr/local/lib
/nix/store/m5g35gc5z6gjqgb37jn7v5qk6h00b3fc-nvidia-x11-550.142-6.6.71/lib
/usr/local/nvidia/lib
/usr/local/nvidia/lib64
# Multiarch support
/usr/local/lib/x86_64-linux-gnu
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu

$ ldconfig -p | grep cuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /nix/store/m5g35gc5z6gjqgb37jn7v5qk6h00b3fc-nvidia-x11-550.142-6.6.71/lib/libcudadebugger.so.1
        libcudadebugger.so.1 (libc6,x86-64) => /usr/local/nvidia/lib/libcudadebugger.so.1
        libcudadebugger.so (libc6,x86-64) => /usr/local/nvidia/lib/libcudadebugger.so
        libcuda.so.1 (libc6,x86-64) => /nix/store/m5g35gc5z6gjqgb37jn7v5qk6h00b3fc-nvidia-x11-550.142-6.6.71/lib/libcuda.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/local/nvidia/lib/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/nvidia/lib/libcuda.so
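
A rough shell approximation of that lookup, assuming Triton simply parses the ldconfig -p output and takes the directories of the libcuda.so entries:

$ ldconfig -p | awk '/libcuda\.so/ { print $NF }' | xargs -r -n1 dirname | sort -u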

@danieldk (Member)

Besides the above, it would also be useful to post the contents of the following files on the host system:

/var/run/cdi/nvidia-container-toolkit.json

/etc/cdi/nvidia-container-toolkit.json

(Or if they don't exist, check in these directories for a JSON file for nvidia.)
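
If the exact filenames differ, something like this should surface any nvidia CDI spec on the host (paths taken from the comment above):

$ ls -l /var/run/cdi /etc/cdi 2>/dev/null
$ grep -l nvidia /var/run/cdi/*.json /etc/cdi/*.json 2>/dev/null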
