xla_extension failed encountered when trying to use exla in a Docker container #90

Open
jeryldev opened this issue Jul 31, 2024 · 16 comments

@jeryldev

jeryldev commented Jul 31, 2024

The xla_extension target fails to build when I compile exla while building a Docker container. Here are some snippets from my Dockerfile:

ARG BUILDER_IMAGE="hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim"
ARG RUNNER_IMAGE="debian:bullseye-20210902-slim"

FROM ${BUILDER_IMAGE}

...

# install build dependencies
# https://github.com/elixir-nx/xla?tab=readme-ov-file#building-from-source
RUN apt-get update -y && apt-get install -y build-essential git apt-transport-https curl gnupg python3-pip gcc-9 g++-9 \
    && apt-get clean && rm -f /var/lib/apt/lists/*_*

RUN export CC=/usr/bin/gcc-9

# https://bazel.build/install/ubuntu#install-on-ubuntu
RUN curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor >bazel-archive-keyring.gpg
RUN mv bazel-archive-keyring.gpg /usr/share/keyrings
RUN echo "deb [arch=amd64 signed-by=/usr/share/keyrings/bazel-archive-keyring.gpg] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list
RUN apt-get update -y && apt-get install -y bazel-6.5.0
RUN ln -s /usr/bin/bazel-6.5.0 /usr/bin/bazel

RUN pip install numpy

...

I get this error when I build from the Dockerfile:

[4,467 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 134s local ... (16 actions, 15 running)
[4,468 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 136s local ... (16 actions running)
[4,469 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 137s local ... (16 actions running)
[4,470 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 139s local ... (16 actions running)
[4,470 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 210s local ... (16 actions running)
ERROR: /home/user/.cache/bazel/_bazel_user/ee4c0f1833dfaa435cb867c88f5a190e/external/llvm-project/mlir/BUILD.bazel:4925:11: Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp failed: (Exit 1): gcc failed: error executing command (from target @llvm-project//mlir:LLVMDialect) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 85 arguments skipped)
gcc: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
[4,487 / 5,843] checking cached actions
INFO: Elapsed time: 1131.980s, Critical Path: 278.37s
INFO: 4487 processes: 343 internal, 4144 local.
FAILED: Build did NOT complete successfully
make: *** [Makefile:26: /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz] Error 1
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
==> lai
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

I only encounter this issue when building a Docker container. I do not encounter any issues when I run mix phx.server.
Is there an official sample Dockerfile for cases where a Docker container setup is required?

@jonatanklosko
Member

Is there a reason you are trying to build XLA from source, rather than use the precompiled binaries?

We use these dockerfiles for precompilation, so those instructions should work.

@jeryldev
Author

jeryldev commented Jul 31, 2024

Ideally, we would prefer not to build the extension from source. I noticed that xla gets built from source when we add exla to our dependencies. Here are the dependencies we've added along with exla:

      {:bumblebee, "~> 0.5.3"},
      {:nx, "~> 0.7.3"},
      {:exla, "~> 0.7.3"},
      {:explorer, "~> 0.9.0"}

We did not add the xla dependency in our list of dependencies, but somehow, it gets added (maybe because it's part of Nx).
Do you have a sample Dockerfile that we could use as a basis when using Bumblebee, Nx, and Exla, without triggering a build of XLA from source? Our main goal for now is to be able to run Nx and Exla in a Docker container. 👍

@josevalim
Contributor

By default it will download a precompiled version. Does it print anything saying it can't use a precompiled binary and therefore must compile from source?

@jeryldev
Author

I think it did. Here are some screenshots from today after removing the precompile steps in my Dockerfile

image

image

@josevalim
Contributor

So do you have XLA_BUILD set by any chance?

@jeryldev
Author

I did not set it anywhere (.bash_profile, Dockerfile, etc.). According to the README.md, it defaults to false.

@jonatanklosko
Member

The build should trigger only when XLA_BUILD is set, otherwise it either downloads a precompiled binary or, if not available, raises an error.

One way to check would be to add RUN [ -z "$XLA_BUILD" ] || exit 1 before the compilation step and see if it goes on.
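
The check suggested above can be sketched as a Dockerfile fragment. This is a minimal sketch rather than an official xla or exla recipe: the echo message and the placement right before mix deps.compile are assumptions made for illustration. Note that a RUN step only sees the build-time environment. If compilation happens when the container starts (for example through a docker compose command), a variable set in the compose file or an env file would not be caught by it.

```dockerfile
# Sketch only: stop the image build if XLA_BUILD is set in the build environment.
# The message text is illustrative; the test itself is the one suggested above.
RUN [ -z "$XLA_BUILD" ] || { echo "XLA_BUILD is set to '$XLA_BUILD'" >&2; exit 1; }

# Compile dependencies only after the check has passed.
RUN mix deps.compile
```

If the first RUN step fails, the variable is coming from the build environment (an ENV or ARG instruction, or the base image). If it passes and XLA is still built from source, the environment of the step that actually runs the compilation is the next place to look.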

@polvalente

I did notice the image uses a rather outdated combo of Elixir and OTP, as well as an older Debian. If possible, I'd update them to rule out the compilation being triggered because no precompiled archive matches that version/platform.

@jeryldev
Author

jeryldev commented Aug 1, 2024

It still went through 🥲

[+] Building 2.7s (14/14) FINISHED                                                                                      docker:default
 => [api internal] load build definition from Dockerfile                                                                          0.0s
 => => transferring dockerfile: 4.75kB                                                                                            0.0s
 => [api internal] load metadata for docker.io/hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim                    2.0s
 => [api auth] hexpm/elixir:pull token for registry-1.docker.io                                                                   0.0s
 => [api internal] load .dockerignore                                                                                             0.0s
 => => transferring context: 1.31kB                                                                                               0.0s
 => [api 1/8] FROM docker.io/hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim@sha256:02ed2d3f2e0360821017751464a6  0.0s
 => CACHED [api 2/8] RUN addgroup --gid 1000 user &&     adduser --disabled-password --ingroup user --uid 1000 user               0.0s
 => CACHED [api 3/8] RUN apt-get update -y && apt-get install -y build-essential git curl     && apt-get clean && rm -f /var/lib  0.0s
 => CACHED [api 4/8] RUN mkdir -p /home/user/app &&     sh -c "git config --global url."https://${GITHUB_API_TOKEN}@github.com/"  0.0s
 => CACHED [api 5/8] WORKDIR /home/user/app                                                                                       0.0s
 => CACHED [api 6/8] RUN mix local.hex --force &&     mix local.rebar --force                                                     0.0s
 => CACHED [api 7/8] RUN mix do local.hex --force, local.rebar --force                                                            0.0s
 => [api 8/8] RUN [ -z "$XLA_BUILD" ] || exit 1                                                                                   0.4s
 => [api] exporting to image                                                                                                      0.1s
 => => exporting layers                                                                                                           0.1s
 => => writing image sha256:4af8189528e21cd493cfe8a2b41e0303905e614e6fe1526f3ceab03627094dab                                      0.0s
 => => naming to docker.io/library/lai-service-api                                                                                0.0s
 => [api] resolving provenance for metadata file                                                                                  0.0s
WARN[0000] /home/jde/code/la/lai-service/docker-compose.yaml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Creating 2/2
 ✔ Network lai-service_default  Created                                                                                           0.1s 
 ✔ Container lai-service-db-1   Created                                                                                           0.2s 
[+] Running 1/1
 ✔ Container lai-service-db-1  Started                                                                                            0.5s 
Resolving Hex dependencies...
Resolution completed in 0.753s
Unchanged:
  aws_rds_castore 1.2.0
  aws_signature 0.3.2
  axon 0.6.1
  bumblebee 0.5.3
  
.....

===> Analyzing applications...
===> Compiling telemetry
===> Analyzing applications...
===> Compiling telemetry_poller
===> Analyzing applications...
===> Compiling certifi
===> Analyzing applications...
===> Compiling hackney
==> xla
Compiling 2 files (.ex)
Generated xla app
mkdir -p /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
        cd /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
        git init && \
        git remote add origin https://github.com/openxla/xla.git && \
        git fetch --depth 1 origin 771e38178340cbaaef8ff20f44da5407c15092cb && \
        git checkout FETCH_HEAD && \
        rm /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelversion
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint:   git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint:   git branch -m <name>
Initialized empty Git repository in /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.git/
warning: redirecting to https://github.com/openxla/xla.git/
From https://github.com/openxla/xla
 * branch            771e38178340cbaaef8ff20f44da5407c15092cb -> FETCH_HEAD
Note: switching to 'FETCH_HEAD'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 771e381 [XLA:GPU] Check tensor_float_32_execution_enabled() in Triton codegen too
rm -f /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension && \
        ln -s "/home/user/app/deps/xla/extension" /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension && \
        cd /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
        bazel build --define "framework_shared_object=false" -c opt    //xla/extension:xla_extension && \
        mkdir -p /home/user/.cache/xla/0.6.0/cache/build/ && \
        cp -f /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/bazel-bin/xla/extension/xla_extension.tar.gz /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz
/bin/sh: 4: bazel: not found
make: *** [Makefile:26: /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz] Error 127
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
==> lai
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

@jonatanklosko
Member

Interesting, I don't have any idea at the moment. It would be helpful if you could minimize it into a reproducible repo, like an empty mix project with the deps and the Dockerfile :)

@georgeguimaraes

georgeguimaraes commented Dec 12, 2024

Got something similar:

5.162 ==> xla
5.162 Compiling 5 files (.ex)
5.267 Generated xla app
5.315
5.315 17:30:36.318 [info] Downloading a precompiled XLA archive for target aarch64-linux-gnu-cpu
9.752
9.752 17:30:40.757 [info] Successfully downloaded the XLA archive
10.47 ==> exla
10.47 Unpacking /root/.cache/xla/0.8.0/download/xla_extension-0.8.0-aarch64-linux-gnu-cpu.tar.gz into /app/deps/exla/cache
15.12 g++ cache/0.9.2/objs/exla.o cache/0.9.2/objs/exla_client.o cache/0.9.2/objs/exla_mlir.o cache/0.9.2/objs/custom_calls.o cache/0.9.2/objs/exla_nif_util.o cache/0.9.2/objs/ipc.o cache/0.9.2/objs/custom_calls/eigh_f32.o cache/0.9.2/objs/custom_calls/eigh_f64.o cache/0.9.2/objs/custom_calls/lu_bf16.o cache/0.9.2/objs/custom_calls/lu_f16.o cache/0.9.2/objs/custom_calls/lu_f32.o cache/0.9.2/objs/custom_calls/lu_f64.o cache/0.9.2/objs/custom_calls/qr_bf16.o cache/0.9.2/objs/custom_calls/qr_f16.o cache/0.9.2/objs/custom_calls/qr_f32.o cache/0.9.2/objs/custom_calls/qr_f64.o cache/0.9.2/objs/exla_cuda.o -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -shared -Wl,-rpath,'$ORIGIN/xla_extension/lib'
15.14 cache/0.9.2/objs/exla.o: file not recognized: file format not recognized
15.14 collect2: error: ld returned 1 exit status
15.14 make: *** [Makefile:101: cache/libexla.so] Error 1
15.14 could not compile dependency :exla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile exla --force", update it with "mix deps.update exla" or clean it with "mix deps.clean exla"
15.14 ==> relax
15.14 ** (Mix) Could not compile with "make" (exit status: 2).
15.14 You need to have gcc and make installed. If you are using
15.14 Ubuntu or any other Debian-based system, install the packages
15.14 "build-essential". Also install "erlang-dev" package if not
15.14 included in your Erlang/OTP version. If you're on Fedora, run
15.14 "dnf group install 'Development Tools'".
[+] Running 0/1
 ⠹ Service api  Building                                                                                                              94.2s
failed to solve: process "/bin/sh -c mix compile" did not complete successfully: exit code: 1

I'm running this on a MacBook Pro with an M1 Pro chip. This is a bare-minimum Elixir repo available at https://github.com/georgeguimaraes/relax (using {:exla, "~> 0.9.2"})

All I'm running to trigger this is docker compose up --build in the repo.

@georgeguimaraes

Changing the dependency to {:exla, "~> 0.8.0"} makes it work:

api-1  | ==> exla
api-1  | Using libexla.so from /root/.cache/xla/exla/elixir-1.17.3-erts-15.2-xla-0.8.0-exla-0.8.0-ioo6ddg2zbm7ovoei2oc4ucrjy/libexla.so
api-1  | Compiling 23 files (.ex)
api-1  | Generated exla app
api-1  | ==> relax
api-1  | Compiling 1 file (.ex)
api-1  | Generated relax app
api-1  | Running ExUnit with seed: 697364, max_cases: 8
api-1  |
api-1  | ..
api-1  | Finished in 0.01 seconds (0.00s async, 0.01s sync)

@georgeguimaraes

Using {:exla, "0.9.1"} makes the Docker image recompile xla, but the build finishes and the tests run:

❯ docker compose up --build
[+] Running 0/0
[+] Running 0/1 Building                                                                                                               0.1s
[+] Building 57.3s (13/13) FINISHED                                                                                          docker:default
 => [api internal] load build definition from Dockerfile                                                                               0.0s
 => => transferring dockerfile: 577B                                                                                                   0.0s
 => [api internal] load metadata for mirror.gcr.io/hexpm/elixir:1.17.3-erlang-27.2-ubuntu-noble-20241015                               1.3s
 => [api internal] load .dockerignore                                                                                                  0.0s
 => => transferring context: 2B                                                                                                        0.0s
 => [api 1/7] FROM mirror.gcr.io/hexpm/elixir:1.17.3-erlang-27.2-ubuntu-noble-20241015@sha256:f3a173c0d868e720c77a63c83de10c4b169f939  0.0s
 => [api internal] load build context                                                                                                  0.5s
 => => transferring context: 3.14MB                                                                                                    0.5s
 => CACHED [api 2/7] RUN apt-get update -y && apt-get install -y inotify-tools build-essential erlang-dev git curl   && apt-get clean  0.0s
 => CACHED [api 3/7] WORKDIR /app                                                                                                      0.0s
 => CACHED [api 4/7] RUN mix local.hex --force &&   mix local.rebar --force                                                            0.0s
 => [api 5/7] COPY . .                                                                                                                 0.9s
 => [api 6/7] RUN mix deps.get                                                                                                         2.8s
 => [api 7/7] RUN mix compile                                                                                                         49.7s
 => [api] exporting to image                                                                                                           2.1s
 => => exporting layers                                                                                                                2.1s
 => => writing image sha256:629ef48806cb54cd54e5c420d3761de5693c4b24cc56e60c70dada4c38250f04                                           0.0s
[+] Running 2/1o docker.io/library/relax-api                                                                                           0.0s
 ✔ Service api            Built                                                                                                       57.4s
 ✔ Container relax-api-1  Recreated                                                                                                    0.1s
Attaching to api-1
api-1  | ==> complex
api-1  | Compiling 2 files (.ex)
api-1  | Generated complex app
api-1  | ==> nx
api-1  | Compiling 36 files (.ex)
api-1  | Generated nx app
api-1  | ==> nimble_pool
api-1  | Compiling 2 files (.ex)
api-1  | Generated nimble_pool app
api-1  | ==> elixir_make
api-1  | Compiling 8 files (.ex)
api-1  | Generated elixir_make app
api-1  | ==> xla
api-1  | Compiling 5 files (.ex)
api-1  | Generated xla app
api-1  | ==> exla
api-1  | Using libexla.so from /root/.cache/xla/exla/elixir-1.17.3-erts-15.2-xla-0.8.0-exla-0.9.1-t34ppw6zq2bvv4txq247gllfci/libexla.so
api-1  | Compiling 24 files (.ex)
api-1  |      warning: Nx.Defn.stream/3 is deprecated. Move the streaming loop to Elixir instead
api-1  |      │
api-1  |  356 │     Nx.Defn.stream(function, args, Keyword.put(options, :compiler, EXLA))
api-1  |~
api-1  |      │
api-1  |      └─ lib/exla.ex:356:13: EXLA.stream/3
api-1  |
api-1  | Generated exla app
api-1  | ==> relax
api-1  | Compiling 1 file (.ex)
api-1  | Generated relax app
api-1  | Running ExUnit with seed: 309870, max_cases: 8
api-1  |
api-1  | ..
api-1  | Finished in 0.01 seconds (0.00s async, 0.01s sync)
api-1  | 1 doctest, 1 test, 0 failures
api-1 exited with code 0

@georgeguimaraes

btw you'll see in my repo that I'm using the latest Elixir, OTP, and Ubuntu available

@jonatanklosko
Member

jonatanklosko commented Dec 16, 2024

@georgeguimaraes in your case, the issue is that you do COPY . . in the Dockerfile, which also copies the local deps/ and _build/ directories (which I expect you have) into the Docker build. In deps/ there are EXLA platform-specific .o compilation artifacts, and reusing them inside the Docker build fails.

I was able to reproduce the error by running mix deps.get and mix compile in the repo, and then docker build . from the same directory. Removing deps/ and _build/ makes the Docker build succeed :) The actual solution is to make COPY more specific so that it does not include these directories.
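
A more specific COPY could look like the sketch below. It assumes a conventional Mix project layout (mix.exs, mix.lock, lib/, test/), which is an assumption about the repo rather than something shown in this thread. The base image tag and the apt packages are taken from the build log earlier in the thread.

```dockerfile
# Sketch only: copy just the project sources so that host-built deps/ and _build/
# (which may contain object files for a different platform) never enter the image.
FROM mirror.gcr.io/hexpm/elixir:1.17.3-erlang-27.2-ubuntu-noble-20241015

RUN apt-get update -y && apt-get install -y inotify-tools build-essential erlang-dev git curl \
    && apt-get clean

WORKDIR /app

RUN mix local.hex --force && mix local.rebar --force

# Fetch dependencies inside the image, based only on the manifest and lock file.
# (Assumes mix.lock exists in the project root.)
COPY mix.exs mix.lock ./
RUN mix deps.get

# Copy source directories explicitly instead of using COPY . .
# (lib/ and test/ are assumed; adjust to the actual project layout.)
COPY lib lib
COPY test test

RUN mix compile
```

Alternatively, a .dockerignore file listing deps/ and _build/ keeps those directories out of the build context even when the Dockerfile keeps its broad COPY instruction.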

@georgeguimaraes

Tks @jonatanklosko! TIL :)
