Issue Building Wheel with custom UMD #15527

Closed
blozano-tt opened this issue Nov 28, 2024 · 5 comments
Labels: ci-bug (bugs found in CI), infra-ci (infrastructure and/or CI changes), P1, pywheel



blozano-tt commented Nov 28, 2024

I am trying to eliminate the need for libnng and libuv to be shipped with tt-metal and distributed in the wheel.

To do so, I make the two dependencies private, and force them to be STATICALLY linked into libdevice.so

tenstorrent/tt-umd#339
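
The shape of the change is roughly this (a sketch only; the actual diff is in the PR above, and the target/dependency names here are assumptions):

    # Link nng and libuv as static, PRIVATE dependencies so their symbols are
    # absorbed into libdevice.so and no libnng.so / libuv.so is needed at runtime.
    target_link_libraries(device PRIVATE nng uv_a)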

When testing this custom UMD in tt-metal CI, the build-wheel workflow is still looking for shared libraries for nng and uv!
https://github.com/tenstorrent/tt-metal/actions/runs/12046563342/job/33588021255

Why?

I am also led to wonder: why do we need to build the Python wheel separately from the C++ build?

I suspect something is amiss in the workflow.

cc: @tt-rkim @afuller-TT @broskoTT


blozano-tt commented Nov 28, 2024

I took @broskoTT's tt-metal branch that had the custom UMD submodule.
https://github.com/tenstorrent/tt-metal/tree/brosko-blozano-power-team

I built the wheel locally.

I unzipped the wheel: unzip metal_libs-0.53.1rc2.dev4+blackhole-cp38-cp38-linux_x86_64.whl

I ran ldd, and there is no dynamic linkage against uv or nng required:

$ ldd ./ttnn/_ttnn.cpython-38-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007ffc6e89b000)
        libc++.so.1 => /lib/x86_64-linux-gnu/libc++.so.1 (0x00007f1f803bb000)
        libc++abi.so.1 => /lib/x86_64-linux-gnu/libc++abi.so.1 (0x00007f1f80384000)
        libunwind.so.1 => /lib/x86_64-linux-gnu/libunwind.so.1 (0x00007f1f80376000)
        libpython3.8.so.1.0 => /lib/x86_64-linux-gnu/libpython3.8.so.1.0 (0x00007f1f7fe20000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1f7fe1a000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f1f7fdfe000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1f7fdd9000)
        libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x00007f1f7fdcf000)
        libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007f1f7fd7e000)
        libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f1f7fd71000)
        libdevice.so => /home/blozano/tt-metal/TEMP/./ttnn/build/lib/libdevice.so (0x00007f1f7f94b000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1f7f6dd000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1f7f58c000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1f7f567000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1f7f375000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1f82ec7000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f1f7f36b000)
        libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f1f7f33d000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f1f7f338000)
        libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f1f7f309000)
        libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f1f7f2fe000)
        libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00007f1f7f2e1000)

tt-rkim added the infra-ci, P1, pywheel, and ci-bug labels on Nov 28, 2024

tt-rkim commented Nov 28, 2024

You sent me down quite the rabbit hole, Mr. Boss. But I think I have answers for you.

So I downloaded the artifact directly from the workflow, then used patchelf to set its RPATH to only $ORIGIN/build/lib, to make sure we weren't linking against anything spurious on my machine. That is the intent on user machines anyway. This step is important; it bit me later when I realized my artifact copy was linking against libraries in various places on my machine:

rkim@e12cs07:~/gh-artifacts-logs/12046563342/eager-dist-ubuntu-20.04-grayskull/ttnn$ patchelf --set-rpath '$ORIGIN/build/lib' _ttnn.cpython-38-x86_64-linux-gnu.so

Then I took a look with ldd:

rkim@e12cs07:~/gh-artifacts-logs/12046563342/eager-dist-ubuntu-20.04-grayskull/ttnn$ ldd _ttnn.cpython-38-x86_64-linux-gnu.so
	linux-vdso.so.1 (0x00007ffd4ecbf000)
	libc++.so.1 => /lib/x86_64-linux-gnu/libc++.so.1 (0x00007f46df1fd000)
	libc++abi.so.1 => /lib/x86_64-linux-gnu/libc++abi.so.1 (0x00007f46df1c6000)
	libunwind.so.1 => /lib/x86_64-linux-gnu/libunwind.so.1 (0x00007f46df1b8000)
	libtt_metal.so => /home/rkim/gh-artifacts-logs/12046563342/eager-dist-ubuntu-20.04-grayskull/ttnn/./build/lib/libtt_metal.so (0x00007f46deed5000)
	libpython3.8.so.1.0 => /lib/x86_64-linux-gnu/libpython3.8.so.1.0 (0x00007f46de97f000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f46de979000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f46de95b000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f46de938000)
	libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x00007f46de92e000)
	libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007f46de8dd000)
	libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f46de8d0000)
	libdevice.so => /home/rkim/gh-artifacts-logs/12046563342/eager-dist-ubuntu-20.04-grayskull/ttnn/./build/lib/libdevice.so (0x00007f46de508000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f46de324000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f46de1d5000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f46de1ba000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f46ddfc8000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f46e199f000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f46ddfbe000)
	libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f46ddf8e000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f46ddf89000)
	libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f46ddf5c000)
	libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f46ddf51000)
	libnng.so.1 => not found

Looks like it's looking for libnng (not libuv though, which is odd). Your logs don't mention anything about libuv either, but that's probably because the loader dies on the first library it can't find.

Anyway, as we both suspect, something might be amiss with how we use UMD in the wheel build. I took a look at libdevice.so:

rkim@e12cs07:~/gh-artifacts-logs/12046563342/eager-dist-ubuntu-20.04-grayskull/ttnn$ ldd /home/rkim/gh-artifacts-logs/12046563342/eager-dist-ubuntu-20.04-grayskull/ttnn/./build/lib/libdevice.so
	linux-vdso.so.1 (0x00007ffe64afc000)
	libc++.so.1 => /lib/x86_64-linux-gnu/libc++.so.1 (0x00007f75a3340000)
	libc++abi.so.1 => /lib/x86_64-linux-gnu/libc++abi.so.1 (0x00007f75a3309000)
	libunwind.so.1 => /lib/x86_64-linux-gnu/libunwind.so.1 (0x00007f75a32fb000)
	libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007f75a32aa000)
	libnng.so.1 => not found
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f75a32a0000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f75a327b000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f75a3275000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f75a3093000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f75a2f44000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f75a2f29000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f75a2d37000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f75a3816000)
	libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f75a2d08000)
	libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f75a2cfd000)

It's using nng there, but not uv. That means uv seems to have been linked statically, but libnng.so has not. Not sure what's going on here.

Which brings me to the next point you brought up:

I am also led to wonder: why do we need to build the Python wheel separately from the C++ build?

We actually don't separate them in this case. There is a separate workflow outside of APC that does the C++ build as part of the wheel build process, but not in APC. In APC, we take the build artifacts from build-artifact and simply copy them into the wheel. Check out the TT_FROM_PRECOMPILED env var in setup.py.
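
Roughly, the precompiled path amounts to something like this (an illustrative sketch only, not the actual setup.py; the class name and copy destination are made up):

    import os
    import shutil
    from setuptools.command.build_ext import build_ext

    class MetalBuildExt(build_ext):  # hypothetical name; wired in via setup(cmdclass=...)
        def run(self):
            if os.environ.get("TT_FROM_PRECOMPILED"):
                # APC path: reuse the artifacts produced by build-artifact and
                # copy them into the wheel layout instead of rebuilding the C++.
                dest = os.path.join(self.build_lib, "ttnn", "build", "lib")
                shutil.copytree("build/lib", dest, dirs_exist_ok=True)
            else:
                # Standalone wheel workflow: run the C++ build inside the wheel build.
                super().run()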

Anyway, I thought I'd take a look at what's going on with the build artifact. I downloaded that and set the rpath as well:

rkim@e12cs07:~/gh-artifacts-logs/12046563342/TTMetal_build_grayskull/ttnn/ttnn$ ldd _ttnn.so 
	linux-vdso.so.1 (0x00007fff4c7eb000)
	libc++.so.1 => /lib/x86_64-linux-gnu/libc++.so.1 (0x00007fa6f2bdd000)
	libc++abi.so.1 => /lib/x86_64-linux-gnu/libc++abi.so.1 (0x00007fa6f2ba6000)
	libunwind.so.1 => /lib/x86_64-linux-gnu/libunwind.so.1 (0x00007fa6f2b98000)
	libtt_metal.so => /home/rkim/gh-artifacts-logs/12046563342/TTMetal_build_grayskull/ttnn/ttnn/./../../build/lib/libtt_metal.so (0x00007fa6f28b5000)
	libpython3.8.so.1.0 => /lib/x86_64-linux-gnu/libpython3.8.so.1.0 (0x00007fa6f235f000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa6f2359000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fa6f233b000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa6f2318000)
	libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x00007fa6f230e000)
	libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007fa6f22bd000)
	libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fa6f22b0000)
	libdevice.so => /home/rkim/gh-artifacts-logs/12046563342/TTMetal_build_grayskull/ttnn/ttnn/./../../build/lib/libdevice.so (0x00007fa6f1e79000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa6f1c95000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa6f1b46000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa6f1b2b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa6f1939000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa6f5546000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa6f192f000)
	libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007fa6f18ff000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fa6f18fa000)
	libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007fa6f18cd000)
	libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007fa6f18c2000)
	libnng.so.1 => not found

rkim@e12cs07:~/gh-artifacts-logs/12046563342/TTMetal_build_grayskull/ttnn/ttnn$ cd ../../build/lib && ldd libdevice.so
	linux-vdso.so.1 (0x00007ffcd8f8e000)
	libc++.so.1 => /lib/x86_64-linux-gnu/libc++.so.1 (0x00007fb765cf1000)
	libc++abi.so.1 => /lib/x86_64-linux-gnu/libc++abi.so.1 (0x00007fb765cba000)
	libunwind.so.1 => /lib/x86_64-linux-gnu/libunwind.so.1 (0x00007fb765cac000)
	libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007fb765c5b000)
	libnng.so.1 => not found
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb765c51000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb765c2c000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb765c26000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb765a44000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb7658f5000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb7658da000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb7656e8000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb766236000)
	libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007fb7656b9000)
	libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007fb7656ae000)

Odd, so libdevice.so in the raw build artifact from the C++ build also can't find it. That should then explain why we see this issue in the wheel. But then why don't the non-wheel tests (C++ tests) fail?

Firstly, we still seem to package some .so for libnng:

rkim@e12cs07:~/gh-artifacts-logs/12046563342/TTMetal_build_grayskull/build/lib$ ls -hal | grep nng
-rw-r--r-- 3 rkim rkim 559K Nov 27 08:37 libnng.so
-rw-r--r-- 3 rkim rkim 559K Nov 27 08:37 libnng.so.1
-rw-r--r-- 3 rkim rkim 559K Nov 27 08:37 libnng.so.1.8.0

So nng wasn't even built statically. I don't see uv in there. I've actually seen this before during some recent work with UMD: for some reason, even if we try to build statically with -DBUILD_SHARED_LIBS=OFF or something similar, libnng refuses to listen and still produces an .so. I took a quick look at nng's top-level CMakeLists.txt and nothing looks suspicious, but still, it's interesting.

But then that raises the question: how do the other tests work if, under normal conditions, libdevice.so can't find libnng when you attempt to build uv and nng statically?

I think this is because we set LD_LIBRARY_PATH in the workflow:

rkim@e12cs07:~/gh-artifacts-logs/12046563342/TTMetal_build_grayskull/build/lib$ cat ~/tt-metal/.github/workflows/cpp-post-commit.yaml | grep LD
      LD_LIBRARY_PATH: ${{ github.workspace }}/build/lib

This means we'll always be able to find our libs, even if the libs themselves can't find each other under normal circumstances.

So my conclusion is the following:

  • We seem to not be able to produce a static build of libnng
  • This explains why _ttnn in the wheel looks for libnng.so still, because libdevice.so is looking for it
  • libdevice.so has no RPATH or anything similar set, so if there are libraries in build/lib that it uses, it normally can't find them (easy to confirm with readelf; see the check after this list)
  • Although _ttnn has its RPATH set to a path that will eventually find build/lib, libdevice.so does not, and libdevice.so is the one actually trying to find libnng
  • The non-Docker/non-wheelized tests work because we set LD_LIBRARY_PATH, allowing a final hammer for all libraries to find each other in build/lib
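
As an aside, the "no RPATH" point above is easy to confirm directly with readelf (standard binutils), run from wherever libdevice.so sits:

$ readelf -d libdevice.so | grep -E 'RPATH|RUNPATH'
# No output means the library carries no RPATH/RUNPATH of its own and relies on
# the default search path or LD_LIBRARY_PATH to locate its dependencies.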

I'll leave it up to you to decide the best way forward. A couple of options:

  • Try to figure out what's going on with libnng and static building. Maybe I'm wrong here and we are producing things properly locally, but something in CI's build-artifact job is messed up?
  • Perhaps set an RPATH for libdevice.so within the UMD build using $ORIGIN so it can find any missing libs built with CPM in build/lib (see the sketch after this list).
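
For the second option, a rough sketch of what that could look like in the UMD CMake (the target name device and the exact properties are assumptions on my part, not the actual code):

    # Give libdevice.so an $ORIGIN-relative RPATH so it can resolve sibling
    # libraries that land next to it in build/lib, without LD_LIBRARY_PATH.
    set_target_properties(device PROPERTIES
        BUILD_RPATH   "$ORIGIN"
        INSTALL_RPATH "$ORIGIN"
    )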

I would personally avoid any further use of LD_LIBRARY_PATH if we can manage it. This was @TT-billteng's exact concern. We did it before as a hack to make build-artifact possible, but it was always a piece of technical debt. I prefer RPATHs with $ORIGIN in them.


blozano-tt commented Nov 28, 2024

We seem to not be able to produce a static build of libnng
This explains why _ttnn in the wheel looks for libnng.so still, because libdevice.so is looking for it

Here is an isolated build of the UMD branch; note that there is no dynamic linkage required for either nng or uv:

$ cd tt_metal/third_party/umd
$ git checkout brosko/repeat
$ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
$ cmake --build build
$ ldd build/lib/libdevice.so 
        linux-vdso.so.1 (0x00007fff244b2000)
        libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007f57ba7e9000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f57ba7df000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f57ba7bc000)
        libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00007f57ba79f000)
        libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x00007f57ba795000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f57ba78f000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f57ba51f000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f57ba3d0000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f57ba3ab000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f57ba1b9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f57bac87000)
        libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f57ba18c000)
        libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f57ba181000)

Note that we can grep for nng and uv symbols in the library as well, meaning they were successfully linked statically:

$ nm -D build/lib/libdevice.so | grep nng_recv
00000000001a6660 T nng_recv
00000000001a69d0 T nng_recv_aio
00000000001a6750 T nng_recvmsg

$ nm -D build/lib/libdevice.so | grep uv_uptime
00000000001e6a70 T uv_uptime

libdevice.so has no RPATH or anything similar set, so if there libraries in build/lib that it uses, it normally can't find them

This is irrelevant if you are linking statically.

There is something broken here, and I'm pretty sure it's the infrastructure.

Your investigation leads me to wonder if we are bringing in stale files from a .cpmcache.

What leads me to this suspicion is that the nng library's CMakeLists.txt uses the target name nng whether it is built shared or static. So if there are stale files around and the .cpmcache is not refreshed, UMD might happily link against the shared version.
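
If stale cache files are the culprit, a quick way to check (the .cpmcache and build paths here are assumptions about the checkout layout) is to look for leftover shared nng artifacts before re-configuring:

$ find .cpmcache build -name 'libnng*.so*' 2>/dev/null
# Any hits are shared-library leftovers that a non-clean reconfigure could pick up.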

TT-billteng commented

Didn't read the whole thing yet, but I'm leaning towards hiding libuv and nng behind a SIMULATOR build flag for the simulation flow, to avoid this completely for now. @vtangTT

We can re-visit this when the multi-host/galaxy work goes into production

And go eat some turkey @blozano-tt 🦃 jeez

blozano-tt commented

@tt-rkim @TT-billteng

I figured this out.

The problem is that we use the customary CMake cache variable BUILD_SHARED_LIBS to determine whether we should build our libraries as static or shared.

option(BUILD_SHARED_LIBS "Create shared libraries" ON)

This variable globally impacts all of the dependencies we bring in through CPM, including those brought in by UMD's CPM. So if it is ON, nng ends up being built as a shared library.

The quick hack to fix this for my current situation is:
tenstorrent/tt-umd@aa74c4d

I am currently not sure what the right thing to do is for the project.

We don't really want BUILD_SHARED_LIBS to globally impact our dependencies.

Maybe we should have something like METALIUM_BUILD_SHARED_LIBS as a project option, and then set BUILD_SHARED_LIBS from it ourselves (something like the sketch below).
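
As a rough sketch of that idea (the option name and where the set/restore happens are assumptions, not a concrete proposal):

    # Hypothetical: decouple our own shared/static choice from third-party deps.
    option(METALIUM_BUILD_SHARED_LIBS "Build Metalium libraries as shared" ON)
    set(BUILD_SHARED_LIBS ${METALIUM_BUILD_SHARED_LIBS})

    # Force static only while configuring dependencies pulled in via UMD/CPM,
    # then restore the project-wide default afterwards.
    set(_saved_bsl ${BUILD_SHARED_LIBS})
    set(BUILD_SHARED_LIBS OFF)
    add_subdirectory(tt_metal/third_party/umd)
    set(BUILD_SHARED_LIBS ${_saved_bsl})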

cc: @afuller-TT @broskoTT
