-
-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use upstream's new split build #272
Comments
Appreciate you creating this issue. Ill dive into the issues with protobuf on Windows first but after that Id love to take a stab at this. |
yeah no problem. the fact that we have CIs that can run most of the linux + osx stuff might help somebody else jump in too! |
yes. |
I'd like to give this one a shot. For a start, I'm going to do it in place — but ideally, I think we would end up splitting the feedstocks, in order to be able to build them entirely separately. Is anyone opposed to that? |
I wonder what the advantage of that is? Is it to decouple the two builds completely? I would assume most of the time they are quite coupled. |
It's primarily about build time (and space usage). Currently, building a single variant involves spending a lot of time building the common stuff ("libtorch"), and then a similar amount of time building PyTorch for all Python versions. From what I understand, splitting it would mean you can build libtorch first, and then use it to build PyTorch variants separately — effectively halving the time of each build. |
But cant you achieve the same thing with the multi output recipe? Because from what I understand that is currently already the case |
If there are reasonable intermediate quantities (like libtorch) to build, then absolutely we can split them off into a separate feedstock. Before the recent upstream changes there was no canonical way to do it, and since the builds were manual anyway, there wasn't a big benefit for a split. But if we can (for example) only build We discussed this (at some length) in #108, which I see you've discovered already. ;-) |
you can test and validate many theories in staged-recipes. Don't let the current state stop you from exploring! The |
Ok, so after spending two days trying and hacking, it seems that upstream's split build either does not really work as of 2.5.1, or is incompatible with something our recipe does. The main issue is this:
On top of that it seems that I don't think we can actually utilize this without further changes to CMake files, and (at least for someone as unaccustomed to the upstream build system) that seems like a lot of effort for a minimal gain.
My "cleanest" approach was: diff --git a/recipe/build.sh b/recipe/build.sh
index f36cd2b..f7cad5e 100644
--- a/recipe/build.sh
+++ b/recipe/build.sh
@@ -106,15 +106,17 @@ else
fi
if [[ "$PKG_NAME" == "pytorch" ]]; then
- PIP_ACTION=install
+ PIP_ACTION="install --config-settings=--build-option=--cmake"
# Trick Cmake into thinking python hasn't changed
sed "s/3\.12/$PY_VER/g" build/CMakeCache.txt.orig > build/CMakeCache.txt
sed -i.bak "s/3;12/${PY_VER%.*};${PY_VER#*.}/g" build/CMakeCache.txt
sed -i.bak "s/cpython-312/cpython-${PY_VER%.*}${PY_VER#*.}/g" build/CMakeCache.txt
+ export BUILD_PYTHON_ONLY=1
else
# For the main script we just build a wheel for so that the C++/CUDA
# parts are built. Then they are reused in each python version.
PIP_ACTION=wheel
+ export BUILD_LIBTORCH_WHL=1
fi
# MacOS build is simple, and will not be for CUDA
@@ -217,15 +219,14 @@ $PREFIX/bin/python -m pip $PIP_ACTION . --no-deps -vvv --no-clean \
if [[ "$PKG_NAME" == "libtorch" ]]; then
mkdir -p $SRC_DIR/dist
pushd $SRC_DIR/dist
- wheel unpack ../torch-*.whl
- pushd torch-*
+ wheel unpack ../torch_no_python-*.whl
+ pushd torch_no_python-*
mv torch/bin/* ${PREFIX}/bin
mv torch/lib/* ${PREFIX}/lib
mv torch/share/* ${PREFIX}/share
for f in ATen caffe2 tensorpipe torch c10; do
mv torch/include/$f ${PREFIX}/include/$f
done
- rm ${PREFIX}/lib/libtorch_python.*
popd
popd
diff --git a/recipe/meta.yaml b/recipe/meta.yaml
index 4b2b655..6863f1e 100644
--- a/recipe/meta.yaml
+++ b/recipe/meta.yaml
@@ -1,5 +1,5 @@
{% set version = "2.5.1" %}
-{% set build = 3 %}
+{% set build = 4 %}
{% if cuda_compiler_version != "None" %}
{% set build = build + 200 %}
@@ -36,6 +36,7 @@ source:
- patches/0008-Fix-pickler-error.patch
# https://github.com/pytorch/pytorch/pull/137331
- patches/137331.patch
+ - patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
build:
number: {{ build }}
diff --git a/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch b/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
new file mode 100644
index 0000000..182770e
--- /dev/null
+++ b/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
@@ -0,0 +1,35 @@
+From 1206a5d7299740ddf9ccac5828d62f29bd59fa98 Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Micha=C5=82=20G=C3=B3rny?= <[email protected]>
+Date: Thu, 14 Nov 2024 16:41:11 +0100
+Subject: [PATCH] Build torch_shm_manager as part of "libtorch" wheel
+
+---
+ torch/CMakeLists.txt | 8 ++++----
+ 1 file changed, 4 insertions(+), 4 deletions(-)
+
+diff --git a/torch/CMakeLists.txt b/torch/CMakeLists.txt
+index c74b4543..2f94d2e1 100644
+--- a/torch/CMakeLists.txt
++++ b/torch/CMakeLists.txt
+@@ -9,10 +9,6 @@ if(NOT CAFFE2_CMAKE_BUILDING_WITH_MAIN_REPO)
+ set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+ endif()
+
+-if(NOT BUILD_PYTHON)
+- return()
+-endif()
+-
+ set(TORCH_SRC_DIR "${CMAKE_CURRENT_SOURCE_DIR}")
+ set(TORCH_ROOT "${TORCH_SRC_DIR}/..")
+
+@@ -29,6 +25,10 @@ endif()
+ set(LIBSHM_SRCDIR ${TORCH_SRC_DIR}/lib/${LIBSHM_SUBDIR})
+ add_subdirectory(${LIBSHM_SRCDIR})
+
++if(NOT BUILD_PYTHON)
++ return()
++endif()
++
+
+ # Generate files
+ set(TOOLS_PATH "${TORCH_ROOT}/tools") |
I was trying to find a build log on upstream CI, but apparently the only job actually testing split build is disabled right now per pytorch/pytorch#138750. |
Okay, I've done some more testing today, and if I build That said, it works only as long as you preserve the |
This is oddly similar to what our logs report with our existing hacks. With upstream abandoning their conda packages i wonder if this strategy should simply be abandonned? They haven't really reviewed my shoe in 1 liner PR allowing one to override ld so I don't really have hope for anything more complicated.... |
So, according to some quick testing, if I do "libwheel" and "python-only" build in a single "subpackage" (i.e. call one after the other), things work roughly as expected (~170 objects get built). However, I couldn't find a way to make work across different subpackages To be honest, the only (potential) advantage I see from upstream split logic is that it allows us to build "libtorch" without building the Python modules, i.e. save a little time during the first stage. I suppose we could gain a similar effect by "microoptimizing" the build order — i.e. by having pytorch built for Python 3.12 first, so we wouldn't be building that variant twice (but I haven't verified that). |
I'm a little concerned about making the recipe so complicated that other maintains can't join. |
That's a fair point (though TBH I think we're beyond that point already xP). By my "optimization", I merely meant reordering Python versions — but I've just noticed we're not setting these explicitly in the recipe, so that's probably not an option. |
I think the only thing we really need to avoid are massive patches (and other rebase-hazards). A couple lines of CMake modification to unvendor something should not be an issue. We can document the workflow for updating the patches so that contributors aren't stuck on that (here's an example for a feedstock where we did that; might want to move that to the knowledge base), but in any case: build time reductions make things more maintainable on the margin IMO. |
Comment:
It seems upstream introduced two new parameters to their build system:
BUILD_LIBTORCH_WHL
BUILD_PYTHON_ONLY
see: pytorch/pytorch@a25b28a
It seems that these made it to 2.4.0 and 2.4.1. It may be interesting to use these to alleviate some of the complexities of our existing build system.
However, these parameters seem very "pypi" centric where shared libraries are just shoved in the
site-packages
directory. We may want to massage things a little bit to make them work in a "conda" context.cc: @baszalmstra
The text was updated successfully, but these errors were encountered: