CI: fix workflows #1035

wkpark · 2024-02-05T20:35:25Z

~~fix regressions introduced by PR #949~~

~~PR HOTFIX: Fix regression (cpu fix) #1038 included.~~ (merged)

misc workflows fixes

restored

Restore some things that were dropped by a hasty, insufficiently careful merge.

- support the cmake -B build option correctly.
- use python-version=3.10 for builds.
- build a wheel with all available CUDA versions.
- ~~use conda+mamba method. disable cuda-toolkits. (the cuda-toolkits is too slow and not stable in some cases.)~~
- use more flexible/customizable cmake options (set COMPUTE_CAPABILITY, for example).

fixed

- fix Jimver/cuda-toolkit speed issue (extracted to "CI: Fix cuda toolkit speed issue." #1055) ~~cuda-toolkit seems too slow and even breaks sometimes. (Please see https://github.com/wkpark/bitsandbytes/actions/runs/7793281860/job/21252763742)~~
- fix wheel names for aarch64
- ~~fix docker using container: image: method.~~ use docker-run-actions
- update docker image version
  - fixed deprecation warning for the cuda12.1 docker image (cuda12.1.0 is deprecated but cuda12.1.1 is supported)

misc

- drop artifact retention-days: it can be configured under Settings->Actions->General.
- concurrency: fix restored.
- fail-fast: false restored.
- use ilammy/msvc-dev-cmd instead of microsoft/setup-msbuild
  - microsoft/setup-msbuild is used for msbuild; see ilammy/msvc-dev-cmd.
- use cmake -G Ninja: Ninja is faster and more popular (Triton uses Ninja, for example).
- define COMPUTE_CAPABILITY differently per trigger
  - for pull_request, use the concise COMPUTE_CAPABILITY=61;75;86;89 (only popular GPUs included)
  - for refs/tags/*, use the full capabilities, COMPUTE_CAPABILITY=50;52;60;61;62;70;72;75;80;86;87;89;90
  - with this method, build time is reduced from ~40min to ~25min
- build cuda12.1 only for pull_request, full build for refs/tags/*
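
The per-trigger selection described above could be sketched in shell roughly as follows. This is only an illustration, not the workflow's actual implementation: the function names are hypothetical, the short list 61;75;86;89 is taken to be the concise one, and treating any event other than a tag push like a pull request is an assumption. `GITHUB_EVENT_NAME` and `GITHUB_REF` are the standard variables GitHub Actions sets for a run.

```shell
# Hypothetical helpers; names are illustrative and not taken from the workflow.
# $1 = event name (e.g. $GITHUB_EVENT_NAME), $2 = git ref (e.g. $GITHUB_REF)

select_compute_capability() {
    case "$1:$2" in
        pull_request:*)
            # Pull requests: concise list (popular GPUs only).
            echo "61;75;86;89" ;;
        *:refs/tags/*)
            # Tagged releases: full capability list.
            echo "50;52;60;61;62;70;72;75;80;86;87;89;90" ;;
        *)
            # Assumption: every other event gets the concise list too.
            echo "61;75;86;89" ;;
    esac
}

select_cuda_versions() {
    case "$1:$2" in
        pull_request:*)
            # Pull requests: CUDA 12.1 only.
            echo "12.1.1" ;;
        *:refs/tags/*)
            # Tagged releases: every version in the build matrix.
            echo "11.8.0 12.1.1" ;;
        *)
            # Assumption: every other event builds CUDA 12.1 only.
            echo "12.1.1" ;;
    esac
}

# Example: print the configure command for the current run.
event="${GITHUB_EVENT_NAME:-pull_request}"
ref="${GITHUB_REF:-refs/pull/0/merge}"
cap="$(select_compute_capability "$event" "$ref")"
echo "cmake -B build -G Ninja -DCOMPUTE_CAPABILITY=\"$cap\""
```

Keeping the choice in one place makes it easy to see which runs pay for the full capability list and which do not.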

aarch64 build issue

Build time using docker + aarch64 seems too slow. The current total build time is about 30~40min.

- a normal build takes about 5~10min; the aarch64 build on docker takes about 25~30min
- we can use a cross-compiler to fix the build speed issue,
- or we can reduce COMPUTE_CAPABILITY for pull_requests

outdated or resolved
~~CPU builds with aarch64 gcc/g++ are fine, but for cuda builds there is no benefit as docker itself does not have arm64 libraries installed and it will fail to link correctly.~~ resolved.

Some of @rickardp's changes, whether intended or not, may be better done in python-package.yml. However, they cause some serious problems, and because it is important to quickly fix things that were fine before, I am submitting this PR.

~~P.S.: PR #1018 is not included.~~

github-actions · 2024-02-05T20:38:47Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

akx

At first glance, this basically reverts a lot of Rickard's work - could you explain why the changes are needed? There also seems to be a bunch of changes that aren't really "fix regression"..?

wkpark · 2024-02-05T21:33:47Z

.github/workflows/python-package.yml

+  push:
+    branches: [ "main" ]


push needs to be done for specific branches normally. (reverted)

Having no branch filter here means CI gets run on pushes that don't have a PR too, so you can test things out in your own branch.

Regarding branch filters etc., feel free to suggest changes. The decisions were deliberate, but they were mine, and there may be other considerations. My idea was that all PRs should be built here, and that forks also get built and validated. I was planning to add the concurrency part. I deliberately left it out because the pipelines were in flux and it was annoying to have builds cancelled all the time, but I think it makes sense to add it.

wkpark · 2024-02-05T21:59:35Z

> At first glance, this basically reverts a lot of Rickard's work - could you explain why the changes are needed? There also seems to be a bunch of changes that aren't really "fix regression"..?

As I already mentioned, this is a quick, manual revert to fix the current regressed state.

akx · 2024-02-05T22:18:02Z

The missing _cpu tag might well be a regression, but this PR seems to be putting the Mamba/Conda stuff back in (why?), adding a setuptools and wheel in the shared-libs step, etc.

Wouldn't it be better to just fix things for the better instead of a "quick revert" that someone will need to fix again later? Since there's a lot of activity in this repo just now, and we're working in a state where we don't really have the luxury to actually properly test anything, I think we should accept the fact that main might be broken.

akx · 2024-02-05T23:57:52Z

Could you maybe look into other options than mamba? I mean, this repo itself has install_cuda.py and install_cuda.sh scripts which apparently are meant to install the CUDA Toolkit – maybe we should use those?

rickardp · 2024-02-06T15:30:53Z

.github/workflows/python-package.yml

-            arch: aarch64
+        os: [windows-latest]
+        arch: [x86_64]
+        cuda-version: ['11.8.0', '12.1.1']


I deliberately left CUDA 11 out of the pipelines so as not to slow them down before the dust had settled. The main branch is very active at the moment. I don't see a problem adding it, but we must realize that we burn CPU minutes quite fast now. I did not consider this a regression since Python packages were published outside of pipelines before (and still are, we only build them).

rickardp · 2024-02-06T15:32:08Z

A comment from me: possibly #949 was merged too quickly, without aligning on the scope. Sorry for not realizing this! May I suggest discussing improvements in the RFCs, maybe in #1031?

I would argue the easiest way to manage CUDA dependencies is through Docker. What is currently lacking is Dependabot-managed updates for the CUDA versions. I was actually planning to fix this in a follow-up PR; it involves creating Dockerfiles for the CUDA images, as this is what Dependabot can work with. We need to continuously monitor for updates because NVIDIA deprecates CUDA versions fast, and Dependabot is the realistic way to handle this.

The problem here is Windows. While there seem to be some community-based Windows containers, AFAIK hosted runners won't run Windows containers. The community GitHub action does a decent job, but it's slow, owing to the fact that it runs the horribly slow CUDA installer. Some kind of image-based approach (just like Docker for the Linux builds) would be the best solution.

rickardp · 2024-02-06T21:27:45Z

.github/workflows/python-package.yml

      # Check out dependencies code
    - uses: actions/checkout@v4
      name: Check out NVidia cub
      with:
        repository: nvidia/cub
        ref: 1.11.0
        path: dependencies/cub
-      # Compile C++ code
-    - name: Build C++
+    - name: Setup Mambaforge


I am still not convinced mamba is the way to go.

Also, there's a lot of duplication here w.r.t the Docker flow. What is the reason we need to change this?

Same, it seems overly complicated for not much obvious benefit to me. I like to be able to do something simple like pip install -e .[dev] and use the cache on it.

.github/workflows/python-package.yml

- fix custom command
- fix *_OUTPUT_DIRECTORY

* fix to support cmake -B build option
* add cuda 11.8, 12.1
* use python-version==3.10 for builds.
* fix wheel names for aarch64
* drop artifact retention-days: it could be configured Settings->Actions->General.
* make wheel with all available cuda versions.
* use docker-run-actions
* update docker image version

Titus-von-Koeller · 2024-02-21T14:41:41Z

Hey @wkpark, @akx, @matthewdouglas, @rickardp,

Yes, agreed, #949 was merged a bit hastily, due to a misunderstanding and a too-quick trigger finger on my side. Sorry for that! And thanks everyone for the collective cleanup / hotfix actions that ensued and got everything back to working order.

Based on your discussion above, I'm now not sure if this PR is still WIP or if you all agree that it's ready for review and merge? If not, what do we need to still implement or agree on in order to move forward?

akx · 2024-02-21T15:14:22Z

.github/workflows/python-package.yml

+        # fix wheel name
+        if [ "${{ matrix.arch }}" = "aarch64" ]; then
+            o=$(ls dist/*.whl)
+            n=$(echo $o | sed 's@_x86_64@_aarch64@')
+            [ "$n" != "$o" ] && mv $o $n
+        fi


This looks like a hack? Why is the wheel named x86_64... if it contains aarch64 material? IOW, are you sure it contains aarch64 material if it's named wrong?

@akx So to me it looks like the lib is built on aarch64 via Docker, but this step of packaging the artifacts for a wheel is being executed on an x86-64 host always.

I agree it seems hacky. Maybe using something like wheel tags $o --platform-tag=manylinux2014_aarch64 will feel a little better.
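
For reference, a slightly more defensive take on the rename step under review might look like the sketch below. The function name is hypothetical and not part of the PR. It handles several wheels and unusual file names, and it only touches files whose platform tag ends in `_x86_64`. Like the original step, it renames the file only; it does not change the tags recorded inside the wheel, which is what a tool such as `wheel tags` is meant for.

```shell
# Hypothetical helper: rename every x86_64-tagged wheel in a directory to aarch64.
# $1 = directory holding the wheels (dist/ in the reviewed step)
retag_wheels_aarch64() {
    for whl in "$1"/*_x86_64.whl; do
        # When nothing matches, the pattern stays literal; skip it.
        [ -e "$whl" ] || continue
        # Swap only the trailing platform suffix of the file name.
        mv -- "$whl" "${whl%_x86_64.whl}_aarch64.whl"
    done
}

# Usage inside the workflow step would be something like:
#   retag_wheels_aarch64 dist
```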

rickardp · 2024-02-21T17:23:03Z

> Hey @wkpark, @akx, @matthewdouglas, @rickardp,
>
> Yes, agreed, #949 was merged a bit hastily, due to a misunderstanding and a too-quick trigger finger on my side. Sorry for that! And thanks everyone for the collective cleanup / hotfix actions that ensued and got everything back to working order.
>
> Based on your discussion above, I'm now not sure if this PR is still WIP or if you all agree that it's ready for review and merge? If not, what do we need to still implement or agree on in order to move forward?

TBH I thought this PR was abandoned and most of the stuff had been broken out to other PRs.

I'll have another look as I am not completely sure what issues remain that this PR solves

rickardp · 2024-02-21T17:25:31Z

> use ilammy/msvc-dev-cmd instead of microsoft/setup-msbuild
> microsoft/setup-msbuild is used for msbuild; see ilammy/msvc-dev-cmd.
>
> concurrency: fix restored

These are already on master, I think.

> update docker image version
> fixed deprecation warning for cuda12.1 docker image (cuda12.1.0 is deprecated but cuda12.1.1 is supported)

IMHO this is better handled like #1052, so that Dependabot can notify us when there are upgrades.

> support cmake -B build option correctly.

@wkpark I think you mentioned this a while back, but I don't see why this is necessary. To me it seems that it only complicates path handling in the cmake files. Source files and output are where they are anyway, and intermediates are already git-ignored.

> we can use cross-compiler to fix the build speed issue.

CUDA can't be cross-compiled, but we can use native aarch64 agents when they become publicly available.

> use python-version=3.10 for builds

IMHO we should test on all versions we support. But there was an idea floated a while back (possibly by @akx) about splitting building wheels from testing them. The building of wheels is really quick so it won't speed anything up, but it will reduce the storage use.

rickardp · 2024-02-21T17:32:45Z

> fix wheel names for aarch64
>
> drop artifact retention-days: it could be configured Settings->Actions->General.
>
> fail-fast: false restored.

I suggest we make these separate PRs. I think retention days we want to set separately per artifact type as some artifacts (wheels) are more useful than others (.so files). But agreed these need tuning.

> use cmake -G Ninja : ninja is more faster and popular. (triton use Ninja for example)

Same here. But I don't know if the speed of the build tool will matter, as most of the time is spent in nvcc. And it's an additional moving part. But maybe we can isolate the change and see its impact?

wkpark · 2024-02-23T10:42:24Z

> Hey @wkpark, @akx, @matthewdouglas, @rickardp,
>
> Yes, agreed, #949 was merged a bit hastily, due to a misunderstanding and a too-quick trigger finger on my side. Sorry for that! And thanks everyone for the collective cleanup / hotfix actions that ensued and got everything back to working order.
>
> Based on your discussion above, I'm now not sure if this PR is still WIP or if you all agree that it's ready for review and merge? If not, what do we need to still implement or agree on in order to move forward?

Several PRs have already been extracted and merged.
We can consider this PR a reference or a source of ideas; I do not intend to get it merged without further discussion or agreement.

wkpark changed the title ~~Fix regression~~ Fix regression + revert workflows Feb 5, 2024

wkpark force-pushed the fix-reg branch 2 times, most recently from 8d8747f to ece67ae Compare February 6, 2024 13:11

rickardp reviewed Feb 6, 2024

wkpark force-pushed the fix-reg branch from ece67ae to 5eef8bd Compare February 6, 2024 20:11

rickardp reviewed Feb 6, 2024

wkpark force-pushed the fix-reg branch from 5eef8bd to 6ab10e2 Compare February 7, 2024 13:44

wkpark changed the title ~~Fix regression + revert workflows~~ fix workflows Feb 7, 2024

wkpark changed the title ~~fix workflows~~ CI: fix workflows Feb 7, 2024

wkpark force-pushed the fix-reg branch 2 times, most recently from 8a5c804 to 95b1779 Compare February 7, 2024 15:13

wkpark mentioned this pull request Feb 7, 2024

Distribute pip wheels for the architecture they are built for #1043

Closed

rickardp mentioned this pull request Feb 7, 2024

Add concurrency lock to Python wheel workflow #1051

Merged

wkpark force-pushed the fix-reg branch 3 times, most recently from 66fd141 to 42942b4 Compare February 8, 2024 11:10

wkpark added 2 commits February 15, 2024 07:32

support cmake -B build option

56d866a

- fix custom command - fix *_OUTPUT_DIRECTORY

wkpark added 3 commits February 15, 2024 07:51

CI: reduce CAPABILITY for pull_requests

6c55541

CI: reduce cuda versions for pull_request

77f38ef

CI: build x86_64 only when pull_request to reduce build time

85f72bf

wkpark force-pushed the fix-reg branch from 42942b4 to 85f72bf Compare February 14, 2024 22:54

akx reviewed Feb 21, 2024

matthewdouglas mentioned this pull request Feb 23, 2024

(cmake) Update library output directory #1080

Merged

Titus-von-Koeller force-pushed the main branch 2 times, most recently from 9b72679 to 7800734 Compare July 27, 2024 13:16
