Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add rocm builds and documentation #1012

Merged
merged 41 commits into from
Dec 13, 2023
Merged

Conversation

cromefire
Copy link
Contributor

Fixes #636

Proper display in Telemetry and UI not included, see #902 for that.

@cromefire cromefire changed the title Add rocm builds and documentation feat: Add rocm builds and documentation Dec 10, 2023
Copy link
Member

@wsxiaoys wsxiaoys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure that the pull request (PR) focuses on a single part. I would suggest that you only modify the Docker-related CI configuration files (.yml/.Dockerfile) in this PR to facilitate a smoother merge.

.github/workflows/release.yml Outdated Show resolved Hide resolved
.github/workflows/release.yml Outdated Show resolved Hide resolved
@cromefire
Copy link
Contributor Author

I would suggest that you only modify the Docker-related CI configuration files (.yml/.Dockerfile) in this PR to facilitate a smoother merge.

Well I mean documentation seems to be a pretty important part of it and I wouldn't think there are many issues there. Sure I can split it again, but I'm sinking many hours into testing and splitting this, there's a limit somewhere to what I can manage as well. This already consumed most of my Sunday now.

@wsxiaoys
Copy link
Member

wsxiaoys commented Dec 10, 2023

I would suggest that you only modify the Docker-related CI configuration files (.yml/.Dockerfile) in this PR to facilitate a smoother merge.

Well I mean documentation seems to be a pretty important part of it and I wouldn't think there are many issues there. Sure I can split it again, but I'm sinking many hours into testing and splitting this, there's a limit somewhere to what I can manage as well. This already consumed most of my Sunday now.

I fully understand and would like to express my gratitude once again for your contribution :)

I mainly refer to certain changes (e.g., Rust 1.73 -> 1.8, manylinux build). These are all valuable, but they require careful testing to ensure that everything still works.

The documentation part is fine to retain; I will provide comments once we reach a good point in the remaing part.

@cromefire
Copy link
Contributor Author

I mainly refer to certain changes (e.g., Rust 1.73 -> 1.8, manylinux build). These are all valuable, but they require careful testing to ensure that everything still works.

Yeah sure I can just downgrade that again. Thought as docker uses latest stable anyway, might as well update that.

@cromefire
Copy link
Contributor Author

Well the manylinux image for rocm now seems to run into a disk space limit, could you maybe try and delete some caches or so? In theory it's much smaller (< 6GB) than the cuda (~10GB) image, which is why I'm slightly confused about it... Maybe it's running an parallel on the same VM?

@wsxiaoys
Copy link
Member

wsxiaoys commented Dec 11, 2023

Well the manylinux image for rocm now seems to run into a disk space limit, could you maybe try and delete some caches or so? In theory it's much smaller (< 6GB) than the cuda (~10GB) image, which is why I'm slightly confused about it... Maybe it's running an parallel on the same VM?

I'll suggest skip manylinux build atm, and checkin docker image in this PR.

@cromefire
Copy link
Contributor Author

cromefire commented Dec 12, 2023

I think it's just a fluke in the build and the build has to be restarted (as it worked before with a bigger container). Can't do that myself though.

Need to update anyway, so we'll see if it wants to run this time.

@cromefire
Copy link
Contributor Author

Found the error: The size on docker hub is compressed and apparently ROCm is quite big, but compresses super well plus the manylinux and pytorch also added magma and MLOpen to it...

Got everything working by installing only the minimum needed on the fly. The biggest problem is that this adds ~4min to the ROCm build and I think the ROCm repo might also be slower than today sometimes...

As a solution I have prepared this: https://github.com/cromefire/hipblas-manylinux. Super basic docker build setup, but if you want me to switch to that, I'd recommend we move the repo to this org, so it's easier to maintain (basically to only thing required would be to upgrade to the latest ROCm version from time to time) as I probably won't always remember to upgrade it to the latest ROCm version.

@cromefire
Copy link
Contributor Author

Also Windows build is possible as well (HIP install for windows can be found here), but I'll probably just open an issue for that, because I don't think I have the bandwidth right now to deal with windows, maybe one day...

website/docs/faq.mdx Outdated Show resolved Hide resolved
@wsxiaoys wsxiaoys enabled auto-merge (squash) December 13, 2023 07:26
@wsxiaoys
Copy link
Member

Hey - I reverted most docker related part - given the manylinux build now works and I’m thinking of re-organize the docker image build to be around these manylinux binaries directly - will do it as a followup from my side.

The other part LGTM, thanks for your effort!

@wsxiaoys wsxiaoys disabled auto-merge December 13, 2023 07:28
@cromefire
Copy link
Contributor Author

Yeah as said though, we probably want to transfer the image repo for ROCm to this org, so you don't have to wait on me for any upgrades and it's all under your control.

@cromefire
Copy link
Contributor Author

cromefire commented Dec 13, 2023

And also please at least restore the Dockerfile themselves for manual building, as they are an easy way for people to build their own optimized images quickly. For ROCm this is important as you may have seen these AMDGPU_TARGETS, and while it includes most GPUs, it doesn't include all, so having a way for people to just build it quickly for their specific GPU would be really important, that you I always built and tested all of this as well.

@cromefire
Copy link
Contributor Author

Also because everything, including compilers, is old and crusty you'd probably actually want to have an optimized build in docker which can use all the newer tools there, including the latest compiler optimizations and mitigations, after all, you don't have to worry about compatibility.

@wsxiaoys wsxiaoys merged commit 6ef9040 into TabbyML:main Dec 13, 2023
8 checks passed
@cromefire
Copy link
Contributor Author

It would also enable you to adjust the build, so it builds the ROCm libraries from source (they are open source after all), so you could build everything just for your GPU, which would probably shrink the container from like 10GB+ to more like 1-2GB.

@wsxiaoys
Copy link
Member

For ROCm this is important as you may have seen these AMDGPU_TARGETS, and while it includes most GPUs, it doesn't include all, so having a way for people to just build it quickly for their specific GPU would be really important, that you I always built and tested all of this as well.

Since the user need to build and test it manually - I'll prefer just leave it in source code.

Also because everything, including compilers, is old and crusty you'd probably actually want to have an optimized build in docker which can use all the newer tools there, including the latest compiler optimizations and mitigations, after all, you don't have to worry about compatibility.

Let me think about this a bit more - but I'm still a bit leaning to just copy manylinux binary to containers to make build process simpler

@cromefire
Copy link
Contributor Author

Since the user need to build and test it manually - I'll prefer just leave it in source code.

Yeah, not saying you have to base your official docker builds on it, but it'd be good to at least have the Dockerfile there for people to use to build specific builds or just for development. As said I basically never compiled it on bare metal as I initially had some build issues and in docker it just worked. (And you'd also have to install ROCm on your machine directly)

@wsxiaoys
Copy link
Member

That's a fair point - essentially it served as devcontainer.

@cromefire
Copy link
Contributor Author

cromefire commented Dec 13, 2023

Something like that, that's the nice thing with that, you don't need to install anything for it, you also don't even need to mess around with devcontainers, which are still a bit finicky on JetBrains IDEs, you just say docker build and it works repeatably.

@cromefire cromefire deleted the rocm-release branch December 22, 2023 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AMD GPUs (ROCm) support
2 participants