Add experimental install guide for ROCm #1550
Conversation
Thanks for this, @xzuyn. Would it be helpful if I handled this in the docker images for you? Do you use the docker images? @ehartford, does this line up with the AMD work that you've been doing?
I'm happy to test it.
The only time I use docker is with runpod, but then I'm using an NVIDIA GPU. Even though the setup is a little janky, the ROCm setup is fairly straightforward for me to do in a venv.
`main` branch is `0.41.3.post1`, so using `rocm` branch brings us to `0.42.0`
I updated the readme to use the Although arlo-phoenix now recommends using the official ROCm
So for now this is the latest I can confirm works for me.
I go the other way around.
Hi @xzuyn, thanks for the effort. I have been trying this on ROCm with no luck yet. Can we connect internally at AMD? Is it OK to put contact information on your profile? My email is: [email protected]
Without this you get `NameError: name 'amdsmi' is not defined`
Looks like it should work out of the box according to our docs: https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/amd_hpc.qmd
I have gotten axolotl working with DeepSpeed on ROCm (MI250X). Can assist with the install guide (the guide in production currently doesn't mention the bitsandbytes requirement and I believe flash-attn can now be directly pip installed). ROCm 6.1+ is essentially mandatory for most packages. https://huggingface.co/docs/bitsandbytes/main/en/installation?platform=ROCm#installation
Description
This adds a guide on how to install Axolotl for ROCm users.
Currently you need to install the packages included in `pip install -e '.[deepspeed]'`, then uninstall `torch`, `xformers`, and `bitsandbytes`, so you can then install the ROCm versions of `torch` and `bitsandbytes`. The process is definitely janky, since you install stuff you don't want just to uninstall it afterwards.
Installing the ROCm version of `torch` first to try to skip a step results in Axolotl failing to install, so the order given here is necessary without changes to `readme.txt` or `setup.py`.
Improvements could be made to this setup by modifying `setup.py` to include `[amd]` and `[nvidia]` options, which would prevent `torch`, `bitsandbytes`, and `xformers` from being installed by default. That way we would skip the install-then-uninstall step done before installing the packages required.
Motivation and Context
I still see people on places like Reddit asking if it's possible yet to train AI stuff using AMD hardware. I want more people to know it's possible, although still experimental.
How has this been tested?
Tested on my personal system: Ubuntu 22.04.4 with ROCm 6.1 and an RX 7900 XTX. I've been using Axolotl (and other PyTorch-based AI tools like kohya_ss) this way for months on various versions of PyTorch and ROCm without major issues.
The only time I've had issues was when ROCm 6.0.2 was released and training only output 0 loss after I upgraded. This might've just been an issue with how I upgraded.
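One quick way to confirm that the ROCm build of `torch` is the one that ended up installed is to inspect its version metadata. The sketch below is not part of this PR, and `classify_backend` and `check_torch_install` are hypothetical helper names. It assumes that `torch.version.hip` is set on ROCm builds, that `torch.version.cuda` is set on CUDA builds, and that ROCm builds expose the GPU through the `torch.cuda` namespace.

```python
import importlib.util


def classify_backend(hip_version, cuda_version):
    """Map the values of torch.version.hip / torch.version.cuda to a label."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


def check_torch_install():
    """Describe the installed torch build, or say that torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    backend = classify_backend(
        getattr(torch.version, "hip", None),
        getattr(torch.version, "cuda", None),
    )
    # ROCm builds of PyTorch report GPU availability via torch.cuda.
    gpu_ok = torch.cuda.is_available()
    return f"torch {torch.__version__}: backend={backend}, gpu_available={gpu_ok}"


if __name__ == "__main__":
    print(check_torch_install())
```

If the reported backend is `cuda` or `cpu` after following the install steps, the ROCm wheel was probably replaced by a later dependency install and the uninstall/reinstall step may need repeating.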
Screenshots (if appropriate)
Types of changes
This only makes additions to the `README.md` file.
Social Handles (Optional)