Initial kernel changes to support GaLore #1137
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
bitsandbytes/optim/adamw.py
self.prefetch_state(p)

if "rank" in group:
    self.update_step(group, p, gindex, pindex, return_updates=lor_update)
The main addition in this PR is the new return_updates kwarg. This will give us the update from AdamW in lor_update, and p.data will not be changed.
Corresponds to this step in Algorithm 1 from the paper:
lor_update = update(lor_grad)
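For context, here is a self-contained toy sketch of that step; all names are illustrative, and a plain scaled-gradient update stands in for the real AdamW update computed by the kernel:

```python
import torch

# Toy version of the GaLore step from Algorithm 1 (illustrative only):
# the update is computed in the low-rank subspace, then projected back.
def toy_galore_step(weight, grad, P, lr=1e-3):
    lor_grad = P.T @ grad        # project the gradient into the rank-r subspace
    lor_update = -lr * lor_grad  # "lor_update = update(lor_grad)" (AdamW in this PR)
    weight += P @ lor_update     # project back and apply (see the next snippet)
    return weight

m, n, r = 64, 32, 4
weight, grad = torch.randn(m, n), torch.randn(m, n)
P, _ = torch.linalg.qr(torch.randn(m, r))  # stand-in orthonormal projector
toy_galore_step(weight, grad, P)
```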
self.update_step(group, p, gindex, pindex, return_updates=lor_update)

# GaLore Projection Back
p.data.add_(state["projector"].project_back(lor_update))
From Algorithm 1 in the paper:
update = project_back(lor_update)
weight.data += update
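As a rough illustration of the projection-back step, here is a minimal SVD-based projector in the spirit of GaLore; the class name and method signatures below are assumptions made for this sketch, not the actual galore-torch or bitsandbytes API:

```python
import torch

class ToyProjector:
    """Rank-r projector in the spirit of GaLore (sketch only, not the real API)."""
    def __init__(self, rank: int):
        self.rank = rank
        self.ortho = None

    def project(self, full_rank_grad: torch.Tensor) -> torch.Tensor:
        # Rebuild the basis from the current gradient for simplicity; the real
        # projector only refreshes its basis every few hundred steps.
        U, _, _ = torch.linalg.svd(full_rank_grad, full_matrices=False)
        self.ortho = U[:, : self.rank]        # (m, r) orthonormal basis
        return self.ortho.T @ full_rank_grad  # low-rank gradient, shape (r, n)

    def project_back(self, lor_update: torch.Tensor) -> torch.Tensor:
        return self.ortho @ lor_update        # back to full rank, shape (m, n)

# Mirrors "update = project_back(lor_update); weight.data += update":
weight, grad = torch.randn(64, 32), torch.randn(64, 32)
proj = ToyProjector(rank=4)
lor_update = -1e-3 * proj.project(grad)       # stand-in for the AdamW update
weight += proj.project_back(lor_update)
```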
@matthewdouglas Tim said he could review your work this weekend.
Updated with changes added for 1-state optimizers (Momentum, RMSProp, Adagrad, Lion).
This looks like a solid, straightforward implementation. Good work, @matthewdouglas!
I wanted to overhaul the optimizers, since changing or adding implementations is a pain when everything is separated into 1-state and 2-state optimizers. You probably encountered this too, Matthew. However, it is probably okay to keep it like this for the time being and refactor if we add another change.
Separating the update computation from the actual update could, more generally, also make it easier to implement some new optimizers. But I think we can leave that to future work and favor getting GaLore out quickly together with the QLoRA fix.
The last remaining thing is testing. The original tests are all green, but a dedicated GaLore test would be good. Ideally it would run the original repo code against the bitsandbytes implementation.
Steps for that would be to add galore-torch to the dev dependencies and only execute the tests when the dependency is available, so other devs are not forced to run them if they do not have galore-torch installed.
Otherwise, the tests can mirror the tests that already exist and check whether the gradients are close to each other. For that you can probably just add the original GaLore and your GaLore to the dictionary of optimizers in test_optim.py and check that the errors are roughly in line with the other optimizer comparisons.
Let me know if you have any other concerns with this, but I think with a test this is all ready to go.
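A rough sketch of how such a gated test could look; the optimizer names, GaLore group keys, and tolerances below are placeholders and would need to match the final implementation and the existing parametrization in test_optim.py:

```python
import pytest
import torch

# Skip these tests entirely when the reference implementation is not installed.
galore_torch = pytest.importorskip("galore_torch")
import bitsandbytes as bnb

def test_galore_adamw_matches_reference():
    torch.manual_seed(0)
    p_ref = torch.randn(64, 32, device="cuda", requires_grad=True)
    p_bnb = p_ref.detach().clone().requires_grad_(True)
    galore_kwargs = dict(rank=4, update_proj_gap=200, scale=0.25, proj_type="std")

    opt_ref = galore_torch.GaLoreAdamW([dict(params=[p_ref], **galore_kwargs)], lr=1e-3)
    # Hypothetical: assumes the bitsandbytes optimizer honors the same GaLore group keys.
    opt_bnb = bnb.optim.AdamW([dict(params=[p_bnb], **galore_kwargs)], lr=1e-3)

    for _ in range(10):
        grad = torch.randn_like(p_ref)
        p_ref.grad, p_bnb.grad = grad.clone(), grad.clone()
        opt_ref.step()
        opt_bnb.step()

    torch.testing.assert_close(p_ref, p_bnb, rtol=1e-3, atol=1e-3)
```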
Hi @matthewdouglas, thanks for your great effort! I would like to follow up on this PR and finalize our integration as soon as possible. To test the GaLore benchmark easily, I created a new branch: https://github.com/jiaweizzhao/GaLore/tree/bitsandbytes, where I integrated your GaLore implementation into the most recent GaLoreAdamW8bit: https://github.com/jiaweizzhao/GaLore/blob/bitsandbytes/galore_torch/adamw8bit.py
For the environment, I installed both your bitsandbytes and the modified GaLore locally. However, when I tried to test a baseline, the following error came up:
[rank0]: File "/data/home/jwzhao/.conda/envs/galore_new/lib/python3.8/site-packages/torch/optim/optimizer.py", line 484, in wrapper
Do you have any idea why it occurred? It seems to only happen on this old PR. I also tried the latest bitsandbytes with regular GaLore and it works.
Thanks @jiaweizzhao! This indicates there was a problem loading the CUDA library. Were you able to build this part on this branch? If you try building it, let me know how it goes. I plan to follow up shortly and rebase onto main.
Not sure why I cannot correctly load the CUDA library using your scripts, @matthewdouglas. Maybe it is due to the machine I am using. Could you try to run a simple GaLore benchmark on your end? I have packed everything in this branch: https://github.com/jiaweizzhao/GaLore/tree/bitsandbytes. Once installed, you can simply run the benchmark from that branch.
It seems the problem is that I have no root access to compile from source (with pip install -e .) on my machine. Alternatively, if you could give me a compiled GaLore-enabled bitsandbytes package, I can also try it from my end. @matthewdouglas
@jiaweizzhao Which tool is throwing an error? With pip install -e . you should not need root access, since it installs into your own environment. Also, if you just want to install from a prebuilt package, that should work too.
This is a draft containing some of the initial changes to support GaLore. So far this covers 2-state optimizers.
Optimizer2State.update_step() now contains an additional argument, return_updates. When provided a tensor to hold the updates, they are returned there and p is not changed. Additionally, no weight decay is applied.
Needs tests; feedback welcome.
cc: @TimDettmers @jiaweizzhao @Titus-von-Koeller
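A short caller-side sketch of the return_updates behavior described above; it assumes a CUDA device and calls internal optimizer methods directly, so treat it as illustrative rather than the intended public usage:

```python
import torch
import bitsandbytes as bnb

# Sketch: obtain the raw AdamW update for a parameter without applying it.
p = torch.nn.Parameter(torch.randn(64, 32, device="cuda"))
opt = bnb.optim.AdamW([p], lr=1e-3)

p.grad = torch.randn_like(p)
opt.step()                     # one ordinary step, which also initializes the state

p.grad = torch.randn_like(p)
before = p.detach().clone()
updates = torch.zeros_like(p)  # buffer that receives the update
opt.update_step(opt.param_groups[0], p, 0, 0, return_updates=updates)

assert torch.equal(p, before)  # p.data unchanged; weight decay is skipped too
p.data.add_(updates)           # the caller applies (or projects) the update itself
```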