Failure in FSDP Benchmark Experiment using QLoRA with Custom Fused Modules #3
Comments
Maybe this can be addressed by distributing the adapters using DDP, just as it was done with the AutoGPTQ version.
Device Mapping Error
Turns out this error is thrown because the base_layer weights and the adapters end up on different devices.

Checking Device Preparation
Just before the FOAK patching, the model itself has already been cast to …. However, removing the FOAK patching seems to reverse the problem, and FSDP-QLoRA with low memory mode trains perfectly fine. My guess is that since the FOAK patch happens before the trainer prepares the model, the patching is performed on weights still residing on the CPU. As a temporary workaround, I cast the attention module to the GPU (see the sketch below).

Temp Workaround
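A hypothetical sketch of that workaround, assuming a Hugging Face/PEFT-style model whose decoder attention blocks are named `self_attn`; the helper name and module filter are illustrative, not the repo's actual code:

```python
import os
import torch

def cast_attention_to_gpu(model):
    """Move the attention submodules onto the local GPU before FOAK patching,
    so the fused kernels are not installed on weights still sitting on CPU."""
    device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", "0")))
    for name, module in model.named_modules():
        # the `self_attn` suffix is an assumption about the decoder layer naming
        if name.endswith("self_attn"):
            module.to(device)
    return model
```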
Testing Command
@achew010 are you sure that this fixes the BNB case? I realized I was getting the exact same error with the GPTQ case. The reason is #26, where in low_mem mode we no longer move the whole model directly to the GPU and we also exclude the adapters from FSDP, which is why the adapters stayed on CPU. So I fixed it in #29.
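For context, a rough sketch (not the project's actual code) of what excluding the LoRA adapters from FSDP sharding via `ignored_modules` can look like; the name filter and the `use_orig_params` setting are assumptions:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_with_fsdp_ignoring_adapters(model):
    # Collect the LoRA adapter submodules so FSDP does not shard them;
    # in low_mem mode nothing moves them to GPU either, which is why they
    # can end up staying on CPU.
    adapter_modules = [
        module
        for name, module in model.named_modules()
        if "lora_A" in name or "lora_B" in name
    ]
    return FSDP(model, ignored_modules=adapter_modules, use_orig_params=True)
```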
@achew010 Update: The root cause is not the LoRA weights staying on CPU; you can try the following:
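For example, a minimal device check along these lines (the parameter-name filters are assumptions about a typical PEFT/bitsandbytes model):

```python
def report_param_devices(model):
    # Print where the quantized base-layer weights and the LoRA adapters live.
    for name, param in model.named_parameters():
        if "base_layer" in name or "lora_A" in name or "lora_B" in name:
            print(f"{name}: device={param.device}, dtype={param.dtype}")
```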
We can see this after my fix in #29: the base layer weights are on the GPU.
I think the real issue is that BNB QLoRA does not work with FSDP low memory mode, and we need to fix it from the root cause. I feel the workaround is dangerous because in FSDP the parameters are constantly being sharded and unsharded, so putting a manual device cast in the middle of that is risky.
Update
@fabianlim you are right, QLoRA doesn't work with FSDP and low memory mode; the weights stay in …. For GPTQ, the fix in #29 resolved the casting of adapters in ignored modules to the GPU. I was thinking to load the QLoRA weights in ….
This has been addressed by #31.
Problem
Distributed experiments in the benchmarks fail when using BNB's nf4 QLoRA with Unsloth fused module optimizations.

Cause
Distributed experiments for BNB's nf4 QLoRA without the fused module optimizations do not throw any errors. Suspected incompatibility between FSDP, the BNB kernels, and Unsloth's matmul.

Stacktrace from test repo:
Setting the debug environment variable CUDA_LAUNCH_BLOCKING=1 produces this error: an illegal memory access was encountered at line 90 in file /src/csrc/ops.cu. This is traced to the dequantizeBlockwise CUDA function.
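One way to enable that debugging flag, shown here as a sketch (it can equally be exported in the shell before launching the benchmark):

```python
import os

# Force synchronous CUDA kernel launches so the illegal memory access is
# reported at the offending call site (e.g. dequantizeBlockwise) rather than
# at a later, unrelated operation. Must be set before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after setting the variable, before any CUDA calls
```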
Reproduce