[Feature request] Support GPTQ quantization #39
I tried adding:
to the end of that line (and made similar additions in two other places that came up), and it hasn't crashed, but it's now taking a very long time to load the model, so maybe it's doing some unwanted conversion?
Yeah, it finally 'loaded', but then it said some weights of the model checkpoint were not used when initializing LlamaForCausalLM, and it listed a giant list of weights, which I'm guessing was all of them. Then the LoRA training crashed with:
So something definitely did not go well.
@araleza Oh no, I don't think GPTQ models are supported as of yet.
Currently only QLoRA via bitsandbytes is supported, hence all the error messages. If GPTQ is a super popular request, I will add it in - the dequantization steps will just be replaced, but I will have to read up on how GPTQ does it internally. For now, is it possible to use a non-GPTQ quantized model?
I don't know actually... I've only done LoRA training with oobabooga's Training tab, and it can only do LoRA training with unquantized models or GPTQ models (which you have to load with the Transformers loader). So I don't know how to load a quantized model of any format except GPTQ onto my GPU. Any tips for which format to use instead, but still have it fit on my 24GB GPU?
@araleza Would it be possible to try loading a non-quantized model, then pass
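(For anyone following along, this presumably refers to letting bitsandbytes quantize a full-precision checkpoint on the fly. A minimal sketch of that flow, with the model name only as an example and the argument names taken from the test script later in this thread:)

```python
from unsloth import FastLanguageModel

# Sketch: load a non-GPTQ checkpoint and let bitsandbytes quantize it to 4-bit
# at load time (the QLoRA path that is currently supported).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # example; any fp16 HF repo should also work
    max_seq_length=2048,
    dtype=None,           # autodetect fp16 / bf16
    load_in_4bit=True,    # on-the-fly bitsandbytes 4-bit quantization
)
```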
I'll see for a future release if I can add GPTQ support!
I was actually just reading up on HQQ (half-quadratic quantization) https://github.com/mobiusml/hqq and maybe I'll be adding HQQ instead of GPTQ, since HQQ has no need for data calibration, whilst GPTQ does.
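(To illustrate the difference, here is a toy sketch in plain PyTorch of calibration-free group-wise 4-bit quantization. This is not the actual HQQ algorithm, which additionally refines the scales/zero-points with a half-quadratic solver; the point is only that no calibration data is needed, whereas GPTQ optimizes against activations from a calibration set.)

```python
import torch

def quantize_groupwise_4bit(W, group_size=64):
    """Toy calibration-free 4-bit affine quantization (illustrative, not HQQ itself)."""
    Wg = W.reshape(-1, group_size)                      # one (scale, zero) pair per group
    w_min = Wg.min(dim=1, keepdim=True).values
    w_max = Wg.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0      # 4 bits -> 16 levels
    zero = w_min
    q = torch.clamp(torch.round((Wg - zero) / scale), 0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_groupwise_4bit(q, scale, zero, shape):
    return (q.float() * scale + zero).reshape(shape)

W = torch.randn(4096, 4096)
q, s, z = quantize_groupwise_4bit(W)
print((W - dequantize_groupwise_4bit(q, s, z, W.shape)).abs().mean())  # small reconstruction error
```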
Sounds good. I think you've got two groups of people who want to use your software:
Supporting HQQ would help the people in group 2, like me.
@araleza Cool, I'll get on with HQQ! It seems like even Mixtral can supposedly fit on a 24GB card! And HQQ supports 8, 4, 3 and 2 bit quantization, so it'll be pretty useful!
@danielhanchen happy to pitch in with quantization (or other feature requests). Let me know how best to contribute!
@jeromeku More than happy to collaborate! I was actually taking a look at GPTQ the other day - I guess technically Unsloth can add in GPTQ during training - what we need is to port the dequantization kernels from GPTQ to float16 / bfloat16, and if that works, then GPTQ will be supported. Currently, I'm using bitsandbytes's dequantization kernels. Again, more than happy to collaborate if you're interested!
@danielhanchen I can take a crack at this if you're more keen on working on
@jeromeku I'll investigate GPTQ's dequant kernels as well! But if you're interested in adding GPTQ support - I'm more than happy for a few more OSS collaborators! Essentially in terms of the main gist of things:
If you wanna take a crack at that - I'll be super grateful! In fact just step 1 or 2 is enough for a general GPTQ integration!
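(In case it helps anyone looking at this, here is a rough sketch of what a GPTQ 4-bit dequantize-to-fp16 routine looks like. This is my own reconstruction, not code from Unsloth or AutoGPTQ, and the packing layout is an assumption that differs between GPTQ versions - treat it as pseudocode with types.)

```python
import torch

def dequantize_gptq_4bit(qweight, qzeros, scales, g_idx):
    """Rough sketch of AutoGPTQ-style 4-bit dequantization.
    Assumed layout: qweight int32 (in_features // 8, out_features), 8 nibbles per
    int32 along in_features; qzeros int32 (n_groups, out_features // 8), 8 nibbles
    along out_features; scales (n_groups, out_features); g_idx (in_features,)
    mapping each input row to its group."""
    shifts = torch.arange(0, 32, 4, device=qweight.device, dtype=torch.int32)
    # unpack the 4-bit weights along the in_features dimension
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF   # (in//8, 8, out)
    w = w.reshape(-1, qweight.shape[1]).float()                 # (in, out)
    # unpack the 4-bit zero points along the out_features dimension
    z = (qzeros.unsqueeze(-1) >> shifts.view(1, 1, -1)) & 0xF   # (groups, out//8, 8)
    z = z.reshape(qzeros.shape[0], -1).float()                  # (groups, out)
    # older GPTQ checkpoints store zero points offset by 1
    W = scales[g_idx].float() * (w - (z[g_idx] + 1.0))          # (in, out), i.e. W^T
    return W.to(scales.dtype)
```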
@danielhanchen
@jeromeku Great! If you need any help - ask away! I guess we can use this Github issue as a central discussion area. I'll see if I have some time on GPTQ - probably next week ish - I'm trying to work on some other stuff currently. Again thanks!
Trying to understand the design decisions / coding style of the library. What is the purpose of patching
Why the use of
@jeromeku prepatch essentially just patches some portions of each function to call their relevant efficient implementation - ie as you mentioned some
Oh ye, sorry about my coding style - I came from like a C++ / C background, so I generally like all functions / if / for loops etc to be "enclosed" to make it "look" compartmentalized. But you can have whatever coding style you like - for eg I like spaces between equals during variable assignments, whilst the general style is
If you're contributing code - I don't mind about style - that's the least of worries! :)) You can use any style you desire - it just has to work :)
Any tools / tests you use to check the correctness of gradient implementations?
@jeromeku Oh lol, what I do is get HF to do the training, copy paste the training losses to Google Sheets, then with ur updated gradient implementation, check if the new training loss is mostly identical. Another approach is to use
Ok, was wondering if there was a more efficient way to do this verification. Was trying to use
I've adapted GPTQ code to re-implement
A minimal way to check the gradient is being calculated correctly -- akin to a unit test -- without having to do a training run would be a worthwhile effort, both for existing and future implementations.
@jeromeku Actually I did technically make some functions to check gradients somewhere - I manually made some random inputs and some random outputs, then backpropagated with
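(One standard unit-test-style tool for this - my suggestion here, not necessarily what was referred to above - is torch.autograd.gradcheck, which compares the analytic backward against finite differences. It needs float64 inputs for tight tolerances, and the Function below is a toy stand-in rather than an Unsloth kernel:)

```python
import torch

# Minimal, self-contained gradient check for a custom autograd Function.
class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output

x = torch.randn(8, dtype=torch.float64, requires_grad=True)
print(torch.autograd.gradcheck(Square.apply, (x,)))  # True if backward matches finite differences
```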
I wrote a small test script to do gradient checking:

import torch
from datasets import load_dataset
# 4bit pre quantized models we support for 4x faster downloading!
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from torch.utils.data import DataLoader
from unsloth import FastLanguageModel
DTYPE = torch.float16
def get_model(
    model_id="unsloth/mistral-7b-bnb-4bit",
    reference=True,
    max_seq_length=2048,
    dtype=torch.float16,
    load_in_4bit=True,
    init_lora_weights=False,
    upcast=True,
):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_id,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    lora_config = LoraConfig(
        r=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights=init_lora_weights,
    )
    if reference:
        model = prepare_model_for_kbit_training(
            model,
            use_gradient_checkpointing=True,
            gradient_checkpointing_kwargs={"use_reentrant": True},
        )
        model = get_peft_model(model, lora_config)
    else:
        config = lora_config.to_dict()
        del config["task_type"]
        model = FastLanguageModel.get_peft_model(
            model,
            use_gradient_checkpointing=True,
            random_state=3407,
            max_seq_length=max_seq_length,
            upcast=upcast,
            **config,
        )
    return model, tokenizer
ref_model, _ = get_model(dtype=DTYPE)
test_model, _ = get_model(dtype=DTYPE, reference=False)
def check_grad(model, dtype, seed=0, scale=1):
    wrapped_model = model.model.model
    embed_layer = wrapped_model.embed_tokens
    self_attn = wrapped_model.layers[0].self_attn
    mlp = wrapped_model.layers[0].mlp
    torch.manual_seed(seed)
    with torch.autocast(device_type="cuda", dtype=dtype):
        # embeddings = embed_layer(inputs)
        embeddings = torch.randn(
            1, 1, embed_layer.weight.shape[1], dtype=dtype, requires_grad=True
        ).cuda()
        print(f"Attention input dtype: {embeddings.dtype}")
        attn_out, *_ = self_attn(embeddings)
        print(f"Attn out dtype: {attn_out.dtype}")
        mlp_out = mlp(attn_out)
    torch.manual_seed(seed)
    fake_grad_output = scale * torch.randn(mlp_out.shape, dtype=torch.float32).to(
        mlp_out.device
    )
    mlp_out.backward(fake_grad_output)
    return mlp_out, mlp, attn_out, fake_grad_output
mlp_out_ref, mlp_ref, attn_out_ref, fake_grad_ref = check_grad(ref_model, dtype=DTYPE)
print(
    "Grad check after reference backwards:",
    test_model.model.model.layers[0].mlp.down_proj.lora_B.default.weight.grad,
)
mlp_out, mlp, attn_out, fake_grad = check_grad(test_model, dtype=DTYPE)
ref_type = torch.float32
print()
print(
    f"Checking fake grad (dY): {torch.allclose(fake_grad.to(ref_type), fake_grad_ref.to(ref_type))}"
)
# torch.max(torch.abs(fake_grad.to(ref_type) - fake_grad_ref.to(ref_type)))
# torch.allclose(mlp_out.to(ref_type), mlp_out_ref.to(ref_type))
print("Checking mlp grads:")
for (n1, m1), (n2, m2) in zip(mlp.named_parameters(), mlp_ref.named_parameters()):
    if "lora" in n1 and "lora" in n2:
        n1 = ".".join(n1.split(".")[:2])
        print(f"{n1}")
        # report the mean gradient for both models so the label matches the stat
        print(
            f"Mean grad:\n UNSLOTH: {m1.grad.mean():.10f}\n REF: {m2.grad.mean():.10f}\nMax abs diff: {torch.max(torch.abs(m1.grad - m2.grad)):.10f}\nMean abs diff: {torch.mean(torch.abs(m1.grad - m2.grad)):.10f}"
        )
        print()
print("Checking attn grads:")
# zip (test, ref) in that order so the UNSLOTH / REF labels below are correct
for (n1, m1), (n2, m2) in zip(
    test_model.model.model.layers[0].self_attn.named_parameters(),
    ref_model.model.model.layers[0].self_attn.named_parameters(),
):
    if "lora" in n1 and "lora" in n2:
        # torch.allclose(m1.grad.to(dtype), m2.grad.to(dtype))
        n1 = ".".join(n1.split(".")[:2])
        print(f"{n1}")
        print(
            f"Mean grad:\n UNSLOTH: {m1.grad.mean():.10f}\n REF: {m2.grad.mean():.10f}\nMax abs diff: {torch.max(torch.abs(m1.grad - m2.grad)):.10f}\nMean abs diff: {torch.mean(torch.abs(m1.grad - m2.grad)):.10f}"
        )
        print()

Note: there are small inconsistencies between the reference and Unsloth code paths; I added an upcast option to the test harness (see get_model above) to experiment with this.
Here is the output from running the above script:
Thoughts?
@jeromeku Great work! Some pointers:
Yep, one issue is the upcasting to
You can see there are error differences - mainly due to Flash Attention - Pytorch does
I think the reference model you used does not have FA enabled. But ye - great work again - super useful script :)))
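(One possible way to make the comparison more apples-to-apples - my suggestion, not something from the thread, and it only affects code paths that go through torch.scaled_dot_product_attention rather than direct Flash-Attention/xformers calls - is to force PyTorch's exact math attention while running the check:)

```python
import torch

# Run both check_grad calls from the script above with the Flash and
# memory-efficient SDPA backends disabled, so attention-kernel differences
# don't leak into the gradient comparison.
# Note: torch.backends.cuda.sdp_kernel is deprecated in newer PyTorch releases
# in favour of torch.nn.attention.sdpa_kernel.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=True, enable_mem_efficient=False
):
    ref_outs = check_grad(ref_model, dtype=DTYPE)
    test_outs = check_grad(test_model, dtype=DTYPE)
```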
What do you consider a permissible range of gradient discrepancies between the implementations? I.e., there are differences (e.g.,
@jeromeku Ye, one of the issues I found as well when verifying Unsloth vs normal HF - that's why, for now, I opted to just compare training losses directly.
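(A trivial sketch of that loss comparison, with hypothetical file names - one loss value per line logged from an HF run and an Unsloth run with identical seeds and hyperparameters - instead of going through Google Sheets:)

```python
import numpy as np

ref = np.loadtxt("hf_losses.txt")        # hypothetical log from the HF baseline run
new = np.loadtxt("unsloth_losses.txt")   # hypothetical log from the Unsloth run
n = min(len(ref), len(new))
print("max abs loss diff:", np.max(np.abs(ref[:n] - new[:n])))  # expect small fp16/bf16 drift
```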
Just wanted to give a quick update:
@jeromeku Super great work! Are you testing it on a Tesla T4 or an Ampere based GPU? I found Triton kernels on older GPUs to be noticeably slower. Also, I found through experimentation that instead of writing 1 full fused kernel for matrix mult and dequantization, it's better to split it into 2. The dequant step should only take 1-2ms, whilst the matrix mult takes 30ms or so. The compiler can be "confused" on the dequant steps, causing it to not optimize correctly, so I found using
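(As a concrete illustration of that split - my sketch under the assumption of a bitsandbytes 4-bit weight; Unsloth's actual fast path and kernel boundaries may differ - dequantize with a small kernel and leave the GEMM to cuBLAS:)

```python
import bitsandbytes.functional as BF

def linear_4bit_split(x, weight_4bit):
    """Sketch: weight_4bit is assumed to be a bnb Params4bit (packed data plus
    quant_state). Step 1 is a small, memory-bound dequantize; step 2 is a plain
    fp16/bf16 matmul that runs on Tensor Cores and dominates the runtime."""
    W = BF.dequantize_4bit(weight_4bit.data, weight_4bit.quant_state)  # ~1-2 ms
    return x @ W.t()                                                   # bulk of the time
```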
@danielhanchen
@jeromeku Oh ok cool! If I have to guess, it's that NVCC / the Triton compiler is not optimizing "properly" - also did u use the matmul Triton autotuner? It could be that maybe?
@danielhanchen |
@jeromeku Ohh ok ok interesting - I'm just guessing somewhere the compiler is not optimizing the dequantization parts properly
Did some preliminary profiling using
All were 4-bit.
Summary results, sorted by
It seems the custom
Will draft a PR with the profiling script and documentation along with the current
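(For anyone wanting to reproduce this kind of table, a generic sketch using torch.profiler - an assumption on my part about the tooling; the actual profiling script is in the draft PR mentioned in this thread:)

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Hypothetical harness: profile one forward/backward step and print a
# kernel-level summary sorted by self CUDA time.
def profile_step(model, batch):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(**batch).loss.backward()
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
```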
@jeromeku LOVEE the detailed profiling!!! Just love it!! Great work again. Interesting - so the
Very interesting results! Did you manage to test a GPTQ dequantize-only kernel, but with Unsloth? I can see in Unsloth, matrix multiplies are taking 26% of all time, whilst GPTQ is 13%, Unsloth Triton is 3% (looks like overhead?) and HF + Triton is 1.5%. The goal is to move the majority of the time over to matrix multiplies in order to leverage the GPU's Tensor Cores :)) But anyways, I love the table and results - fabulous work!
Yes -- there seem to be some overhead issues with the unsloth
Just opened a draft PR with the changes.
So I have a GPTQ llama model I downloaded (from TheBloke), and it's already 4 bit quantized. I have to pass in False for the load_in_4bit parameter of:
because if I don't, I get an error thrown saying:
But, if I pass in False for load_in_4bit, this code makes bnb_config be None:
and that makes quantization_config be None as well:
and that crashes here:
with the error message:
So I'm not sure how to LoRA train this llama model. Any thoughts?
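(For context, a minimal sketch of the call pattern being described; the GPTQ repo name is only an example and not taken from the thread:)

```python
from unsloth import FastLanguageModel

# Hypothetical repro of the situation above: a pre-quantized GPTQ checkpoint
# can't go through the bitsandbytes 4-bit path, and with load_in_4bit=False the
# quantization config ends up as None, which then crashes downstream.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="TheBloke/Llama-2-7B-GPTQ",  # example GPTQ repo
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,  # passing True errors out for GPTQ checkpoints
)
```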