Built-in script to quantize with AutoGPTQ then push to Huggingface #491
My WIP code that I've been using personally; happy to share and integrate it with Axolotl 😃 I'd appreciate any code review feedback, though, since I'm less familiar with Python and these libraries. Adapted from https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py. With help from ChatGPT4, here's what I have so far, and it has been working successfully (the CHANGE_ME variables are untested):

# pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
import json
import random
import time
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, LlamaTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from axolotl.prompters import AlpacaPrompter
import logging
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    level=logging.DEBUG,
    datefmt="%Y-%m-%d %H:%M:%S",
)
print("Done importing...")
## CHANGE BELOW ##
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
dataset_name = "teknium/GPT4-LLM-Cleaned"
huggingface_username = "CHANGE_ME"
## CHANGE ABOVE ##
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)
# TEMPLATE = "<|prompt|>{instruction}</s><|answer|>"
prompter = AlpacaPrompter()
# def load_data(data_path, tokenizer, n_samples, template=TEMPLATE):
def load_data(data_path, tokenizer, n_samples):
    # Load dataset
    dataset = load_dataset(data_path)

    if "train" in dataset:
        raw_data = dataset["train"]
    else:
        raw_data = dataset

    # Sample from the dataset if n_samples is provided and less than the dataset size
    if n_samples is not None and n_samples < len(raw_data):
        raw_data = raw_data.shuffle(seed=42).select(range(n_samples))

    def tokenize(examples):
        instructions = examples["instruction"]
        outputs = examples["output"]

        prompts = []
        texts = []
        input_ids = []
        attention_mask = []
        for input_text, output_text in zip(instructions, outputs):
            # prompt = template.format(instruction=input_text)
            # prompt = next(prompter.build_prompt(instruction=input_text, output=output_text))
            prompt = next(prompter.build_prompt(instruction=input_text))
            text = prompt + output_text

            # Skip examples whose prompt alone already exceeds the model's context length
            if len(tokenizer(prompt)["input_ids"]) >= tokenizer.model_max_length:
                continue

            tokenized_data = tokenizer(text)

            input_ids.append(tokenized_data["input_ids"][: tokenizer.model_max_length])
            attention_mask.append(tokenized_data["attention_mask"][: tokenizer.model_max_length])
            prompts.append(prompt)
            texts.append(text)

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "prompt": prompts,
            "text": texts,
        }

    raw_data = raw_data.map(
        tokenize,
        batched=True,
        batch_size=len(raw_data),
        num_proc=1,
        keep_in_memory=True,
        load_from_cache_file=False,
        # remove_columns=["instruction", "input"]
    )

    # Convert to PyTorch tensors
    raw_data.set_format(type="torch", columns=["input_ids", "attention_mask"])
    # for sample in dataset:
    #     sample["input_ids"] = torch.LongTensor(sample["input_ids"])
    #     sample["attention_mask"] = torch.LongTensor(sample["attention_mask"])

    return raw_data
def get_tokenizer():
    print("Loading tokenizer...")
    # NOTE: LlamaTokenizer assumes a Llama-based model; use AutoTokenizer for other architectures
    # tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    return tokenizer
def get_model():
    print("Loading model...")
    # load the un-quantized model; by default, the model is always loaded into CPU memory
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
    print("Model loaded.")
    return model
def get_quantized_model():
    print("Loading quantized model...")
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
    print("Model loaded.")
    return model
def quantize(model, examples_for_quant):
    print("Quantize...")
    start = time.time()

    # quantize the model; the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
    model.quantize(
        examples_for_quant,
        batch_size=1,
        # batch_size=args.quant_batch_size,
        # use_triton=args.use_triton,
        # autotune_warmup_after_quantized=args.use_triton
    )
    end = time.time()
    print(f"quantization took: {end - start: .4f}s")

    # save quantized model
    print("Saving quantized model...")
    # model.save_quantized(quantized_model_dir)
    model.save_quantized(quantized_model_dir, use_safetensors=True)
    print("Saved.")

    return model
def push_model(model):
    # push the quantized model to the Hugging Face Hub.
    # to use use_auth_token=True, log in first via `huggingface-cli login`,
    # or pass an explicit token with: use_auth_token="hf_xxxxxxx"
    print("Pushing to Huggingface hub...")
    repo_id = f"{huggingface_username}/{quantized_model_dir}"
    commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, use_safetensors=True, safe_serialization=True)
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, safe_serialization=True)
    model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, use_safetensors=True)
# def push_tokenizer(tokenizer):
def main():
    print("Starting...")
    # prompt = "<|prompt|>How can entrepreneurs start building their own communities even before launching their product?</s><|answer|>"
    should_quantize = True

    tokenizer = get_tokenizer()

    if should_quantize:
        print("Quantizing...")
        model = get_model()
        examples = load_data(dataset_name, tokenizer, 128)
        # print(examples)

        examples_for_quant = [
            {"input_ids": example["input_ids"], "attention_mask": example["attention_mask"]}
            for example in examples
        ]
        # print(examples_for_quant)

        modelq = quantize(model, examples_for_quant)
    else:
        print("Loading quantized model...")
        modelq = get_quantized_model()

    push_model(modelq)


if __name__ == "__main__":
    main()
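If this gets integrated into Axolotl, one cleanup I'd consider is replacing the CHANGE_ME globals with CLI arguments. A rough, untested sketch of what that could look like (argument names here are just placeholders, not an agreed interface):

```python
import argparse


def parse_args():
    # Hypothetical CLI front-end for the CHANGE_ME globals above; names are placeholders
    parser = argparse.ArgumentParser(
        description="Quantize a model with AutoGPTQ and push it to the Hugging Face Hub"
    )
    parser.add_argument("--pretrained-model-dir", required=True)
    parser.add_argument("--quantized-model-dir", required=True)
    parser.add_argument("--dataset-name", default="teknium/GPT4-LLM-Cleaned")
    parser.add_argument("--huggingface-username", required=True)
    parser.add_argument("--n-samples", type=int, default=128)
    parser.add_argument("--bits", type=int, default=4)
    parser.add_argument("--group-size", type=int, default=128)
    parser.add_argument("--desc-act", action="store_true")
    return parser.parse_args()
```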
As I understand it, it's also possible to quantize via GPTQ-for-LLaMa; I personally use that method. I'm not sure what the difference between them is. Perhaps we'll need TheBloke to clarify?
I was unsuccessful quantizing with https://github.com/qwopqwop200/GPTQ-for-LLaMa. When I posted in TheBloke's Discord server and spoke to TheBloke, he recommended AutoGPTQ: https://discord.com/channels/1111983596572520458/1117037259603066980/1130502655500881970
Interesting to know. Last time I talked with him, he used GPTQ-for-LLaMa. I guess times have changed.
Update: Quantization working!
PR ready for review/testing/feedback: #545
🔖 Feature description
The final model uses much less VRAM and gives faster inference speeds (especially with ExLlama).
✔️ Solution
See https://github.com/PanQiWei/AutoGPTQ/tree/main#quantization-and-inference
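For a sense of what inference with the quantized output looks like, here is a minimal sketch following the basic-usage example in the AutoGPTQ README; the model and output directory names simply reuse the placeholders from the script above and would need to be changed:

```python
# Minimal inference sketch based on the AutoGPTQ README basic-usage example.
# "facebook/opt-125m" / "opt-125m-4bit" are the placeholder names from the script above.
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=True)
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0")

# the quantized model can be wrapped in a standard transformers text-generation pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```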