Built-in script to quantize with AutoGPTQ then push to Huggingface #491
My WIP code that I've been using personally; happy to share and integrate it with Axolotl 😃 I'd appreciate any code review feedback, though, since I'm less familiar with Python and these libraries. Adapted from https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py. With help from ChatGPT4, here's what I have so far, and it has been working successfully (the CHANGE_ME variables are untested):

# pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
import json
import random
import time
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, LlamaTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from axolotl.prompters import AlpacaPrompter
import logging
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    level=logging.DEBUG,
    datefmt="%Y-%m-%d %H:%M:%S",
)
print("Done importing...")
## CHANGE BELOW ##
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
dataset_name = "teknium/GPT4-LLM-Cleaned"
huggingface_username = "CHANGE_ME"
## CHANGE ABOVE ##
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)
# TEMPLATE = "<|prompt|>{instruction}</s><|answer|>"
prompter = AlpacaPrompter()
# def load_data(data_path, tokenizer, n_samples, template=TEMPLATE):
def load_data(data_path, tokenizer, n_samples):
    # Load dataset
    dataset = load_dataset(data_path)

    if "train" in dataset:
        raw_data = dataset["train"]
    else:
        raw_data = dataset

    # Sample from the dataset if n_samples is provided and less than the dataset size
    if n_samples is not None and n_samples < len(raw_data):
        raw_data = raw_data.shuffle(seed=42).select(range(n_samples))

    def tokenize(examples):
        instructions = examples["instruction"]
        outputs = examples["output"]

        prompts = []
        texts = []
        input_ids = []
        attention_mask = []
        for input_text, output_text in zip(instructions, outputs):
            # prompt = template.format(instruction=input_text)
            # prompt = next(prompter.build_prompt(instruction=input_text, output=output_text))
            prompt = next(prompter.build_prompt(instruction=input_text))
            text = prompt + output_text

            # Skip examples whose prompt alone already exceeds the model's context length
            if len(tokenizer(prompt)["input_ids"]) >= tokenizer.model_max_length:
                continue

            tokenized_data = tokenizer(text)

            input_ids.append(tokenized_data["input_ids"][: tokenizer.model_max_length])
            attention_mask.append(tokenized_data["attention_mask"][: tokenizer.model_max_length])
            prompts.append(prompt)
            texts.append(text)

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "prompt": prompts,
            "text": texts,
        }

    raw_data = raw_data.map(
        tokenize,
        batched=True,
        batch_size=len(raw_data),
        num_proc=1,
        keep_in_memory=True,
        load_from_cache_file=False,
        # remove_columns=["instruction", "input"]
    )

    # Convert to PyTorch tensors
    raw_data.set_format(type="torch", columns=["input_ids", "attention_mask"])
    # for sample in dataset:
    #     sample["input_ids"] = torch.LongTensor(sample["input_ids"])
    #     sample["attention_mask"] = torch.LongTensor(sample["attention_mask"])

    return raw_data
def get_tokenizer():
    print("Loading tokenizer...")
    # NOTE: LlamaTokenizer assumes a Llama-based model; use AutoTokenizer for other architectures
    # tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    return tokenizer
def get_model():
    print("Loading model...")
    # load the un-quantized model; by default, the model is always loaded into CPU memory
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
    print("Model loaded.")
    return model
def get_quantized_model():
    print("Loading quantized model...")
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
    print("Model loaded.")
    return model
def quantize(model, examples_for_quant):
    print("Quantize...")
    start = time.time()

    # quantize the model; the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
    model.quantize(
        examples_for_quant,
        batch_size=1,
        # batch_size=args.quant_batch_size,
        # use_triton=args.use_triton,
        # autotune_warmup_after_quantized=args.use_triton
    )
    end = time.time()
    print(f"quantization took: {end - start: .4f}s")

    # save quantized model
    print("Saving quantized model...")
    # model.save_quantized(quantized_model_dir)
    model.save_quantized(quantized_model_dir, use_safetensors=True)
    print("Saved.")

    return model
def push_model(model):
    # push the quantized model to the Hugging Face Hub.
    # to use use_auth_token=True, log in first via `huggingface-cli login`,
    # or pass an explicit token with: use_auth_token="hf_xxxxxxx"
    print("Pushing to Huggingface hub...")
    repo_id = f"{huggingface_username}/{quantized_model_dir}"
    commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, use_safetensors=True, safe_serialization=True)
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, safe_serialization=True)
    model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, use_safetensors=True)
# def push_tokenizer(tokenizer):
def main():
    print("Starting...")
    # prompt = "<|prompt|>How can entrepreneurs start building their own communities even before launching their product?</s><|answer|>"
    should_quantize = True

    tokenizer = get_tokenizer()

    if should_quantize:
        print("Quantizing...")
        model = get_model()
        examples = load_data(dataset_name, tokenizer, 128)
        # print(examples)

        examples_for_quant = [
            {"input_ids": example["input_ids"], "attention_mask": example["attention_mask"]}
            for example in examples
        ]
        # print(examples_for_quant)

        modelq = quantize(model, examples_for_quant)
    else:
        print("Loading quantized model...")
        modelq = get_quantized_model()

    push_model(modelq)


if __name__ == "__main__":
    main()
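If this gets integrated into Axolotl, one cleanup I'd consider is replacing the CHANGE_ME globals with CLI arguments. A rough, untested sketch of what that could look like (argument names here are just placeholders, not an agreed interface):

```python
import argparse


def parse_args():
    # Hypothetical CLI front-end for the CHANGE_ME globals above; names are placeholders
    parser = argparse.ArgumentParser(
        description="Quantize a model with AutoGPTQ and push it to the Hugging Face Hub"
    )
    parser.add_argument("--pretrained-model-dir", required=True)
    parser.add_argument("--quantized-model-dir", required=True)
    parser.add_argument("--dataset-name", default="teknium/GPT4-LLM-Cleaned")
    parser.add_argument("--huggingface-username", required=True)
    parser.add_argument("--n-samples", type=int, default=128)
    parser.add_argument("--bits", type=int, default=4)
    parser.add_argument("--group-size", type=int, default=128)
    parser.add_argument("--desc-act", action="store_true")
    return parser.parse_args()
```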
As I understand it, it's also possible to quantize via GPTQ-for-LLaMa; I personally use that method. I'm not sure what the difference between them is. Perhaps we'll need TheBloke to clarify?
I was unsuccessful quantizing with https://github.com/qwopqwop200/GPTQ-for-LLaMa. When I posted in TheBloke's Discord server and spoke to TheBloke, he recommended AutoGPTQ: https://discord.com/channels/1111983596572520458/1117037259603066980/1130502655500881970
Interesting to know. Last time I talked with him, he used GPTQ-for-LLaMa. I guess times have changed.
Update: Quantization working!
PR ready for review/testing/feedback: #545
🔖 Feature description
The final model uses much less VRAM and gives faster inference speeds (especially with ExLlama).
✔️ Solution
See https://github.com/PanQiWei/AutoGPTQ/tree/main#quantization-and-inference
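For a sense of what inference with the quantized output looks like, here is a minimal sketch following the basic-usage example in the AutoGPTQ README; the model and output directory names simply reuse the placeholders from the script above and would need to be changed:

```python
# Minimal inference sketch based on the AutoGPTQ README basic-usage example.
# "facebook/opt-125m" / "opt-125m-4bit" are the placeholder names from the script above.
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=True)
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0")

# the quantized model can be wrapped in a standard transformers text-generation pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```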