
Built-in script to quantize with AutoGPTQ then push to Huggingface #491

Open · Glavin001 opened this issue Aug 27, 2023 · 6 comments · May be fixed by #545
Labels: enhancement (New feature or request)

Glavin001 commented Aug 27, 2023

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Add a built-in script that quantizes a fine-tuned model with AutoGPTQ and pushes the result to the Huggingface Hub. The quantized model uses much less VRAM and has faster inference speeds (especially with ExLlama).

✔️ Solution

See https://github.com/PanQiWei/AutoGPTQ/tree/main#quantization-and-inference (a minimal sketch of this flow follows the checklist below).

  • Convert the Axolotl config & dataset into a quantization (calibration) dataset
    • ⭐ This is why integrating with Axolotl is especially interesting: it reuses the same prompt strategies and dataset-handling code
  • Quantize
    • Logging progress to WandB
  • Save safetensors model
  • Push model
  • Push model README and other files
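
For reference, here is a rough sketch of the AutoGPTQ calls these steps map onto, based on the quantization example in the README linked above. The model name, repo name, and calibration text are placeholders; in the proposed integration the calibration examples would be built from the Axolotl dataset & prompt strategies instead.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder; would come from the Axolotl config
quantized_dir = "my-model-gptq"          # placeholder output directory / Hub repo name

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)

# Calibration data: a list of dicts with "input_ids" and "attention_mask".
# The proposed script would build these from the Axolotl dataset & prompt strategies.
examples = [tokenizer("Below is an instruction that describes a task. Write a response that completes the request.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)  # quantize with the calibration examples

model.save_quantized(quantized_dir, use_safetensors=True)  # save safetensors model
model.push_to_hub(f"username/{quantized_dir}", use_safetensors=True, use_auth_token=True)  # push model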

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
Glavin001 added the enhancement label on Aug 27, 2023
Glavin001 (Contributor, Author) commented:

Here's the WIP code I've been using personally; I'm happy to share it and integrate it with Axolotl 😃 I'd appreciate any code review feedback, since I'm less familiar with Python and these libraries.

Adapted from https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py

With help from ChatGPT-4, here's what I have so far; it has been working successfully (the CHANGE_ME variables are untested):

# pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

import json
import random
import time

import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, LlamaTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from axolotl.prompters import AlpacaPrompter
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.DEBUG, datefmt="%Y-%m-%d %H:%M:%S"
)

print("Done importing...")

## CHANGE BELOW ##
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
dataset_name = "teknium/GPT4-LLM-Cleaned"
huggingface_username = "CHANGE_ME"
## CHANGE ABOVE

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting to False significantly speeds up inference, but perplexity may be slightly worse
)

# TEMPLATE = "<|prompt|>{instruction}</s><|answer|>"
prompter = AlpacaPrompter()

# def load_data(data_path, tokenizer, n_samples, template=TEMPLATE):
def load_data(data_path, tokenizer, n_samples):
    # Load dataset
    dataset = load_dataset(data_path)
    
    if "train" in dataset:
        raw_data = dataset["train"]
    else:
        raw_data = dataset

    # Sample from the dataset if n_samples is provided and less than the dataset size
    if n_samples is not None and n_samples < len(raw_data):
        raw_data = raw_data.shuffle(seed=42).select(range(n_samples))

    def tokenize(examples):
        instructions = examples["instruction"]
        outputs = examples["output"]

        prompts = []
        texts = []
        input_ids = []
        attention_mask = []
        for input_text, output_text in zip(instructions, outputs):
            # prompt = template.format(instruction=input_text)
            # prompt = next(prompter.build_prompt(instruction=input_text, output=output_text))
            prompt = next(prompter.build_prompt(instruction=input_text))
            text = prompt + output_text

            if len(tokenizer(prompt)["input_ids"]) >= tokenizer.model_max_length:
                continue

            tokenized_data = tokenizer(text)

            input_ids.append(tokenized_data["input_ids"][: tokenizer.model_max_length])
            attention_mask.append(tokenized_data["attention_mask"][: tokenizer.model_max_length])
            prompts.append(prompt)
            texts.append(text)

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "prompt": prompts,
            "text": texts,
        }

    raw_data = raw_data.map(
        tokenize,
        batched=True,
        batch_size=len(raw_data),
        num_proc=1,
        keep_in_memory=True,
        load_from_cache_file=False,
        # NOTE: if any examples are skipped above, the output batch is shorter than the input
        # and `map` will raise a length mismatch unless the original columns are dropped, e.g.:
        # remove_columns=raw_data.column_names,
    )

    # Convert to PyTorch tensors
    raw_data.set_format(type='torch', columns=['input_ids', 'attention_mask'])

    # for sample in dataset:
    #     sample["input_ids"] = torch.LongTensor(sample["input_ids"])
    #     sample["attention_mask"] = torch.LongTensor(sample["attention_mask"])

    return raw_data


def get_tokenizer():
    print("Loading tokenizer...")
    # NOTE: LlamaTokenizer assumes a Llama-family base model; for other architectures
    # (e.g. the facebook/opt-125m placeholder above) use AutoTokenizer instead:
    # tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    return tokenizer

def get_model():
    print("Loading model...")
    # load un-quantized model, by default, the model will always be loaded into CPU memory
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
    print("Model loaded.")
    return model

def get_quantized_model():
    print("Loading quantized model...")
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
    print("Model loaded.")
    return model

def quantize(model, examples_for_quant):
    print("Quantize...")
    start = time.time()
    # quantize the model; the examples should be a list of dicts whose only keys are "input_ids" and "attention_mask"
    model.quantize(
        examples_for_quant,
        batch_size=1,
        # batch_size=args.quant_batch_size,
        # use_triton=args.use_triton,
        # autotune_warmup_after_quantized=args.use_triton
    )
    end = time.time()
    print(f"quantization took: {end - start: .4f}s")

    # save quantized model
    print("Saving quantized model...")
    # model.save_quantized(quantized_model_dir)
    model.save_quantized(quantized_model_dir, use_safetensors=True)
    print("Saved.")

    return model

def push_model(model):
    # Push the quantized model to the Hugging Face Hub.
    # To use use_auth_token=True, log in first via `huggingface-cli login`,
    # or pass an explicit token with use_auth_token="hf_xxxxxxx".
    print("Pushing to Huggingface hub...")
    repo_id = f"{huggingface_username}/{quantized_model_dir}"
    commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, use_safetensors=True, safe_serialization=True)
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, safe_serialization=True)
    model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True, use_safetensors=True)

# def push_tokenizer(tokenizer):

def main():
    print("Starting...")
    # return
    # prompt = "<|prompt|>How can entrepreneurs start building their own communities even before launching their product?</s><|answer|>"

    should_quantize = True
    tokenizer = get_tokenizer()

    if should_quantize:
        print("Quantizing...")
        model = get_model()
        examples = load_data(dataset_name, tokenizer, 128)

        # print(examples)
        examples_for_quant = [
            {"input_ids": example["input_ids"], "attention_mask": example["attention_mask"]}
            for example in examples
        ]
        # print(examples_for_quant)

        modelq = quantize(model, examples_for_quant)
    else:
        print("Loading quantized model...")
        modelq = get_quantized_model()

    push_model(modelq)

if __name__ == "__main__":
    main()
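
For completeness, here is a quick inference check of the quantized model, following the usage example in the AutoGPTQ README; this is what the otherwise-unused TextGenerationPipeline import is for, and the prompt is just an illustration:

def check_inference():
    # Sanity-check the quantized model with a text-generation pipeline.
    model = get_quantized_model()
    tokenizer = get_tokenizer()
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
    print(pipeline("Below is an instruction that describes a task.")[0]["generated_text"])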

NanoCode012 (Collaborator) commented:

As I understand it, it's also possible to quantize via GPTQ-for-LLaMa; I personally use that method.

I'm not sure what the difference between them is?

Perhaps we'll need TheBloke to clarify?

Glavin001 (Contributor, Author) commented:

I was unsuccessful quantizing with https://github.com/qwopqwop200/GPTQ-for-LLaMa. When I posted in TheBloke's Discord server and spoke to TheBloke, he recommended AutoGPTQ: https://discord.com/channels/1111983596572520458/1117037259603066980/1130502655500881970


NanoCode012 (Collaborator) commented:

Interesting to know. Last I talked with him, he used GPTQ-for-LLaMa. I guess times have changed.


Glavin001 commented Sep 9, 2023

Update: quantization is working!
You can now quantize automatically with Axolotl, with no custom scripts required.
✅ Rewrote AutoGPTQ's advanced quantization script to leverage the Axolotl config & internal functions (loading the tokenizer & models, merging models, and loading datasets, all from the Axolotl config .yml file): https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py
🚧 Will add another callback to automatically merge and quantize upon completion, if enabled in the config (a rough sketch of the idea is below).
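
Roughly what such a callback could look like (a hypothetical sketch only, not the actual PR implementation; the config flag and the merge/quantize helpers are made-up names):

from transformers import TrainerCallback

class GPTQQuantizeCallback(TrainerCallback):
    """Hypothetical sketch: merge the adapter and quantize after training, if enabled in the config."""

    def __init__(self, cfg, merge_fn, quantize_fn):
        self.cfg = cfg                  # parsed Axolotl config (.yml)
        self.merge_fn = merge_fn        # placeholder: merges LoRA weights, returns the merged model dir
        self.quantize_fn = quantize_fn  # placeholder: runs the AutoGPTQ quantization + push

    def on_train_end(self, args, state, control, **kwargs):
        if getattr(self.cfg, "quantize_on_completion", False):  # hypothetical config flag
            merged_dir = self.merge_fn(self.cfg)
            self.quantize_fn(self.cfg, merged_dir)
        return control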


Glavin001 linked a pull request (#545) on Sep 9, 2023 that will close this issue.
Glavin001 (Contributor, Author) commented:

PR ready for review/testing/feedback: #545
