
Add AutoGPTQ quantization script #545

Draft · wants to merge 24 commits into base: main

Conversation

@Glavin001 (Contributor) commented on Sep 9, 2023

Let's wait until #521 is merged.


Closes #491

Quantize automatically with Axolotl; no custom scripts required.

Demo

[demo screenshot]

How to try yourself

Create a quantized model with Axolotl in 3 steps:

1️⃣ Train

accelerate launch ./scripts/finetune.py ./examples/llama-2/lora.yml

2️⃣ Merge

accelerate launch ./scripts/finetune.py ./examples/llama-2/lora.yml --merge_lora

3️⃣ 🆕 Quantize

accelerate launch ./scripts/finetune.py ./examples/llama-2/lora.yml --quantize

Progress:
Look for logging lines such as:

# [2023-09-11 07:20:37,502] [INFO] [auto_gptq.modeling._base.quantize:364] [PID:3962] [RANK:0] Quantizing self_attn.k_proj in layer 4/32...

This shows quantization working through layer 4 of 32.

Task list

  • Rewrote AutoGPTQ's advanced quantization script (https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py) to leverage the Axolotl config & internal functions: loading the tokenizer & models, merging models, and loading datasets all from the Axolotl .yml config file. See the sketch after this list.
  • Will add another callback to automatically merge and quantize upon completion, if enabled in the config
    • I couldn't figure out how to release the existing model from GPU memory, so merging couldn't run directly after training in the same process; the steps had to be kept separate.
    • Add --quantize CLI option
  • Get others to test and provide initial feedback
  • Clean up PR / old code / etc.
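For reference, the upstream script in the first item boils down to roughly the following AutoGPTQ calls (a simplified sketch of the library's documented API, not the exact code in this PR; the paths and the single calibration example are placeholders, and in this PR the equivalents come from the Axolotl .yml config instead):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

# Placeholder paths; in this PR these come from the Axolotl config.
merged_model_dir = "./lora-out/merged"
quantized_out_dir = "./quantized"

tokenizer = AutoTokenizer.from_pretrained(merged_model_dir, use_fast=True)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# Load the (merged) full-precision model with the quantization settings attached.
model = AutoGPTQForCausalLM.from_pretrained(merged_model_dir, quantize_config)

# Calibration examples: tokenized samples, e.g. from the same dataset Axolotl
# already loads for training. One trivial sample shown here as an illustration.
sample = tokenizer("Below is an instruction that describes a task.", return_tensors="pt")
examples = [{"input_ids": sample["input_ids"], "attention_mask": sample["attention_mask"]}]

model.quantize(examples)
model.save_quantized(quantized_out_dir, use_safetensors=True)
```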

# import debugpy
# debugpy.listen(('0.0.0.0', 5678))
# debugpy.wait_for_client()
# debugpy.breakpoint()

@Glavin001 clean up old code

prompter = AlpacaPrompter()

# def load_data(data_path, tokenizer, n_samples, template=TEMPLATE):
def load_data(data_path, tokenizer, n_samples):

@Glavin001 Delete this. Have a new method using Axolotl built-in functions

)

# TEMPLATE = "<|prompt|>{instruction}</s><|answer|>"
prompter = AlpacaPrompter()

@Glavin001 delete. Using Axolotl config and built-in functions now

# huggingface_username = "CHANGE_ME"
## CHANGE ABOVE

quantize_config = BaseQuantizeConfig(

@Glavin001: Add to Axolotl config?

cc @winglian @tmm1 @NanoCode012: would you recommend leaving these as defaults or adding them to the Axolotl config file as options?

print("Done importing...")

## CHANGE BELOW ##
config_path: Path = Path("./examples/llama-2/lora.yml")

@Glavin001: Replace the hard-coded path with the current config from the Axolotl callback.
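
For context, the sort of hook this would plug into (a hypothetical sketch on top of transformers' TrainerCallback; the class name and the quantize_model helper are assumptions, not code from this PR):

```python
from transformers import TrainerCallback

class QuantizeCallback(TrainerCallback):
    """Hypothetical callback that receives the parsed Axolotl cfg, so the
    quantization step no longer needs a hard-coded config path."""

    def __init__(self, cfg):
        self.cfg = cfg

    def on_train_end(self, args, state, control, **kwargs):
        # quantize_model would wrap the AutoGPTQ flow sketched in the PR body.
        quantize_model(self.cfg)
```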

configure_logging()
LOG = logging.getLogger("axolotl")

# logging.basicConfig(

Help Wanted

I couldn't get any logging to work from AutoGPTQ. Would be nice to fix logging.
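
One thing worth trying (a minimal sketch using only the standard logging module; whether it coexists with Axolotl's configure_logging() is exactly the open question here). The INFO line in the PR description shows AutoGPTQ logging under the auto_gptq namespace, so explicitly giving that logger a handler and level may be enough to surface its messages:

```python
import logging

# AutoGPTQ logs under the "auto_gptq" namespace (see the INFO line in the PR
# description); attach a handler and level so its records are not swallowed.
gptq_logger = logging.getLogger("auto_gptq")
gptq_logger.setLevel(logging.INFO)
gptq_logger.addHandler(logging.StreamHandler())
gptq_logger.propagate = False  # avoid double-printing if the root logger also has handlers
```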


This is my old code which works when not calling Axolotl's configure_logging()

print("Merged model not found. Merging...")
# model, tokenizer = load_model(cfg, inference=True)
# do_merge_lora_model_and_tokenizer(cfg=cfg, model=model, tokenizer=tokenizer)
raise NotImplementedError("Merging model is not implemented yet.")

@Glavin001 TODO: implement this, so quantization has a merged model to work with.
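
One possible shape for this TODO, using peft's merge_and_unload (a sketch only; the cfg field names and the output directory are assumptions, not this PR's final implementation):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, apply the trained LoRA adapter, then fold the adapter
# weights into the base weights so AutoGPTQ sees a plain model.
base = AutoModelForCausalLM.from_pretrained(cfg.base_model, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, cfg.lora_model_dir)
merged = model.merge_and_unload()

merged_out_dir = "./lora-out/merged"  # assumed output location
merged.save_pretrained(merged_out_dir)

tokenizer = AutoTokenizer.from_pretrained(cfg.base_model)
tokenizer.save_pretrained(merged_out_dir)
```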

# accelerate launch ./scripts/finetune.py ./examples/llama-2/lora.yml --merge_lora --lora_model_dir="./lora-out" --load_in_8bit=False --load_in_4bit=False
# CUDA_VISIBLE_DEVICES="1" accelerate launch ./scripts/finetune.py ./examples/llama-2/lora.yml --merge_lora --lora_model_dir="./lora-out" --load_in_8bit=False --load_in_4bit=False

# HUB_MODEL_ID="Glavin001/llama-2-7b-alpaca_2k_test" accelerate launch ./scripts/quantize.py

@Glavin001 delete test notes

cfg.wandb_project = os.environ.get("WANDB_PROJECT")

if os.environ.get("HUB_MODEL_ID") and len(os.environ.get("HUB_MODEL_ID", "")) > 0:
    cfg.hub_model_id = os.environ.get("HUB_MODEL_ID")

FYI: this supports upcoming work to start scripts/finetune.py and have it run without any custom, run-specific, or user-specific info in the Axolotl config.

Collaborator: might be better off in the setup_wandb_env_vars function

@Glavin001 marked this pull request as ready for review on September 11, 2023, 07:59.
@Glavin001 changed the title from "WIP Add AutoGPTQ quantization script" to "Add AutoGPTQ quantization script" on Sep 11, 2023.
train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
# tokenizer = None
should_quantize = True

TODO (@Glavin001): make this based on the config.
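
A minimal sketch of what basing it on the config could look like (the quantize key is an assumed Axolotl option here, not one that exists today):

```python
# Assumed config key; defaults to False when the option is absent.
should_quantize = bool(parsed_cfg.get("quantize", False))
```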


log_gpu_memory()

do_merge_lora(cfg=parsed_cfg, cli_args=parsed_cli_args)
@Glavin001 commented on Sep 11, 2023:

Help Wanted

I kept getting:

Expected a cuda device, but got: cpu

when calling do_merge_lora

Running nvidia-smi always showed lots of GPU memory still taken up / unreleased.
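
One avenue worth trying before calling do_merge_lora (a sketch using standard gc / PyTorch calls; the variable names are placeholders for whatever still holds the training model, and this only helps if no other references keep the tensors alive):

```python
import gc
import torch

# Drop every Python reference to the training objects first; CUDA memory is
# only returned to the pool once nothing points at the tensors anymore.
del model
del trainer  # placeholder for any other object still holding the model
gc.collect()
torch.cuda.empty_cache()
```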

@Glavin001 marked this pull request as draft on September 12, 2023, 07:01.

Successfully merging this pull request may close these issues:

Built-in script to quantize with AutoGPTQ then push to Huggingface