Hi there! Thanks for this amazing library. I was able to run a 70B model on my M2 MacBook Pro!
I get about one token every 100 seconds, which is almost good enough for my overnight tasks; I'm hoping I can get it down to 20 seconds per token, though.
Is it possible to quantize the input model to make it faster?
I've tried quantizing with llama.cpp, but I don't think AirLLM can read its output format. I see that PyTorch has a way to quantize, but I can't figure out how to apply it with AutoModel.
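For what it's worth, this is the kind of recipe I was looking at on the PyTorch side (dynamic int8 quantization of the Linear layers). I couldn't see how to plug it into AirLLM's loading path, and the model id here is just a placeholder:

import torch
from transformers import AutoModelForCausalLM

# stock PyTorch dynamic quantization, not something AirLLM exposes;
# the repo id is only a placeholder for whatever 70B checkpoint I'd use
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)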
Any pointers in the right direction would help. Thanks!
I just re-read the README again and learned about the compression option!
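For reference, this is roughly what my main.py looks like; the repo id and the exact compression value are just my reading of the README, so treat them as illustrative:

from airllm import AutoModel

# roughly my main.py -- repo id is just the 70B checkpoint I'm testing with
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',  # the option I found in the README; '8bit' is also mentioned there
)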
However, it doesn't quite work; I get this error:
Traceback (most recent call last):
  File "/Users/verdagon/AirLLM/air_llm/main.py", line 12, in <module>
    model = AutoModel.from_pretrained(
  File "/Users/verdagon/AirLLM/air_llm/airllm/auto_model.py", line 49, in from_pretrained
    return AirLLMLlamaMlx(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/Users/verdagon/AirLLM/air_llm/airllm/airllm_llama_mlx.py", line 224, in __init__
    self.model_local_path, self.checkpoint_path = find_or_create_local_splitted_path(model_local_path_or_repo_id,
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 351, in find_or_create_local_splitted_path
    return Path(model_local_path_or_repo_id), split_and_save_layers(model_local_path_or_repo_id, layer_shards_saving_path,
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 303, in split_and_save_layers
    layer_state_dict = compress_layer_state_dict(layer_state_dict, compression)
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 169, in compress_layer_state_dict
    v_quant, quant_state = bnb.functional.quantize_blockwise(v.cuda(), blocksize=2048)
  File "/Users/verdagon/Library/Python/3.9/lib/python/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
I tried changing that v.cuda() to v.cpu(), but that didn't help; it just fails further down inside bitsandbytes instead.
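Concretely, this is the change I experimented with in airllm/utils.py (the original line is from the traceback above; the CPU variant is just my experiment):

# airllm/utils.py, inside compress_layer_state_dict
# original call (from the traceback):
v_quant, quant_state = bnb.functional.quantize_blockwise(v.cuda(), blocksize=2048)
# my experiment -- this just moves the failure into bitsandbytes itself:
v_quant, quant_state = bnb.functional.quantize_blockwise(v.cpu(), blocksize=2048)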
And the bitsandbytes docs say it's a CUDA library, so I'm guessing this compression feature only works on CUDA machines. They're apparently working on Mac support, but it isn't done yet. Unfortunate!
Hopefully there's a way to quantize the input instead.
Looking at the code more, it seems AirLLM only reads PyTorch and safetensors checkpoint formats. This might still work if I can get a quantized model into one of those.
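If I do get quantized weights out of some other tool, this is roughly how I'd expect to repack them for AirLLM. It's purely a sketch on my end: the file names are made up, and I haven't checked whether AirLLM's layer splitter accepts an arbitrary state dict like this:

import torch
from safetensors.torch import save_file

# load whatever quantized state dict I end up with (file name is made up)
state_dict = torch.load("quantized_llama.pt", map_location="cpu")
# safetensors wants plain contiguous tensors with no shared storage
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "model.safetensors")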