Hi there! Thanks for this amazing library. I was able to run a 70B model on my M2 MacBook Pro!
I get about one token every 100 seconds, which is almost good enough for my overnight tasks; I'm hoping I can get it down to 20 seconds per token, though.
Is it possible to quantize the input model to make it faster?
I've tried quantizing with llama.cpp, but I don't think AirLLM can read its output format. I see that PyTorch has a way to quantize, but I can't figure out how to apply it with AutoModel.
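For what it's worth, this is the kind of recipe I was looking at on the PyTorch side (dynamic int8 quantization of the Linear layers). I couldn't see how to plug it into AirLLM's loading path, and the model id here is just a placeholder:

import torch
from transformers import AutoModelForCausalLM

# stock PyTorch dynamic quantization, not something AirLLM exposes;
# the repo id is only a placeholder for whatever 70B checkpoint I'd use
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)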
Any pointers in the right direction would help. Thanks!
I just re-read the README again and learned about the compression option!
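For reference, this is roughly what my main.py looks like; the repo id and the exact compression value are just my reading of the README, so treat them as illustrative:

from airllm import AutoModel

# roughly my main.py -- repo id is just the 70B checkpoint I'm testing with
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',  # the option I found in the README; '8bit' is also mentioned there
)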
However, it doesn't quite work; I get this error:
Traceback (most recent call last):
  File "/Users/verdagon/AirLLM/air_llm/main.py", line 12, in <module>
    model = AutoModel.from_pretrained(
  File "/Users/verdagon/AirLLM/air_llm/airllm/auto_model.py", line 49, in from_pretrained
    return AirLLMLlamaMlx(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/Users/verdagon/AirLLM/air_llm/airllm/airllm_llama_mlx.py", line 224, in __init__
    self.model_local_path, self.checkpoint_path = find_or_create_local_splitted_path(model_local_path_or_repo_id,
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 351, in find_or_create_local_splitted_path
    return Path(model_local_path_or_repo_id), split_and_save_layers(model_local_path_or_repo_id, layer_shards_saving_path,
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 303, in split_and_save_layers
    layer_state_dict = compress_layer_state_dict(layer_state_dict, compression)
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 169, in compress_layer_state_dict
    v_quant, quant_state = bnb.functional.quantize_blockwise(v.cuda(), blocksize=2048)
  File "/Users/verdagon/Library/Python/3.9/lib/python/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
I tried changing that v.cuda() to v.cpu(), but that didn't help; it just fails further down inside bitsandbytes instead.
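Concretely, this is the change I experimented with in airllm/utils.py (the original line is from the traceback above; the CPU variant is just my experiment):

# airllm/utils.py, inside compress_layer_state_dict
# original call (from the traceback):
v_quant, quant_state = bnb.functional.quantize_blockwise(v.cuda(), blocksize=2048)
# my experiment -- this just moves the failure into bitsandbytes itself:
v_quant, quant_state = bnb.functional.quantize_blockwise(v.cpu(), blocksize=2048)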
And the bitsandbytes docs say it's a CUDA library, so I'm guessing this compression feature only works on CUDA machines. They're apparently working on Mac support, but it isn't done yet. Unfortunate!
Hopefully there's a way to quantize the input instead.
Looking at the code more, it seems AirLLM only reads PyTorch and safetensors checkpoint formats. This might still work if I can get a quantized model into one of those.
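If I do get quantized weights out of some other tool, this is roughly how I'd expect to repack them for AirLLM. It's purely a sketch on my end: the file names are made up, and I haven't checked whether AirLLM's layer splitter accepts an arbitrary state dict like this:

import torch
from safetensors.torch import save_file

# load whatever quantized state dict I end up with (file name is made up)
state_dict = torch.load("quantized_llama.pt", map_location="cpu")
# safetensors wants plain contiguous tensors with no shared storage
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "model.safetensors")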