it is run #192

werruww · 2024-10-24T20:25:50Z

!pip install -U airllm

!pip install -U bitsandbytes

!pip install git+https://github.com/huggingface/transformers.git

!pip install git+https://github.com/huggingface/accelerate.git

!pip install tiktoken

!pip install transformers_stream_generator

from airllm import AutoModel

MAX_LENGTH = 128

could use hugging face model repo id:

#model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)
from airllm import AutoModel

MAX_LENGTH = 128

could use hugging face model repo id:

#model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)

model = AutoModel.from_pretrained("Qwen/Qwen-7B",
compression='4bit',
delete_original=True # specify '8bit' for 8-bit block-wise quantization
)

or use model's local path...

#model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
#'What is the capital of China?',
'Who is Napoleon Bonaparte؟',
]

input_tokens = model.tokenizer(input_text,
return_tensors="pt",
return_attention_mask=False,
truncation=True,
max_length=MAX_LENGTH,
#padding=True
)

generation_output = model.generate(
input_tokens['input_ids'].cuda(),
max_new_tokens=12,
use_cache=True,
return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])

Fetching 20 files: 100%
20/20 [00:00<00:00, 1217.93it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, 'transformer.h.1.': True, 'transformer.h.2.': True, 'transformer.h.3.': True, 'transformer.h.4.': True, 'transformer.h.5.': True, 'transformer.h.6.': True, 'transformer.h.7.': True, 'transformer.h.8.': True, 'transformer.h.9.': True, 'transformer.h.10.': True, 'transformer.h.11.': True, 'transformer.h.12.': True, 'transformer.h.13.': True, 'transformer.h.14.': True, 'transformer.h.15.': True, 'transformer.h.16.': True, 'transformer.h.17.': True, 'transformer.h.18.': True, 'transformer.h.19.': True, 'transformer.h.20.': True, 'transformer.h.21.': True, 'transformer.h.22.': True, 'transformer.h.23.': True, 'transformer.h.24.': True, 'transformer.h.25.': True, 'transformer.h.26.': True, 'transformer.h.27.': True, 'transformer.h.28.': True, 'transformer.h.29.': True, 'transformer.h.30.': True, 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:13<00:00, 2.68it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:13<00:00, 2.69it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.83it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.80it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.86it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.84it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.83it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.84it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.82it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.84it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.87it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.86it/s]
Who is Napoleon Bonaparte؟" The answer is:\nA:\n\nNapoleon Bon

werruww · 2024-10-24T20:27:02Z

max_new_tokens=xxxxxxxxxxxxxxxxxxxxxx,

The higher the number, the more load on the graphics unit and the longer the time.

generation_output = model.generate(
input_tokens['input_ids'].cuda(),
max_new_tokens=12,
use_cache=True,
return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])

werruww · 2024-10-24T20:27:18Z

it run on colab t4

werruww · 2024-10-24T21:07:54Z

from airllm import AutoModel

MAX_LENGTH = 128

could use hugging face model repo id:

#model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)
from airllm import AutoModel

MAX_LENGTH = 128

could use hugging face model repo id:

#model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)

model = AutoModel.from_pretrained("Qwen/Qwen-7B",
compression='4bit',
delete_original=True # specify '8bit' for 8-bit block-wise quantization
)

or use model's local path...

#model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
#'What is the capital of China?',
'Who invented the electric light bulb?',
]

input_tokens = model.tokenizer(input_text,
return_tensors="pt",
return_attention_mask=False,
truncation=True,
max_length=MAX_LENGTH,
#padding=True
)

generation_output = model.generate(
input_tokens['input_ids'].cuda(),
max_new_tokens=5,
use_cache=True,
no_repeat_ngram_size=3, # يمنع تكرار ثلاث كلمات متتابعة
repetition_penalty=1.2, # عقوبة للتكرار لتجنب تكرار الكلمات
return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])

bitsandbytes installed
cache_utils installed
Fetching 20 files: 100%
20/20 [00:00<00:00, 1025.63it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, 'transformer.h.1.': True, 'transformer.h.2.': True, 'transformer.h.3.': True, 'transformer.h.4.': True, 'transformer.h.5.': True, 'transformer.h.6.': True, 'transformer.h.7.': True, 'transformer.h.8.': True, 'transformer.h.9.': True, 'transformer.h.10.': True, 'transformer.h.11.': True, 'transformer.h.12.': True, 'transformer.h.13.': True, 'transformer.h.14.': True, 'transformer.h.15.': True, 'transformer.h.16.': True, 'transformer.h.17.': True, 'transformer.h.18.': True, 'transformer.h.19.': True, 'transformer.h.20.': True, 'transformer.h.21.': True, 'transformer.h.22.': True, 'transformer.h.23.': True, 'transformer.h.24.': True, 'transformer.h.25.': True, 'transformer.h.26.': True, 'transformer.h.27.': True, 'transformer.h.28.': True, 'transformer.h.29.': True, 'transformer.h.30.': True, 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.70it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.78it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.88it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.77it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.87it/s]
Who invented the electric light bulb? A. Thomas Edison

How do I prevent the question "Who invented the electric light bulb? A. Thomas Edison" from being repeated and the answer being "A. Thomas Edison" only?

werruww · 2024-10-24T21:08:06Z

How do I prevent the question "Who invented the electric light bulb? A. Thomas Edison" from being repeated and the answer being "A. Thomas Edison" only?

werruww · 2024-10-24T21:10:24Z

How do I prevent the question from being repeated?

werruww · 2024-10-24T21:22:31Z

echo=True
Is there an instance of echo=True in airllm

werruww · 2024-10-24T21:23:23Z

Echo the prompt back in the output

werruww · 2024-10-24T21:29:39Z

Fetching 20 files: 100%
20/20 [00:00<00:00, 1258.81it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, 'transformer.h.1.': True, 'transformer.h.2.': True, 'transformer.h.3.': True, 'transformer.h.4.': True, 'transformer.h.5.': True, 'transformer.h.6.': True, 'transformer.h.7.': True, 'transformer.h.8.': True, 'transformer.h.9.': True, 'transformer.h.10.': True, 'transformer.h.11.': True, 'transformer.h.12.': True, 'transformer.h.13.': True, 'transformer.h.14.': True, 'transformer.h.15.': True, 'transformer.h.16.': True, 'transformer.h.17.': True, 'transformer.h.18.': True, 'transformer.h.19.': True, 'transformer.h.20.': True, 'transformer.h.21.': True, 'transformer.h.22.': True, 'transformer.h.23.': True, 'transformer.h.24.': True, 'transformer.h.25.': True, 'transformer.h.26.': True, 'transformer.h.27.': True, 'transformer.h.28.': True, 'transformer.h.29.': True, 'transformer.h.30.': True, 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.85it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.87it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.88it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.87it/s]
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.72it/s]What is the capital of United States? The answer is:

What is the capital of the United States? The answer is: How do I prevent this and make the answer direct without repeating the question?

werruww · 2024-10-24T21:29:56Z

What is the capital of the United States? The answer is:

How do I prevent this and make the answer direct without repeating the question?

werruww · 2024-10-24T22:00:00Z

Can those in charge of airllm modify the part of the answer to prevent the question from appearing in the answer and prevent the answer from appearing in the answer and not counting it in the number of tokens to be shown?

werruww · 2024-10-24T22:35:15Z

Without repeating the question in the answer

from airllm import AutoModel
MAX_LENGTH = 128

model = AutoModel.from_pretrained("Qwen/Qwen-7B",
compression='4bit')
input_text = ['Who invented the electric lamp?']
input_tokens = model.tokenizer(input_text,
return_tensors="pt",
return_attention_mask=False,
truncation=True,
max_length=MAX_LENGTH)
generation_output = model.generate(
input_tokens['input_ids'].cuda(),
max_new_tokens=5,
use_cache=True,
return_dict_in_generate=True)

response = model.tokenizer.decode(generation_output.sequences[0])
cleaned_response = response.replace(input_text[0], "").strip() # Remove the question
print(cleaned_response)

either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.80it/s]A. Edison

werruww mentioned this issue Oct 24, 2024

errors #191

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

it is run #192

it is run #192

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

it is run #192

it is run #192

Comments

werruww commented Oct 24, 2024

could use hugging face model repo id:

could use hugging face model repo id:

or use model's local path...

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

could use hugging face model repo id:

could use hugging face model repo id:

or use model's local path...

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024

werruww commented Oct 24, 2024