
it is run #192

Open
werruww opened this issue Oct 24, 2024 · 11 comments
werruww commented Oct 24, 2024

```
!pip install -U airllm
!pip install -U bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install tiktoken
!pip install transformers_stream_generator
```

```python
from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
# model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)
model = AutoModel.from_pretrained(
    "Qwen/Qwen-7B",
    compression='4bit',   # specify '8bit' for 8-bit block-wise quantization
    delete_original=True,
)

# or use the model's local path...
# model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    # 'What is the capital of China?',
    'Who is Napoleon Bonaparte؟',
]

input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    # padding=True
)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=12,
    use_cache=True,
    return_dict_in_generate=True,
)

model.tokenizer.decode(generation_output.sequences[0])
```

Fetching 20 files: 100% 20/20 [00:00<00:00, 1217.93it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, 'transformer.h.1.': True, 'transformer.h.2.': True, 'transformer.h.3.': True, 'transformer.h.4.': True, 'transformer.h.5.': True, 'transformer.h.6.': True, 'transformer.h.7.': True, 'transformer.h.8.': True, 'transformer.h.9.': True, 'transformer.h.10.': True, 'transformer.h.11.': True, 'transformer.h.12.': True, 'transformer.h.13.': True, 'transformer.h.14.': True, 'transformer.h.15.': True, 'transformer.h.16.': True, 'transformer.h.17.': True, 'transformer.h.18.': True, 'transformer.h.19.': True, 'transformer.h.20.': True, 'transformer.h.21.': True, 'transformer.h.22.': True, 'transformer.h.23.': True, 'transformer.h.24.': True, 'transformer.h.25.': True, 'transformer.h.26.': True, 'transformer.h.27.': True, 'transformer.h.28.': True, 'transformer.h.29.': True, 'transformer.h.30.': True, 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:13<00:00, 2.68it/s]
(the warning block and a "running layers" pass repeat roughly once per generated token: 12 passes of ~12-13 s each for max_new_tokens=12)
Who is Napoleon Bonaparte؟" The answer is:\nA:\n\nNapoleon Bon


werruww commented Oct 24, 2024

The higher max_new_tokens is, the heavier the load on the GPU and the longer generation takes.

```python
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=12,
    use_cache=True,
    return_dict_in_generate=True,
)

model.tokenizer.decode(generation_output.sequences[0])
```
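In this compressed, no-prefetch mode AirLLM re-runs all layers for every decoding step (each "running layers" pass in the logs above), so total generation time grows roughly linearly with max_new_tokens. A minimal timing sketch, assuming the `model` and `input_tokens` objects from the snippet above already exist:

```python
import time

# Rough timing of generation for a few max_new_tokens values.
# Assumes `model` and `input_tokens` already exist as in the snippet above.
for n in (4, 8, 12):
    start = time.time()
    out = model.generate(
        input_tokens['input_ids'].cuda(),
        max_new_tokens=n,
        use_cache=True,
        return_dict_in_generate=True,
    )
    elapsed = time.time() - start
    print(f"max_new_tokens={n}: {elapsed:.1f}s, "
          f"output length {out.sequences.shape[1]} tokens")
```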


werruww commented Oct 24, 2024

It runs on Colab T4.

werruww mentioned this issue Oct 24, 2024

werruww commented Oct 24, 2024

```python
from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
# model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)
model = AutoModel.from_pretrained(
    "Qwen/Qwen-7B",
    compression='4bit',   # specify '8bit' for 8-bit block-wise quantization
    delete_original=True,
)

# or use the model's local path...
# model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    # 'What is the capital of China?',
    'Who invented the electric light bulb?',
]

input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    # padding=True
)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    no_repeat_ngram_size=3,   # prevents repeating any 3-token sequence
    repetition_penalty=1.2,   # penalizes repetition to avoid repeated words
    return_dict_in_generate=True,
)

model.tokenizer.decode(generation_output.sequences[0])
```

bitsandbytes installed
cache_utils installed
Fetching 20 files: 100% 20/20 [00:00<00:00, 1025.63it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, 'transformer.h.1.': True, 'transformer.h.2.': True, 'transformer.h.3.': True, 'transformer.h.4.': True, 'transformer.h.5.': True, 'transformer.h.6.': True, 'transformer.h.7.': True, 'transformer.h.8.': True, 'transformer.h.9.': True, 'transformer.h.10.': True, 'transformer.h.11.': True, 'transformer.h.12.': True, 'transformer.h.13.': True, 'transformer.h.14.': True, 'transformer.h.15.': True, 'transformer.h.16.': True, 'transformer.h.17.': True, 'transformer.h.18.': True, 'transformer.h.19.': True, 'transformer.h.20.': True, 'transformer.h.21.': True, 'transformer.h.22.': True, 'transformer.h.23.': True, 'transformer.h.24.': True, 'transformer.h.25.': True, 'transformer.h.26.': True, 'transformer.h.27.': True, 'transformer.h.28.': True, 'transformer.h.29.': True, 'transformer.h.30.': True, 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.70it/s]
(the warning block and a "running layers" pass repeat roughly once per generated token: 5 passes for max_new_tokens=5)
Who invented the electric light bulb? A. Thomas Edison

How do I stop the output from repeating the question "Who invented the electric light bulb?" so that only the answer "A. Thomas Edison" is returned?


werruww commented Oct 24, 2024

How do I stop the output from repeating the question "Who invented the electric light bulb?" so that only the answer "A. Thomas Edison" is returned?


werruww commented Oct 24, 2024

How do I prevent the question from being repeated?
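
One approach (not airllm-specific): generation_output.sequences includes the prompt tokens, as the pasted outputs show, so the prompt can be sliced off by its token count before decoding. A minimal sketch, assuming the `model`, `input_tokens`, and `generation_output` objects from the snippets above:

```python
# Decode only the newly generated tokens by skipping the prompt tokens.
# Assumes `model`, `input_tokens`, and `generation_output` exist as in the snippets above.
prompt_len = input_tokens['input_ids'].shape[1]
new_tokens = generation_output.sequences[0][prompt_len:]
answer = model.tokenizer.decode(new_tokens).strip()
print(answer)  # e.g. "A. Thomas Edison" instead of the echoed question
```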


werruww commented Oct 24, 2024

Is there an echo=True option in airllm?


werruww commented Oct 24, 2024

That is, an option controlling whether the prompt is echoed back in the output.


werruww commented Oct 24, 2024

Fetching 20 files: 100% 20/20 [00:00<00:00, 1258.81it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, 'transformer.h.1.': True, 'transformer.h.2.': True, 'transformer.h.3.': True, 'transformer.h.4.': True, 'transformer.h.5.': True, 'transformer.h.6.': True, 'transformer.h.7.': True, 'transformer.h.8.': True, 'transformer.h.9.': True, 'transformer.h.10.': True, 'transformer.h.11.': True, 'transformer.h.12.': True, 'transformer.h.13.': True, 'transformer.h.14.': True, 'transformer.h.15.': True, 'transformer.h.16.': True, 'transformer.h.17.': True, 'transformer.h.18.': True, 'transformer.h.19.': True, 'transformer.h.20.': True, 'transformer.h.21.': True, 'transformer.h.22.': True, 'transformer.h.23.': True, 'transformer.h.24.': True, 'transformer.h.25.': True, 'transformer.h.26.': True, 'transformer.h.27.': True, 'transformer.h.28.': True, 'transformer.h.29.': True, 'transformer.h.30.': True, 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.85it/s]
(the warning block and a "running layers" pass repeat for each generated token)
What is the capital of United States? The answer is:

The output is just the question echoed back: "What is the capital of the United States? The answer is:". How do I prevent this and get a direct answer without the question being repeated?


werruww commented Oct 24, 2024

What is the capital of the United States? The answer is:

How do I prevent this and make the answer direct without repeating the question?


werruww commented Oct 24, 2024

Could the airllm maintainers adjust the generation/decoding step so that the question does not appear in the answer, and so that the prompt is not counted toward the number of tokens to be generated?


werruww commented Oct 24, 2024

Without repeating the question in the answer

```python
from airllm import AutoModel

MAX_LENGTH = 128

model = AutoModel.from_pretrained("Qwen/Qwen-7B",
                                  compression='4bit')

input_text = ['Who invented the electric lamp?']
input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)

response = model.tokenizer.decode(generation_output.sequences[0])
cleaned_response = response.replace(input_text[0], "").strip()  # remove the question
print(cleaned_response)
```

either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.80it/s]
A. Edison
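
A note on this workaround: replace() only removes the question when the decoded text reproduces the prompt string exactly; slicing generation_output.sequences[0] by the prompt's token count before decoding (as in the sketch earlier in this thread) does not depend on exact string matching, so it may be the safer way to drop the echoed question.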
