windows tests report #1007
The |
As for
this is probably the same thing that manifested as |
Thanks a lot for the report @wkpark!
(venv) D:\src\transformers>python -m pytest tests\quantization\bnb\test_4bit.py
====================================================================================================== test session starts ======================================================================================================
platform win32 -- Python 3.10.11, pytest-7.4.2, pluggy-1.3.0
rootdir: D:\src\transformers
configfile: pyproject.toml
plugins: anyio-3.7.1, hydra-core-1.3.2, hypothesis-6.93.0, xdist-3.5.0
collected 39 items
tests\quantization\bnb\test_4bit.py ......F.FF....s...FF.FF..FFFFFFFFFFFFFF [100%]
=========================================================================================================== FAILURES ============================================================================================================
________________________________________________________________________________________________ Bnb4BitTest.test_original_dtype ________________________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitTest testMethod=test_original_dtype>
def test_original_dtype(self):
r"""
A simple test to check if the model succesfully stores the original dtype
"""
> self.assertTrue(hasattr(self.model_4bit.config, "_pre_quantization_dtype"))
E AssertionError: False is not true
tests\quantization\bnb\test_4bit.py:177: AssertionError
_________________________________________________________________________________________ Bnb4BitTest.test_quantization_num_parameters __________________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitTest testMethod=test_quantization_num_parameters>
def test_quantization_num_parameters(self):
r"""
Test if the number of returned parameters is correct
See: https://github.com/huggingface/transformers/issues/25978
"""
num_params_4bit = self.model_4bit.num_parameters()
num_params_fp16 = self.model_fp16.num_parameters()
> self.assertEqual(num_params_4bit, num_params_fp16)
E AssertionError: 1118429184 != 1722408960
tests\quantization\bnb\test_4bit.py:144: AssertionError
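A note on the two numbers in this assertion: the gap is consistent with the quantized layers reporting their packed element count instead of their original one. The sketch below is only arithmetic on the values from the log. It assumes two 4-bit values are packed into each uint8 element, which matches the `(n+1)//mod` sizing in the `quantize_4bit` source quoted further down; the variable names are illustrative.

```python
# Counts copied from the assertion failure above.
num_params_4bit = 1118429184
num_params_fp16 = 1722408960

# Assumption: every quantized weight stores two 4-bit values per uint8
# element, so a quantized layer reports half of its original element count.
missing = num_params_fp16 - num_params_4bit        # elements "lost" to packing
quantized_original = 2 * missing                   # original size of the quantized layers
unquantized = num_params_fp16 - quantized_original  # layers left in full precision

print(missing, quantized_original, unquantized)
# prints: 603979776 1207959552 514449408

# Sanity check: unpacked layers plus half of the packed ones gives the 4-bit count.
assert unquantized + quantized_original // 2 == num_params_4bit
```

If that assumption holds, the installed `num_parameters()` is probably not compensating for the packing, which would point at a version mismatch between the test file and the installed package rather than at a Windows-specific problem.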
__________________________________________________________________________________________________ Bnb4BitTest.test_rwkv_4bit ___________________________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitTest testMethod=test_rwkv_4bit>
def test_rwkv_4bit(self):
r"""
A simple test to check if 4-bit RWKV inference works as expected.
"""
model_id = "RWKV/rwkv-4-169m-pile"
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
tok = AutoTokenizer.from_pretrained(model_id)
text = "Hello my name is"
input_ids = tok.encode(text, return_tensors="pt").to(0)
> _ = model.generate(input_ids, max_new_tokens=30)
tests\quantization\bnb\test_4bit.py:211:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py:115: in decorate_context
return func(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\generation\utils.py:1522: in generate
return self.greedy_search(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\generation\utils.py:2339: in greedy_search
outputs = self(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1518: in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1527: in _call_impl
return forward_call(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:165: in new_forward
output = old_forward(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\rwkv\modeling_rwkv.py:789: in forward
rwkv_outputs = self.rwkv(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1518: in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1527: in _call_impl
return forward_call(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:165: in new_forward
output = old_forward(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\rwkv\modeling_rwkv.py:642: in forward
self._rescale_layers()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = RwkvModel(
(embeddings): Embedding(50277, 768)
(blocks): ModuleList(
(0): RwkvBlock(
(pre_ln): LayerNorm...72, out_features=768, bias=False)
)
)
)
(ln_out): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
def _rescale_layers(self):
# Layers should be rescaled for inference only.
if self.layers_are_rescaled == (not self.training):
return
if self.config.rescale_every > 0:
with torch.no_grad():
for block_id, block in enumerate(self.blocks):
if self.training:
block.attention.output.weight.mul_(2 ** int(block_id // self.config.rescale_every))
block.feed_forward.value.weight.mul_(2 ** int(block_id // self.config.rescale_every))
else:
# Deal with quantization statistics
if hasattr(block.attention.output.weight, "SCB"):
block.attention.output.weight.SCB.div_(2 ** int(block_id // self.config.rescale_every))
block.feed_forward.value.weight.SCB.div_(2 ** int(block_id // self.config.rescale_every))
elif hasattr(block.attention.output.weight, "quant_state"):
> block.attention.output.weight.quant_state[0].div_(
2 ** int(block_id // self.config.rescale_every)
)
E TypeError: 'QuantState' object is not subscriptable
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\rwkv\modeling_rwkv.py:714: TypeError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
Downloading config.json: 100%|██████████| 521/521 [00:00<?, ?B/s]
Downloading pytorch_model.bin: 100%|██████████| 677M/677M [00:34<00:00, 19.9MB/s]
Downloading generation_config.json: 100%|██████████| 116/116 [00:00<?, ?B/s]
Downloading tokenizer_config.json: 100%|██████████| 264/264 [00:00<00:00, 264kB/s]
Downloading tokenizer.json: 100%|██████████| 2.11M/2.11M [00:00<00:00, 2.84MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 99.0/99.0 [00:00<?, ?B/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
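For the `'QuantState' object is not subscriptable` error: the calling code indexes `quant_state[0]`, which only works when the quantization state is a plain sequence. A minimal sketch of a version-tolerant accessor follows. `get_absmax` and `rescale_factor` are hypothetical helper names, and the two stand-in objects only imitate the list-style and object-style states so the snippet runs without bitsandbytes installed.

```python
from types import SimpleNamespace


def get_absmax(quant_state):
    """Return the absmax statistics from either state layout.

    Hypothetical helper: object-style states are assumed to expose an
    `absmax` attribute, list-style states are assumed to keep it at index 0.
    """
    if hasattr(quant_state, "absmax"):
        return quant_state.absmax
    return quant_state[0]


def rescale_factor(block_id, rescale_every):
    """Same factor as in `_rescale_layers`: 2 ** (block_id // rescale_every)."""
    return 2 ** int(block_id // rescale_every)


# Stand-ins for the two layouts (plain Python lists, not real tensors).
legacy_state = [[0.5, 0.25], "shape", "dtype"]
object_state = SimpleNamespace(absmax=[0.5, 0.25])

print(get_absmax(legacy_state), get_absmax(object_state))
print(rescale_factor(7, 6))
```

This only covers the attribute access. Whether dividing `absmax` in place is still correct when `bnb_4bit_use_double_quant=True` is a separate question that this sketch does not answer.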
_____________________________________________________________________________________________ Bnb4BitGPT2Test.test_generate_quality _____________________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitGPT2Test testMethod=test_generate_quality>
def test_generate_quality(self):
r"""
Test the generation quality of the quantized model and see that we are matching the expected output.
Given that we are operating on small numbers + the testing model is relatively small, we might not get
the same output across GPUs. So we'll generate few tokens (5-10) and check their output.
"""
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = self.model_4bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
> self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
E AssertionError: 'Hello my name is John Doe. I am a man. I am' not found in {'Hello my name is John.\nI am a friend of your father.\n', 'Hello my name is John Doe, I am a student at the University', 'Hello my name is John and I am a professional photographer. I'}
tests\quantization\bnb\test_4bit.py:222: AssertionError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
_________________________________________________________________________________________ Bnb4BitGPT2Test.test_generate_quality_config __________________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitGPT2Test testMethod=test_generate_quality_config>
def test_generate_quality_config(self):
r"""
Test that loading the model with the config is equivalent
"""
bnb_config = BitsAndBytesConfig()
bnb_config.load_in_4bit = True
model_4bit_from_config = AutoModelForCausalLM.from_pretrained(
self.model_name, quantization_config=bnb_config, device_map="auto"
)
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model_4bit_from_config.generate(
input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10
)
> self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
E AssertionError: 'Hello my name is John Doe. I am a man. I am' not found in {'Hello my name is John.\nI am a friend of your father.\n', 'Hello my name is John Doe, I am a student at the University', 'Hello my name is John and I am a professional photographer. I'}
tests\quantization\bnb\test_4bit.py:240: AssertionError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
_____________________________________________________________________________________________ Bnb4BitGPT2Test.test_memory_footprint _____________________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitGPT2Test testMethod=test_memory_footprint>
def test_memory_footprint(self):
r"""
A simple test to check if the model conversion has been done correctly by checking on the
memory footprint of the converted model and the class type of the linear layers of the converted models
"""
from bitsandbytes.nn import Params4bit
mem_fp16 = self.model_fp16.get_memory_footprint()
mem_4bit = self.model_4bit.get_memory_footprint()
> self.assertAlmostEqual(mem_fp16 / mem_4bit, self.EXPECTED_RELATIVE_DIFFERENCE)
E AssertionError: 1.0 != 3.3191854854152187 within 7 places (2.3191854854152187 difference)
tests\quantization\bnb\test_4bit.py:169: AssertionError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
______________________________________________________________________________________________ Bnb4BitGPT2Test.test_original_dtype ______________________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitGPT2Test testMethod=test_original_dtype>
def test_original_dtype(self):
r"""
A simple test to check if the model succesfully stores the original dtype
"""
> self.assertTrue(hasattr(self.model_4bit.config, "_pre_quantization_dtype"))
E AssertionError: False is not true
tests\quantization\bnb\test_4bit.py:177: AssertionError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
________________________________________________________________________________________________ Bnb4BitGPT2Test.test_rwkv_4bit _________________________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitGPT2Test testMethod=test_rwkv_4bit>
def test_rwkv_4bit(self):
r"""
A simple test to check if 4-bit RWKV inference works as expected.
"""
model_id = "RWKV/rwkv-4-169m-pile"
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
tok = AutoTokenizer.from_pretrained(model_id)
text = "Hello my name is"
input_ids = tok.encode(text, return_tensors="pt").to(0)
> _ = model.generate(input_ids, max_new_tokens=30)
tests\quantization\bnb\test_4bit.py:211:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py:115: in decorate_context
return func(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\generation\utils.py:1522: in generate
return self.greedy_search(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\generation\utils.py:2339: in greedy_search
outputs = self(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1518: in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1527: in _call_impl
return forward_call(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:165: in new_forward
output = old_forward(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\rwkv\modeling_rwkv.py:789: in forward
rwkv_outputs = self.rwkv(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1518: in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py:1527: in _call_impl
return forward_call(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:165: in new_forward
output = old_forward(*args, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\rwkv\modeling_rwkv.py:642: in forward
self._rescale_layers()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = RwkvModel(
(embeddings): Embedding(50277, 768)
(blocks): ModuleList(
(0): RwkvBlock(
(pre_ln): LayerNorm...72, out_features=768, bias=False)
)
)
)
(ln_out): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
def _rescale_layers(self):
# Layers should be rescaled for inference only.
if self.layers_are_rescaled == (not self.training):
return
if self.config.rescale_every > 0:
with torch.no_grad():
for block_id, block in enumerate(self.blocks):
if self.training:
block.attention.output.weight.mul_(2 ** int(block_id // self.config.rescale_every))
block.feed_forward.value.weight.mul_(2 ** int(block_id // self.config.rescale_every))
else:
# Deal with quantization statistics
if hasattr(block.attention.output.weight, "SCB"):
block.attention.output.weight.SCB.div_(2 ** int(block_id // self.config.rescale_every))
block.feed_forward.value.weight.SCB.div_(2 ** int(block_id // self.config.rescale_every))
elif hasattr(block.attention.output.weight, "quant_state"):
> block.attention.output.weight.quant_state[0].div_(
2 ** int(block_id // self.config.rescale_every)
)
E TypeError: 'QuantState' object is not subscriptable
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\rwkv\modeling_rwkv.py:714: TypeError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
___________________________________________________________________________________________ BaseSerializationTest.test_serialization ____________________________________________________________________________________________
self = <bnb.test_4bit.BaseSerializationTest testMethod=test_serialization>, quant_type = 'nf4', double_quant = True, safe_serialization = True
def test_serialization(self, quant_type="nf4", double_quant=True, safe_serialization=True):
r"""
Test whether it is possible to serialize a model in 4-bit. Uses most typical params as default.
See ExtendedSerializationTest class for more params combinations.
"""
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
self.quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
)
> model_0 = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=self.quantization_config,
device_map=torch_device,
)
tests\quantization\bnb\test_4bit.py:543:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
Downloading tokenizer_config.json: 100%|██████████| 685/685 [00:00<00:00, 680kB/s]
Downloading config.json: 100%|██████████| 651/651 [00:00<?, ?B/s]
Downloading vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 1.60MB/s]
Downloading merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 821kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 441/441 [00:00<?, ?B/s]
Downloading pytorch_model.bin: 100%|██████████| 251M/251M [00:04<00:00, 60.3MB/s]
Downloading generation_config.json: 100%|██████████| 137/137 [00:00<?, ?B/s]
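To make the buffer sizing in the `quantize_4bit` source above easier to follow, here is the same arithmetic in plain Python. `quantize_4bit_buffer_sizes` is a hypothetical name, and `storage_bytes=1` stands in for `dtype2bytes[torch.uint8]`, which is assumed to be 1.

```python
def quantize_4bit_buffer_sizes(n, blocksize=64, storage_bytes=1):
    """Mirror the size arithmetic from the quoted `quantize_4bit` source.

    n             -- number of elements in the input tensor (A.numel())
    blocksize     -- elements per quantization block
    storage_bytes -- assumed value of dtype2bytes[quant_storage]
    Returns (number of absmax entries, number of packed storage elements).
    """
    # One absmax value per block, rounding up for a partial last block.
    blocks = n // blocksize
    blocks += 1 if n % blocksize > 0 else 0

    # Two 4-bit values fit in each byte of storage.
    mod = storage_bytes * 2
    packed = (n + 1) // mod
    return blocks, packed


print(quantize_4bit_buffer_sizes(128))  # (2, 64)
print(quantize_4bit_buffer_sizes(130))  # (3, 65)
```

In the failing call the input `A` is already `torch.uint8` with shape `(N, 1)`, which is the shape of `out` here. That suggests already-packed weights are being passed through quantization a second time when the parameter is moved to the device.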
________________________________________________________________________________________ ExtendedSerializationTest.test_fp4_double_safe _________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_fp4_double_safe>
def test_fp4_double_safe(self):
> self.test_serialization(quant_type="fp4", double_quant=True, safe_serialization=True)
tests\quantization\bnb\test_4bit.py:634:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 65],
[ 26],
[ 70],
...,
[103],
[199],
[167]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_______________________________________________________________________________________ ExtendedSerializationTest.test_fp4_double_unsafe ________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_fp4_double_unsafe>
def test_fp4_double_unsafe(self):
> self.test_serialization(quant_type="fp4", double_quant=True, safe_serialization=False)
tests\quantization\bnb\test_4bit.py:631:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 65],
[ 26],
[ 70],
...,
[103],
[199],
[167]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
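The locals in these tracebacks show `A` arriving with `dtype=torch.uint8`, which matches none of the float32/float16/bfloat16 branches and so falls through to the final `else`. Below is a small sketch of that dispatch as a lookup table, only to make the failing path easier to see. The kernel names are the ones in the listing above; the function `select_kernel`, the `KERNELS` table, and the use of dtype names as plain strings are assumptions for illustration, not the library's API.

```python
# Hypothetical illustration of the dtype dispatch in quantize_4bit.
# Kernel names come from the traceback listing; everything else is assumed.
KERNELS = {
    ("float32", "fp4"): "cquantize_blockwise_fp32_fp4",
    ("float32", "nf4"): "cquantize_blockwise_fp32_nf4",
    ("float16", "fp4"): "cquantize_blockwise_fp16_fp4",
    ("float16", "nf4"): "cquantize_blockwise_fp16_nf4",
    ("bfloat16", "fp4"): "cquantize_blockwise_bf16_fp4",
    ("bfloat16", "nf4"): "cquantize_blockwise_bf16_nf4",
}


def select_kernel(dtype_name, quant_type):
    """Return the kernel name for a (dtype, quant_type) pair, or raise."""
    if quant_type not in ("fp4", "nf4"):
        raise NotImplementedError(
            f"4-bit quantization data type {quant_type} is not implemented."
        )
    try:
        return KERNELS[(dtype_name, quant_type)]
    except KeyError:
        # Same message as the failing branch in the log.
        raise ValueError(
            "Blockwise quantization only supports 16/32-bit floats, "
            f"but got {dtype_name}"
        ) from None


if __name__ == "__main__":
    print(select_kernel("bfloat16", "nf4"))
    try:
        select_kernel("uint8", "fp4")
    except ValueError as err:
        print(err)
```

A `uint8` input at this point would be consistent with the weights already being packed 4-bit data when `.to(device)` triggers `_quantize` again, though the log alone does not confirm that.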
________________________________________________________________________________________ ExtendedSerializationTest.test_fp4_single_safe _________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_fp4_single_safe>
def test_fp4_single_safe(self):
> self.test_serialization(quant_type="fp4", double_quant=False, safe_serialization=True)
tests\quantization\bnb\test_4bit.py:628:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 65],
[ 26],
[ 70],
...,
[103],
[199],
[167]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_______________________________________________________________________________________ ExtendedSerializationTest.test_fp4_single_unsafe ________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_fp4_single_unsafe>
def test_fp4_single_unsafe(self):
> self.test_serialization(quant_type="fp4", double_quant=False, safe_serialization=False)
tests\quantization\bnb\test_4bit.py:625:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 65],
[ 26],
[ 70],
...,
[103],
[199],
[167]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_______________________________________________________________________________________ ExtendedSerializationTest.test_nf4_double_unsafe ________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_nf4_double_unsafe>
def test_nf4_double_unsafe(self):
> self.test_serialization(quant_type="nf4", double_quant=True, safe_serialization=False)
tests\quantization\bnb\test_4bit.py:620:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
________________________________________________________________________________________ ExtendedSerializationTest.test_nf4_single_safe _________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_nf4_single_safe>
def test_nf4_single_safe(self):
> self.test_serialization(quant_type="nf4", double_quant=False, safe_serialization=True)
tests\quantization\bnb\test_4bit.py:617:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_______________________________________________________________________________________ ExtendedSerializationTest.test_nf4_single_unsafe ________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_nf4_single_unsafe>
def test_nf4_single_unsafe(self):
> self.test_serialization(quant_type="nf4", double_quant=False, safe_serialization=False)
tests\quantization\bnb\test_4bit.py:614:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_________________________________________________________________________________________ ExtendedSerializationTest.test_serialization __________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_serialization>, quant_type = 'nf4', double_quant = True, safe_serialization = True
def test_serialization(self, quant_type="nf4", double_quant=True, safe_serialization=True):
r"""
Test whether it is possible to serialize a model in 4-bit. Uses most typical params as default.
See ExtendedSerializationTest class for more params combinations.
"""
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
self.quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
)
> model_0 = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=self.quantization_config,
device_map=torch_device,
)
tests\quantization\bnb\test_4bit.py:543:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
___________________________________________________________________________________________ BloomSerializationTest.test_serialization ___________________________________________________________________________________________
self = <bnb.test_4bit.BloomSerializationTest testMethod=test_serialization>, quant_type = 'nf4', double_quant = True, safe_serialization = True
def test_serialization(self, quant_type="nf4", double_quant=True, safe_serialization=True):
r"""
Test whether it is possible to serialize a model in 4-bit. Uses most typical params as default.
See ExtendedSerializationTest class for more params combinations.
"""
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
self.quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
)
> model_0 = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=self.quantization_config,
device_map=torch_device,
)
tests\quantization\bnb\test_4bit.py:543:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\models\auto\auto_factory.py:484: in from_pretrained
return model_class.from_pretrained(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:2937: in from_pretrained
dispatch_model(model, **kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 43],
[ 50],
[197],
...,
[148],
[117],
[ 35]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
Downloading tokenizer_config.json: 100%|██████████| 222/222 [00:00<?, ?B/s]
Downloading tokenizer.json: 100%|██████████| 14.5M/14.5M [00:00<00:00, 59.6MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 85.0/85.0 [00:00<?, ?B/s]
____________________________________________________________________________________________ GPTSerializationTest.test_serialization ____________________________________________________________________________________________
self = <bnb.test_4bit.GPTSerializationTest testMethod=test_serialization>, quant_type = 'nf4', double_quant = True, safe_serialization = True
def test_serialization(self, quant_type="nf4", double_quant=True, safe_serialization=True):
r"""
Test whether it is possible to serialize a model in 4-bit. Uses most typical params as default.
See ExtendedSerializationTest class for more params combinations.
"""
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
self.quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
)
model_0 = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=self.quantization_config,
device_map=torch_device,
)
with tempfile.TemporaryDirectory() as tmpdirname:
> model_0.save_pretrained(tmpdirname, safe_serialization=safe_serialization)
tests\quantization\bnb\test_4bit.py:550:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(50257, 1600)
(wpe): Embedding(1024, 1600)
(d...1600,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=1600, out_features=50257, bias=False)
)
save_directory = 'C:\\Users\\WK\\AppData\\Local\\Temp\\tmpam_q0sfp', is_main_process = True, state_dict = None, save_function = <function save at 0x00000000134056C0>, push_to_hub = False, max_shard_size = '10GB'
safe_serialization = True, variant = None, kwargs = {}
def save_pretrained(
self,
save_directory: Union[str, os.PathLike],
is_main_process: bool = True,
state_dict: Optional[dict] = None,
save_function: Callable = torch.save,
push_to_hub: bool = False,
max_shard_size: Union[int, str] = "10GB",
safe_serialization: bool = False,
variant: Optional[str] = None,
**kwargs,
):
"""
Save a model and its configuration file to a directory, so that it can be re-loaded using the
[`~PreTrainedModel.from_pretrained`] class method.
Arguments:
save_directory (`str` or `os.PathLike`):
Directory to which to save. Will be created if it doesn't exist.
is_main_process (`bool`, *optional*, defaults to `True`):
Whether the process calling this is the main process or not. Useful when in distributed training like
TPUs and need to call this function on all processes. In this case, set `is_main_process=True` only on
the main process to avoid race conditions.
state_dict (nested dictionary of `torch.Tensor`):
The state dictionary of the model to save. Will default to `self.state_dict()`, but can be used to only
save parts of the model or if special precautions need to be taken when recovering the state dictionary
of a model (like when using model parallelism).
save_function (`Callable`):
The function to use to save the state dictionary. Useful on distributed training like TPUs when one
need to replace `torch.save` by another method.
push_to_hub (`bool`, *optional*, defaults to `False`):
Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
namespace).
max_shard_size (`int` or `str`, *optional*, defaults to `"10GB"`):
The maximum size for a checkpoint before being sharded. Checkpoints shard will then be each of size
lower than this size. If expressed as a string, needs to be digits followed by a unit (like `"5MB"`).
<Tip warning={true}>
If a single weight of the model is bigger than `max_shard_size`, it will be in its own checkpoint shard
which will be bigger than `max_shard_size`.
</Tip>
safe_serialization (`bool`, *optional*, defaults to `False`):
Whether to save the model using `safetensors` or the traditional PyTorch way (that uses `pickle`).
variant (`str`, *optional*):
If specified, weights are saved in the format pytorch_model.<variant>.bin.
kwargs:
Additional key word arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
"""
# Checks if the model has been loaded in 8-bit
if getattr(self, "is_loaded_in_8bit", False) and getattr(self, "is_8bit_serializable", False):
warnings.warn(
"You are calling `save_pretrained` to a 8-bit converted model you may likely encounter unexepected"
" behaviors. If you want to save 8-bit models, make sure to have `bitsandbytes>0.37.2` installed.",
UserWarning,
)
if getattr(self, "is_loaded_in_4bit", False):
> raise NotImplementedError(
"You are calling `save_pretrained` on a 4-bit converted model. This is currently not supported"
)
E NotImplementedError: You are calling `save_pretrained` on a 4-bit converted model. This is currently not supported
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\modeling_utils.py:1716: NotImplementedError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
___________________________________________________________________________________ Bnb4BitTestBasicConfigTest.test_load_in_4_and_8_bit_fails ___________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitTestBasicConfigTest testMethod=test_load_in_4_and_8_bit_fails>
def test_load_in_4_and_8_bit_fails(self):
> with self.assertRaisesRegex(ValueError, "load_in_4bit and load_in_8bit are both True"):
E AssertionError: ValueError not raised
tests\quantization\bnb\test_4bit.py:659: AssertionError
_______________________________________________________________________________________ Bnb4BitTestBasicConfigTest.test_set_load_in_8_bit _______________________________________________________________________________________
self = <bnb.test_4bit.Bnb4BitTestBasicConfigTest testMethod=test_set_load_in_8_bit>
def test_set_load_in_8_bit(self):
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
> with self.assertRaisesRegex(ValueError, "load_in_4bit and load_in_8bit are both True"):
E AssertionError: ValueError not raised
tests\quantization\bnb\test_4bit.py:664: AssertionError
======================================================================================================= warnings summary ========================================================================================================
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\_pytest\config\__init__.py:1373
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\_pytest\config\__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_generate_quality
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_generate_quality_config
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Pipeline4BitTest::test_pipeline
tests/quantization/bnb/test_4bit.py::Bnb4BitTestTraining::test_training
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:248: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
warnings.warn('Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.')
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\utils\cpp_extension.py:28: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
from pkg_resources import packaging # type: ignore[attr-defined]
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\pkg_resources\__init__.py:2868: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\utils\cpp_extension.py:383: UserWarning: Error checking compiler version for clang++: Command 'clang++' returned non-zero exit status 1.
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
tests/quantization/bnb/test_4bit.py::Bnb4BitGPT2Test::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\generation\utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\transformers\generation\utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:245: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference.
warnings.warn('Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference.')
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================== short test summary info ====================================================================================================
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_original_dtype - AssertionError: False is not true
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_quantization_num_parameters - AssertionError: 1118429184 != 1722408960
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit - TypeError: 'QuantState' object is not subscriptable
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitGPT2Test::test_generate_quality - AssertionError: 'Hello my name is John Doe. I am a man. I am' not found in {'Hello my name is John.\nI am a friend of your father.\n', 'Hello my name is John Doe, I am a student at the University', 'Hello my name is John ...
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitGPT2Test::test_generate_quality_config - AssertionError: 'Hello my name is John Doe. I am a man. I am' not found in {'Hello my name is John.\nI am a friend of your father.\n', 'Hello my name is John Doe, I am a student at the University', 'Hello my name is John ...
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitGPT2Test::test_memory_footprint - AssertionError: 1.0 != 3.3191854854152187 within 7 places (2.3191854854152187 difference)
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitGPT2Test::test_original_dtype - AssertionError: False is not true
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitGPT2Test::test_rwkv_4bit - TypeError: 'QuantState' object is not subscriptable
FAILED tests/quantization/bnb/test_4bit.py::BaseSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_fp4_double_safe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_fp4_double_unsafe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_fp4_single_safe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_fp4_single_unsafe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_nf4_double_unsafe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_nf4_single_safe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_nf4_single_unsafe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::BloomSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::GPTSerializationTest::test_serialization - NotImplementedError: You are calling `save_pretrained` on a 4-bit converted model. This is currently not supported
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitTestBasicConfigTest::test_load_in_4_and_8_bit_fails - AssertionError: ValueError not raised
FAILED tests/quantization/bnb/test_4bit.py::Bnb4BitTestBasicConfigTest::test_set_load_in_8_bit - AssertionError: ValueError not raised
=============================================================================== 21 failed, 17 passed, 1 skipped, 19 warnings in 972.20s (0:16:12) ===============================================================================
(venv) D:\src\transformers>
Thanks a lot for running the tests!
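For readers following the `quantize_4bit` source echoed in the tracebacks above, the buffer sizing it performs can be reproduced without a GPU. This is a minimal sketch of that arithmetic only, not of the quantization itself. The function name `quantize_4bit_buffer_sizes` and the `storage_bytes` parameter are hypothetical, and it assumes `dtype2bytes[torch.uint8]` is 1, so that two 4-bit values are packed per storage byte.

```python
# Sketch of the buffer-size arithmetic visible in the quantize_4bit source
# from the traceback. Names here are illustrative, not part of bitsandbytes.

ALLOWED_BLOCKSIZES = (4096, 2048, 1024, 512, 256, 128, 64)


def quantize_4bit_buffer_sizes(n, blocksize=64, storage_bytes=1):
    """Return (number of absmax entries, number of packed storage elements).

    n             -- number of elements in the input tensor (A.numel())
    blocksize     -- elements per independently quantized block
    storage_bytes -- assumed size in bytes of the quant_storage dtype
    """
    if blocksize not in ALLOWED_BLOCKSIZES:
        raise ValueError(f"unsupported blocksize: {blocksize}")

    # One absmax value per block, rounding up for a partial final block.
    blocks = n // blocksize
    blocks += 1 if n % blocksize > 0 else 0

    # Each storage byte holds two 4-bit values.
    mod = storage_bytes * 2
    packed = (n + 1) // mod
    return blocks, packed


if __name__ == "__main__":
    print(quantize_4bit_buffer_sizes(128))
    print(quantize_4bit_buffer_sizes(130))
```

The failing calls never reach this point in a meaningful way: the input `A` is already `torch.uint8`, so the dtype dispatch falls through to the `ValueError` branch.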
(venv) D:\src\transformers>pip show transformers
WARNING: Ignoring invalid distribution -afetensors (f:\webui\webui\stable-diffusion-webui\venv\lib\site-packages)
WARNING: Ignoring invalid distribution -itsandbytes (f:\webui\webui\stable-diffusion-webui\venv\lib\site-packages)
WARNING: Ignoring invalid distribution -orch (f:\webui\webui\stable-diffusion-webui\venv\lib\site-packages)
WARNING: Ignoring invalid distribution -rotobuf (f:\webui\webui\stable-diffusion-webui\venv\lib\site-packages)
Name: transformers
Version: 4.38.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: f:\webui\webui\stable-diffusion-webui\venv\lib\site-packages
Editable project location: D:\src\transformers
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: groundingdino, image-reward, lycoris-lora
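The `pip show` output above lists both a `site-packages` location and an editable project location. The earlier tracebacks resolve `transformers` under `site-packages`, while the run below resolves it under `src\transformers`, so the two runs appear to have imported different copies. A small check like the following, a sketch that uses only the standard library, can confirm which version and path Python will actually pick up. The helper name `describe_package` is hypothetical.

```python
# Report the installed version and the import location of a package,
# without importing the package itself.
import importlib.util
from importlib import metadata


def describe_package(name):
    """Return (version, origin) for a top-level package name.

    version is None when no distribution metadata is found;
    origin is None when the package cannot be located for import.
    """
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None

    spec = importlib.util.find_spec(name)
    origin = spec.origin if spec is not None else None
    return version, origin


if __name__ == "__main__":
    for pkg in ("transformers", "bitsandbytes", "accelerate"):
        version, origin = describe_package(pkg)
        print(f"{pkg}: version={version} origin={origin}")
```

Running it inside the same virtual environment as the tests shows whether the editable checkout or the `site-packages` copy takes precedence.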
(venv) D:\src\transformers>python -m pytest tests\quantization\bnb\test_4bit.py
====================================================================================================== test session starts ======================================================================================================
platform win32 -- Python 3.10.11, pytest-7.4.2, pluggy-1.3.0
rootdir: D:\src\transformers
configfile: pyproject.toml
plugins: anyio-3.7.1, hydra-core-1.3.2, hypothesis-6.93.0, xdist-3.5.0
collected 39 items
tests\quantization\bnb\test_4bit.py ..............s...........FFFFFFFFFFF.. [100%]
=========================================================================================================== FAILURES ============================================================================================================
___________________________________________________________________________________________ BaseSerializationTest.test_serialization ____________________________________________________________________________________________
self = <bnb.test_4bit.BaseSerializationTest testMethod=test_serialization>, quant_type = 'nf4', double_quant = True, safe_serialization = True
def test_serialization(self, quant_type="nf4", double_quant=True, safe_serialization=True):
r"""
Test whether it is possible to serialize a model in 4-bit. Uses most typical params as default.
See ExtendedSerializationTest class for more params combinations.
"""
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
self.quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
)
> model_0 = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=self.quantization_config,
device_map=torch_device,
)
tests\quantization\bnb\test_4bit.py:543:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
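The failure above comes from the final `else` branch of `quantize_4bit`: the input `A` is already `torch.uint8`, so none of the float branches match. The dispatch can be sketched without torch or a GPU. This is only an illustration: `pick_kernel` and the string dtype names are hypothetical stand-ins, not bitsandbytes API, though the kernel names mirror the ones in the traceback.

```python
# Hypothetical, torch-free sketch of the dtype dispatch shown in the traceback.
# Dtypes are plain strings here; the real code compares torch dtypes.
_PREFIX = {"float32": "fp32", "float16": "fp16", "bfloat16": "bf16"}


def pick_kernel(dtype: str, quant_type: str = "fp4") -> str:
    """Return the name of the kernel the traceback's branches would select."""
    if quant_type not in ("fp4", "nf4"):
        raise NotImplementedError(
            f"4-bit quantization data type {quant_type} is not implemented."
        )
    if dtype not in _PREFIX:
        # Mirrors the final `else` branch that raised in the log above.
        raise ValueError(
            f"Blockwise quantization only supports 16/32-bit floats, but got {dtype}"
        )
    return f"cquantize_blockwise_{_PREFIX[dtype]}_{quant_type}"


if __name__ == "__main__":
    print(pick_kernel("bfloat16", "nf4"))
    try:
        pick_kernel("uint8")
    except ValueError as exc:
        print(exc)
```

With a float dtype the sketch returns a kernel name. With `"uint8"` it raises the same message as the log, consistent with `A` arriving already packed rather than as a float weight.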
________________________________________________________________________________________ ExtendedSerializationTest.test_fp4_double_safe _________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_fp4_double_safe>
def test_fp4_double_safe(self):
> self.test_serialization(quant_type="fp4", double_quant=True, safe_serialization=True)
tests\quantization\bnb\test_4bit.py:634:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 65],
[ 26],
[ 70],
...,
[103],
[199],
[167]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
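For reading the `absmax` and `out` shapes in these tracebacks, the buffer-size arithmetic from the `quantize_4bit` listing can be worked through in isolation. This is a rough sketch: `quant_buffer_sizes` is a hypothetical helper, and `storage_bytes` stands in for `dtype2bytes[quant_storage]`, taken to be 1 for `torch.uint8`.

```python
from typing import Tuple

# Hypothetical helper reproducing the size arithmetic from the listing above.
# It is not part of bitsandbytes.


def quant_buffer_sizes(
    n: int, blocksize: int = 64, storage_bytes: int = 1
) -> Tuple[int, int]:
    """Return (number of absmax blocks, number of packed output rows)."""
    assert blocksize in (4096, 2048, 1024, 512, 256, 128, 64)
    # One absmax entry per block, rounding up for a partial last block.
    blocks = n // blocksize
    blocks += 1 if n % blocksize > 0 else 0
    # Two 4-bit values fit in each storage byte.
    mod = storage_bytes * 2
    rows = (n + 1) // mod
    return blocks, rows


if __name__ == "__main__":
    print(quant_buffer_sizes(130))  # (3, 65)
    print(quant_buffer_sizes(128))  # (2, 64)
```

For example, 130 elements with `blocksize=64` need 3 `absmax` entries (two full blocks plus a partial one) and 65 packed `uint8` rows.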
_______________________________________________________________________________________ ExtendedSerializationTest.test_fp4_double_unsafe ________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_fp4_double_unsafe>
def test_fp4_double_unsafe(self):
> self.test_serialization(quant_type="fp4", double_quant=True, safe_serialization=False)
tests\quantization\bnb\test_4bit.py:631:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 65],
[ 26],
[ 70],
...,
[103],
[199],
[167]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
________________________________________________________________________________________ ExtendedSerializationTest.test_fp4_single_safe _________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_fp4_single_safe>
def test_fp4_single_safe(self):
> self.test_serialization(quant_type="fp4", double_quant=False, safe_serialization=True)
tests\quantization\bnb\test_4bit.py:628:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 65],
[ 26],
[ 70],
...,
[103],
[199],
[167]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_______________________________________________________________________________________ ExtendedSerializationTest.test_fp4_single_unsafe ________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_fp4_single_unsafe>
def test_fp4_single_unsafe(self):
> self.test_serialization(quant_type="fp4", double_quant=False, safe_serialization=False)
tests\quantization\bnb\test_4bit.py:625:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 65],
[ 26],
[ 70],
...,
[103],
[199],
[167]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_______________________________________________________________________________________ ExtendedSerializationTest.test_nf4_double_unsafe ________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_nf4_double_unsafe>
def test_nf4_double_unsafe(self):
> self.test_serialization(quant_type="nf4", double_quant=True, safe_serialization=False)
tests\quantization\bnb\test_4bit.py:620:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
________________________________________________________________________________________ ExtendedSerializationTest.test_nf4_single_safe _________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_nf4_single_safe>
def test_nf4_single_safe(self):
> self.test_serialization(quant_type="nf4", double_quant=False, safe_serialization=True)
tests\quantization\bnb\test_4bit.py:617:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_______________________________________________________________________________________ ExtendedSerializationTest.test_nf4_single_unsafe ________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_nf4_single_unsafe>
def test_nf4_single_unsafe(self):
> self.test_serialization(quant_type="nf4", double_quant=False, safe_serialization=False)
tests\quantization\bnb\test_4bit.py:614:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\quantization\bnb\test_4bit.py:543: in test_serialization
model_0 = AutoModelForCausalLM.from_pretrained(
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
_________________________________________________________________________________________ ExtendedSerializationTest.test_serialization __________________________________________________________________________________________
self = <bnb.test_4bit.ExtendedSerializationTest testMethod=test_serialization>, quant_type = 'nf4', double_quant = True, safe_serialization = True
def test_serialization(self, quant_type="nf4", double_quant=True, safe_serialization=True):
r"""
Test whether it is possible to serialize a model in 4-bit. Uses most typical params as default.
See ExtendedSerializationTest class for more params combinations.
"""
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
self.quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
)
> model_0 = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=self.quantization_config,
device_map=torch_device,
)
tests\quantization\bnb\test_4bit.py:543:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[199],
[113],
[185],
...,
[138],
[ 74],
[ 26]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
___________________________________________________________________________________________ BloomSerializationTest.test_serialization ___________________________________________________________________________________________
self = <bnb.test_4bit.BloomSerializationTest testMethod=test_serialization>, quant_type = 'nf4', double_quant = True, safe_serialization = True
def test_serialization(self, quant_type="nf4", double_quant=True, safe_serialization=True):
r"""
Test whether it is possible to serialize a model in 4-bit. Uses most typical params as default.
See ExtendedSerializationTest class for more params combinations.
"""
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
self.quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
)
> model_0 = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=self.quantization_config,
device_map=torch_device,
)
tests\quantization\bnb\test_4bit.py:543:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 43],
[ 50],
[197],
...,
[148],
[117],
[ 35]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
____________________________________________________________________________________________ GPTSerializationTest.test_serialization ____________________________________________________________________________________________
self = <bnb.test_4bit.GPTSerializationTest testMethod=test_serialization>, quant_type = 'nf4', double_quant = True, safe_serialization = True
def test_serialization(self, quant_type="nf4", double_quant=True, safe_serialization=True):
r"""
Test whether it is possible to serialize a model in 4-bit. Uses most typical params as default.
See ExtendedSerializationTest class for more params combinations.
"""
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
self.quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
)
> model_0 = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=self.quantization_config,
device_map=torch_device,
)
tests\quantization\bnb\test_4bit.py:543:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src\transformers\models\auto\auto_factory.py:567: in from_pretrained
return model_class.from_pretrained(
src\transformers\modeling_utils.py:3560: in from_pretrained
dispatch_model(model, **device_map_kwargs)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\big_modeling.py:371: in dispatch_model
attach_align_device_hook_on_blocks(
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:506: in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:155: in add_hook_to_module
module = hook.init_hook(module)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py:253: in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\accelerate\utils\modeling.py:320: in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:211: in to
return self._quantize(device)
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:183: in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([[ 87],
[234],
[ 84],
...,
[234],
[135],
[175]], device='cuda:0', dtype=torch.uint8), absmax = tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')
out = tensor([[0],
[0],
[0],
...,
[0],
[0],
[0]], device='cuda:0', dtype=torch.uint8), blocksize = 64, compress_statistics = True, quant_type = 'fp4', quant_storage = torch.uint8
def quantize_4bit(
A: Tensor,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type='fp4',
quant_storage=torch.uint8,
) -> Tuple[Tensor, QuantState]:
"""
Quantize tensor A in blocks of 4-bit values.
Quantizes tensor A by dividing it into blocks which are independently quantized to FP4.
Parameters
----------
A : torch.Tensor
The input tensor.
absmax : torch.Tensor
The absmax values.
out : torch.Tensor
The output tensor.
blocksize : int
The blocksize used in quantization.
quant_type : str
The 4-bit quantization data type {fp4, nf4}
Returns
-------
torch.Tensor:
Tensor with packed 4-bit values.
tuple(torch.Tensor, torch.Size, torch.dtype, int):
The quantization state to undo the quantization.
"""
if A.device.type != 'cuda':
raise NotImplementedError(f'Device type not supported for FP4 quantization: {A.device.type}')
if quant_type not in ['fp4', 'nf4']:
raise NotImplementedError(f'4-bit quantization data type {quant_type} is not implemented.')
n = A.numel()
input_shape = A.shape
if absmax is None:
blocks = n // blocksize
blocks += 1 if n % blocksize > 0 else 0
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
if out is None:
mod = dtype2bytes[quant_storage] * 2
out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
assert blocksize in [4096, 2048, 1024, 512, 256, 128, 64]
prev_device = pre_call(A.device)
is_on_gpu([A, out, absmax])
if A.dtype == torch.float32:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp32_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp32_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.float16:
if quant_type == 'fp4':
lib.cquantize_blockwise_fp16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_fp16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
elif A.dtype == torch.bfloat16:
if quant_type == 'fp4':
lib.cquantize_blockwise_bf16_fp4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
lib.cquantize_blockwise_bf16_nf4(get_ptr(None), get_ptr(A), get_ptr(absmax), get_ptr(out), ct.c_int32(blocksize), ct.c_int(n))
else:
> raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
E ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\functional.py:994: ValueError
======================================================================================================= warnings summary ========================================================================================================
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\_pytest\config\__init__.py:1373
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\_pytest\config\__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")
tests/quantization/bnb/test_4bit.py: 10 warnings
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:248: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
warnings.warn('Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.')
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\utils\cpp_extension.py:28: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
from pkg_resources import packaging # type: ignore[attr-defined]
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\pkg_resources\__init__.py:2868: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\utils\cpp_extension.py:383: UserWarning: Error checking compiler version for clang++: Command 'clang++' returned non-zero exit status 1.
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
D:\src\transformers\src\transformers\generation\utils.py:1133: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:245: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference.
warnings.warn('Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference.')
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================== short test summary info ====================================================================================================
FAILED tests/quantization/bnb/test_4bit.py::BaseSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_fp4_double_safe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_fp4_double_unsafe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_fp4_single_safe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_fp4_single_unsafe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_nf4_double_unsafe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_nf4_single_safe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_nf4_single_unsafe - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::BloomSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::GPTSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
=============================================================================== 11 failed, 27 passed, 1 skipped, 22 warnings in 423.85s (0:07:03) =============================================================================== |
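All eleven failures in the summary above carry the same `ValueError`, which is easier to see once the lines are grouped by message. The following is a minimal sketch, assuming only the `FAILED <test id> - <message>` line shape visible in this log; `group_failures` and `SUMMARY_LINE` are hypothetical names, not part of pytest, and the sample contains just two of the lines from the log.

```python
import re
from collections import defaultdict

# Matches pytest "short test summary info" lines of the form:
#   FAILED <test id> - <error message>
SUMMARY_LINE = re.compile(r"^FAILED (?P<test_id>\S+) - (?P<message>.+)$")


def group_failures(summary_text):
    """Map each distinct error message to the list of test ids that reported it."""
    groups = defaultdict(list)
    for line in summary_text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            groups[match["message"]].append(match["test_id"])
    return dict(groups)


# Two lines copied from the summary above.
sample = """\
FAILED tests/quantization/bnb/test_4bit.py::BaseSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
FAILED tests/quantization/bnb/test_4bit.py::GPTSerializationTest::test_serialization - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
"""

for message, tests in group_failures(sample).items():
    print(f"{len(tests)} test(s): {message}")
```

A single group in the result suggests one underlying cause rather than several independent bugs, which matches the observation that only the serialization tests fail.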
Interesting, the great news is that only the serialization tests are failing, can you try to update |
Can you in addition to that run the 8bit tests? 🙏 |
After updating accelerate:
(venv) >pip show accelerate
Name: accelerate
Version: 0.26.1
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: f:\webui\webui\stable-diffusion-webui\venv\lib\site-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: image-reward
(venv) D:\src\transformers>python -m pytest tests\quantization\bnb\test_4bit.py -k "test_serialization"
====================================================================================================== test session starts ======================================================================================================
platform win32 -- Python 3.10.11, pytest-7.4.2, pluggy-1.3.0
rootdir: D:\src\transformers
configfile: pyproject.toml
plugins: anyio-3.7.1, hydra-core-1.3.2, hypothesis-6.93.0, xdist-3.5.0
collected 39 items / 35 deselected / 4 selected
tests\quantization\bnb\test_4bit.py .... [100%]
======================================================================================================= warnings summary ========================================================================================================
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\_pytest\config\__init__.py:1373
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\_pytest\config\__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")
tests/quantization/bnb/test_4bit.py::BaseSerializationTest::test_serialization
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
tests/quantization/bnb/test_4bit.py::BaseSerializationTest::test_serialization
tests/quantization/bnb/test_4bit.py::ExtendedSerializationTest::test_serialization
tests/quantization/bnb/test_4bit.py::BloomSerializationTest::test_serialization
tests/quantization/bnb/test_4bit.py::GPTSerializationTest::test_serialization
D:\src\transformers\src\transformers\quantizers\auto.py:147: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be prevail.
warnings.warn(warning_msg)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================= 4 passed, 35 deselected, 6 warnings in 46.51s =========================================================================================
Test again:
(venv) D:\src\transformers>python -m pytest tests\quantization\bnb\test_4bit.py
====================================================================================================== test session starts ======================================================================================================
platform win32 -- Python 3.10.11, pytest-7.4.2, pluggy-1.3.0
rootdir: D:\src\transformers
configfile: pyproject.toml
plugins: anyio-3.7.1, hydra-core-1.3.2, hypothesis-6.93.0, xdist-3.5.0
collected 39 items
tests\quantization\bnb\test_4bit.py ..............s........................ [100%]
======================================================================================================= warnings summary ========================================================================================================
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\_pytest\config\__init__.py:1373
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\_pytest\config\__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")
tests/quantization/bnb/test_4bit.py: 10 warnings
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:248: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
warnings.warn('Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.')
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\utils\cpp_extension.py:28: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
from pkg_resources import packaging # type: ignore[attr-defined]
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\pkg_resources\__init__.py:2868: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
tests/quantization/bnb/test_4bit.py::Bnb4BitTest::test_rwkv_4bit
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\torch\utils\cpp_extension.py:383: UserWarning: Error checking compiler version for clang++: Command 'clang++' returned non-zero exit status 1.
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
D:\src\transformers\src\transformers\generation\utils.py:1133: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_with_keep_in_fp32
tests/quantization/bnb/test_4bit.py::Bnb4BitT5Test::test_inference_without_keep_in_fp32
F:\webui\webui\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\nn\modules.py:245: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference.
warnings.warn('Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference.')
tests/quantization/bnb/test_4bit.py: 11 warnings
D:\src\transformers\src\transformers\quantizers\auto.py:147: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be prevail.
warnings.warn(warning_msg)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================== 38 passed, 1 skipped, 33 warnings in 441.37s (0:07:21) =====================================================================================
All tests passed for 4bit!! 😎 |
Short test summary info for the mixed_int8 test:
(venv) D:\src\transformers>python -m pytest tests\quantization\bnb\test_mixed_int8.py
====================================================================================================== test session starts ======================================================================================================
platform win32 -- Python 3.10.11, pytest-7.4.2, pluggy-1.3.0
rootdir: D:\src\transformers
configfile: pyproject.toml
plugins: anyio-3.7.1, hydra-core-1.3.2, hypothesis-6.93.0, xdist-3.5.0
collected 43 items
tests\quantization\bnb\test_mixed_int8.py .....................sssss...FF..FFFF......
(snip)...
=========================================================================================================== FAILURES ============================================================================================================
____________________________________________________________________________________________ MixedInt8GPT2Test.test_generate_quality ____________________________________________________________________________________________
self = <bnb.test_mixed_int8.MixedInt8GPT2Test testMethod=test_generate_quality>
def test_generate_quality(self):
r"""
Test the generation quality of the quantized model and see that we are matching the expected output.
Given that we are operating on small numbers + the testing model is relatively small, we might not get
the same output across GPUs. So we'll generate few tokens (5-10) and check their output.
"""
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = self.model_8bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
> self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
E AssertionError: 'Hello my name is John Doe, and I am a member of the' not found in {"Hello my name is John Doe, and I'm a big fan of", "Hello my name is John Doe, and I'm a fan of the"}
tests\quantization\bnb\test_mixed_int8.py:264: AssertionError
----------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
(snip)...
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================== short test summary info ====================================================================================================
FAILED tests/quantization/bnb/test_mixed_int8.py::MixedInt8GPT2Test::test_generate_quality - AssertionError: 'Hello my name is John Doe, and I am a member of the' not found in {"Hello my name is John Doe, and I'm a big fan of", "Hello my name is John Doe, and I'm a fan of the"}
FAILED tests/quantization/bnb/test_mixed_int8.py::MixedInt8GPT2Test::test_generate_quality_config - AssertionError: 'Hello my name is John Doe, and I am a member of the' not found in {"Hello my name is John Doe, and I'm a big fan of", "Hello my name is John Doe, and I'm a fan of the"}
FAILED tests/quantization/bnb/test_mixed_int8.py::MixedInt8GPT2Test::test_int8_from_pretrained - AssertionError: 'Hello my name is John Doe, and I am a member of the' not found in {"Hello my name is John Doe, and I'm a big fan of", "Hello my name is John Doe, and I'm a fan of the"}
FAILED tests/quantization/bnb/test_mixed_int8.py::MixedInt8GPT2Test::test_int8_serialization - AssertionError: 'Hello my name is John Doe, and I am a member of the' not found in {"Hello my name is John Doe, and I'm a big fan of", "Hello my name is John Doe, and I'm a fan of the"}
FAILED tests/quantization/bnb/test_mixed_int8.py::MixedInt8GPT2Test::test_int8_serialization_regression - AssertionError: 'Hello my name is John Doe, and I am a member of the' not found in {"Hello my name is John Doe, and I'm a big fan of", "Hello my name is John Doe, and I'm a fan of the"}
FAILED tests/quantization/bnb/test_mixed_int8.py::MixedInt8GPT2Test::test_int8_serialization_sharded - AssertionError: 'Hello my name is John Doe, and I am a member of the' not found in {"Hello my name is John Doe, and I'm a big fan of", "Hello my name is John Doe, and I'm a fan of the"}
=============================================================================== 6 failed, 32 passed, 5 skipped, 19 warnings in 720.00s (0:11:59) ================================================================================
|
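All six mixed_int8 failures come from the same exact-match check: the decoded text must be a member of `EXPECTED_OUTPUTS`. The sketch below reproduces that check with the strings copied from the log, without loading any model; `is_expected` and `shared_prefix` are hypothetical helper names used only for illustration.

```python
import os

# Strings copied from the assertion message in the log above.
EXPECTED_OUTPUTS = {
    "Hello my name is John Doe, and I'm a big fan of",
    "Hello my name is John Doe, and I'm a fan of the",
}
GENERATED = "Hello my name is John Doe, and I am a member of the"


def is_expected(generated, expected_outputs):
    """Exact membership test, the same comparison assertIn performs on a set."""
    return generated in expected_outputs


def shared_prefix(generated, expected_outputs):
    """Longest character prefix common to the generated text and all references."""
    # os.path.commonprefix compares character by character, so it works on
    # arbitrary strings, not only on file paths.
    return os.path.commonprefix([generated, *expected_outputs])


print(is_expected(GENERATED, EXPECTED_OUTPUTS))
print(repr(shared_prefix(GENERATED, EXPECTED_OUTPUTS)))
```

The generated text agrees with both references for the prompt and the first few new tokens and only then diverges, so the failure is a difference in the continuation rather than a crash or a garbled output.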
AMAZING @wkpark ! 🎉 |
For the record, |
@akx thanks! I meant the transformers repository, not the slow tests in the bnb repository (I think you meant the slow tests for bnb here, no?) |
I was able to build with CUDA 12.0 and run the tests on Windows. Hardware: Software: Build configuration:
I've observed the same crash on the tests in test_optim. When I skip those tests and the benchmark/slow ones, here is my result:
As for the optimizer tests, these complete with 3 failures and 2 skips prior to a crash:
|
Are those tests only failing due to slight deviations from the tolerances? If so, this is expected, because the tests are unfortunately quite flaky (something we'll work on fixing soon). In that case, we could close this issue and be super happy that this whole Windows journey went so well! Thanks again to everyone involved, especially @wkpark and @matthewdouglas ❤️ |
@Titus-von-Koeller Yes, the failures were related to tolerances and the stochastic nature of some of the tests. I get similar results on my Linux machine. I do think the crash on the 32bit optimizer tests was related to the 6GB of VRAM that I have on my Windows machine. It seems those tests need closer to ~12GB to run. Stabilizing these tests is a good separate issue across platforms, but I think we're good closing this one. |
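To illustrate how a tolerance-based check can fail on a slight deviation, here is a minimal sketch using only the standard library. It is not the helper the bitsandbytes tests use; `count_mismatches`, `assert_mostly_close`, the tolerance values, and the sample numbers are all assumptions chosen for illustration.

```python
import math


def count_mismatches(actual, expected, rel_tol=1e-3, abs_tol=1e-3):
    """Count element pairs that differ by more than the given tolerances."""
    return sum(
        not math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


def assert_mostly_close(actual, expected, max_mismatches=0, **tolerances):
    """Raise AssertionError if more than max_mismatches elements are out of tolerance."""
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    mismatches = count_mismatches(actual, expected, **tolerances)
    if mismatches > max_mismatches:
        raise AssertionError(
            f"{mismatches} element(s) out of tolerance, allowed {max_mismatches}"
        )


# Hypothetical values: one element is off by 0.5%, the rest are within 0.1%.
expected = [1.0, 2.0, 3.0, 4.0]
actual = [1.0004, 2.0, 3.0, 4.02]

assert_mostly_close(actual, expected, max_mismatches=1)  # passes with a budget of one
try:
    assert_mostly_close(actual, expected)  # a strict check fails on the same data
except AssertionError as err:
    print(err)
```

With a strict check a single outlier is enough to fail the test, which is one way stochastic results can produce failures on some runs or machines and not on others.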
System Info
OS: Windows10
Python: 3.10
Torch: 2.1.2
GPU: 4060 TI 16GB
Cuda: 11.8
bitsandbytes: latest snapshot
Reproduction
This is just a report on the current state of Windows support.
Expected behavior
These are the test results:
tests\tests_functional.py: 31 failed, 592 passed, 9 skipped in 767.86s (0:12:47)
tests\test_autograd.py: 2240 passed, 704 warnings in 119.18s (0:01:59)
tests\test_linear4bit.py: 32 passed in 2.90s
tests\test_linear8bitlt.py: 18 passed in 14.60s
tests\test_optim.py: system crash after test done. (about 19 error, collected 177 items)
test_nvidia_transform: 8 failed, 88 passed, 536 deselected in 11.29s
test_gemv_4bit: 23 failed, 169 passed, 440 deselected in 615.68s (0:10:15)