w4a8 per-channel quantization runs fine, but w4a8 per-group quantization hits a CUDA out-of-memory (OOM) error during the smoothing stage:
Smoothing model.layers.0
24-12-03 17:29:19 | D | - model.layers.0.self_attn.attn_k
24-12-03 17:29:19 | D | + w: None
24-12-03 17:29:19 | D | + x: None
24-12-03 17:29:19 | D | + y: uint4
24-12-03 17:29:19 | D | + tensor_type: TensorType.Outputs, objective: SearchBasedCalibObjective.OutputsError, granularity: SearchBasedCalibGranularity.Layer
24-12-03 17:29:19 | D | + finished parsing calibration arguments, ram usage: 11.2
24-12-03 17:29:19 | D | + x - AbsMax
24-12-03 17:29:19 | D | + x = [min=0.3623, max=12.6562]
24-12-03 17:29:19 | D | + y - AbsMax
24-12-03 17:29:19 | D | + y = [min=0.2998, max=6.1250]
24-12-03 17:29:19 | D | + finished reseting calibrator, ram usage: 11.2
24-12-03 17:29:19 | E | === Error ===
24-12-03 17:29:19 | E | Traceback (most recent call last):
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 384, in
24-12-03 17:29:19 | E | main(config, logging_level=tools.logging.DEBUG)
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 352, in main
24-12-03 17:29:19 | E | model = ptq(
24-12-03 17:29:19 | E | ^^^^
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 190, in ptq
24-12-03 17:29:19 | E | smooth_cache = smooth_llm(model, config, tokenizer=tokenizer)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
24-12-03 17:29:19 | E | return func(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/quant/smooth.py", line 197, in smooth_llm
24-12-03 17:29:19 | E | smooth_llm_layer(
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
24-12-03 17:29:19 | E | return func(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/quant/smooth.py", line 55, in smooth_llm_layer
24-12-03 17:29:19 | E | smooth_cache[cache_key] = smooth_attention(
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
24-12-03 17:29:19 | E | return func(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/smooth.py", line 1088, in smooth_attention
24-12-03 17:29:19 | E | ).calibrate(
24-12-03 17:29:19 | E | ^^^^^^^^^^
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/smooth.py", line 837, in calibrate
24-12-03 17:29:19 | E | return super().calibrate(
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/search.py", line 662, in calibrate
24-12-03 17:29:19 | E | result = self._calibrate_opts(
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/search.py", line 1033, in _calibrate_opts
24-12-03 17:29:19 | E | y = eval_module(*ipt.args, **ipt.kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/nn/struct/base.py", line 61, in call
24-12-03 17:29:19 | E | return self.module(*args, **kwds)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
24-12-03 17:29:19 | E | return self._call_impl(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
24-12-03 17:29:19 | E | return forward_call(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
24-12-03 17:29:19 | E | output = module._old_forward(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py", line 636, in forward
24-12-03 17:29:19 | E | query_states = self.q_proj(hidden_states)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
24-12-03 17:29:19 | E | return self._call_impl(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
24-12-03 17:29:19 | E | return forward_call(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
24-12-03 17:29:19 | E | output = module._old_forward(*args, **kwargs)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py", line 117, in forward
24-12-03 17:29:19 | E | return F.linear(input, self.weight, self.bias)
24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
24-12-03 17:29:19 | E | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacity of 79.21 GiB of which 515.19 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 1.65 GiB is allocated by PyTorch, and 851.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
24-12-03 17:29:19 | E |
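The OOM message itself suggests one mitigation: enabling expandable segments in the CUDA caching allocator, since ~852 MiB is reserved but unallocated (fragmentation) while only ~515 MiB is free for a 1 GiB request. A minimal sketch of applying that setting from Python, before torch initializes the CUDA allocator (setting it in the shell environment before launching `ptq.py` works equally well):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the CUDA allocator is
# initialized (i.e. before the first CUDA tensor is created, ideally
# before `import torch`); changing it afterwards has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

With expandable segments, the allocator can grow existing memory segments instead of requiring a single contiguous free block, which often avoids fragmentation-induced OOMs like the one above. This is only a sketch of the workaround hinted at in the error message; whether it resolves the per-group case here depends on how much extra memory the per-group search in `_calibrate_opts` actually needs.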