OOM when use deepcompressor quantize llama2 w4a8 per-group with H100 80G #33

Open
Andy0422 opened this issue Dec 3, 2024 · 0 comments


Andy0422 commented Dec 3, 2024

Hi,

Quantizing Llama-2 with w4a8 per-channel completes without issue, but the w4a8 per-group configuration runs out of GPU memory during the smoothing stage on a single H100 80G:

  • Smoothing model.layers.0
    24-12-03 17:29:19 | D | - model.layers.0.self_attn.attn_k
    24-12-03 17:29:19 | D | + w: None
    24-12-03 17:29:19 | D | + x: None
    24-12-03 17:29:19 | D | + y: uint4
    24-12-03 17:29:19 | D | + tensor_type: TensorType.Outputs, objective: SearchBasedCalibObjective.OutputsError, granularity: SearchBasedCalibGranularity.Layer
    24-12-03 17:29:19 | D | + finished parsing calibration arguments, ram usage: 11.2
    24-12-03 17:29:19 | D | + x - AbsMax
    24-12-03 17:29:19 | D | + x = [min=0.3623, max=12.6562]
    24-12-03 17:29:19 | D | + y - AbsMax
    24-12-03 17:29:19 | D | + y = [min=0.2998, max=6.1250]
    24-12-03 17:29:19 | D | + finished reseting calibrator, ram usage: 11.2
    24-12-03 17:29:19 | E | === Error ===
    24-12-03 17:29:19 | E | Traceback (most recent call last):
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 384, in
    24-12-03 17:29:19 | E | main(config, logging_level=tools.logging.DEBUG)
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 352, in main
    24-12-03 17:29:19 | E | model = ptq(
    24-12-03 17:29:19 | E | ^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 190, in ptq
    24-12-03 17:29:19 | E | smooth_cache = smooth_llm(model, config, tokenizer=tokenizer)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    24-12-03 17:29:19 | E | return func(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/quant/smooth.py", line 197, in smooth_llm
    24-12-03 17:29:19 | E | smooth_llm_layer(
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    24-12-03 17:29:19 | E | return func(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/quant/smooth.py", line 55, in smooth_llm_layer
    24-12-03 17:29:19 | E | smooth_cache[cache_key] = smooth_attention(
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    24-12-03 17:29:19 | E | return func(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/smooth.py", line 1088, in smooth_attention
    24-12-03 17:29:19 | E | ).calibrate(
    24-12-03 17:29:19 | E | ^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/smooth.py", line 837, in calibrate
    24-12-03 17:29:19 | E | return super().calibrate(
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/search.py", line 662, in calibrate
    24-12-03 17:29:19 | E | result = self._calibrate_opts(
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/search.py", line 1033, in _calibrate_opts
    24-12-03 17:29:19 | E | y = eval_module(*ipt.args, **ipt.kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/nn/struct/base.py", line 61, in call
    24-12-03 17:29:19 | E | return self.module(*args, **kwds)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    24-12-03 17:29:19 | E | return self._call_impl(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    24-12-03 17:29:19 | E | return forward_call(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    24-12-03 17:29:19 | E | output = module._old_forward(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py", line 636, in forward
    24-12-03 17:29:19 | E | query_states = self.q_proj(hidden_states)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    24-12-03 17:29:19 | E | return self._call_impl(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    24-12-03 17:29:19 | E | return forward_call(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    24-12-03 17:29:19 | E | output = module._old_forward(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py", line 117, in forward
    24-12-03 17:29:19 | E | return F.linear(input, self.weight, self.bias)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacity of 79.21 GiB of which 515.19 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 1.65 GiB is allocated by PyTorch, and 851.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    24-12-03 17:29:19 | E |
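The final error message itself points at one possible mitigation: only ~1.65 GiB is allocated by PyTorch while ~852 MiB sits reserved but unallocated, which suggests allocator fragmentation rather than the model simply not fitting. A minimal sketch of applying the workaround the message recommends, before relaunching the quantization run (the `ptq.py` path is taken from the traceback above; any flags it needs are not shown here and would depend on your config):

```shell
# Ask PyTorch's CUDA caching allocator to use expandable segments,
# which reduces fragmentation of reserved-but-unallocated memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Then relaunch the same command that produced the OOM, e.g.:
# python deepcompressor/app/llm/ptq.py <your-config-args>
```

This does not reduce peak memory demand; if per-group calibration genuinely needs more memory than per-channel (it keeps finer-grained statistics), lowering the number or length of calibration samples in the config may also be necessary.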