OOM when use deepcompressor quantize llama2 w4a8 per-group with H100 80G #33

Open
Andy0422 opened this issue Dec 3, 2024 · 0 comments


Andy0422 commented Dec 3, 2024

Hi,

Quantizing Llama-2 with w4a8 per-channel completes without issue, but the w4a8 per-group configuration runs out of GPU memory during the smoothing stage on a single H100 80G:

  • Smoothing model.layers.0
    24-12-03 17:29:19 | D | - model.layers.0.self_attn.attn_k
    24-12-03 17:29:19 | D | + w: None
    24-12-03 17:29:19 | D | + x: None
    24-12-03 17:29:19 | D | + y: uint4
    24-12-03 17:29:19 | D | + tensor_type: TensorType.Outputs, objective: SearchBasedCalibObjective.OutputsError, granularity: SearchBasedCalibGranularity.Layer
    24-12-03 17:29:19 | D | + finished parsing calibration arguments, ram usage: 11.2
    24-12-03 17:29:19 | D | + x - AbsMax
    24-12-03 17:29:19 | D | + x = [min=0.3623, max=12.6562]
    24-12-03 17:29:19 | D | + y - AbsMax
    24-12-03 17:29:19 | D | + y = [min=0.2998, max=6.1250]
    24-12-03 17:29:19 | D | + finished reseting calibrator, ram usage: 11.2
    24-12-03 17:29:19 | E | === Error ===
    24-12-03 17:29:19 | E | Traceback (most recent call last):
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 384, in
    24-12-03 17:29:19 | E | main(config, logging_level=tools.logging.DEBUG)
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 352, in main
    24-12-03 17:29:19 | E | model = ptq(
    24-12-03 17:29:19 | E | ^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/ptq.py", line 190, in ptq
    24-12-03 17:29:19 | E | smooth_cache = smooth_llm(model, config, tokenizer=tokenizer)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    24-12-03 17:29:19 | E | return func(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/quant/smooth.py", line 197, in smooth_llm
    24-12-03 17:29:19 | E | smooth_llm_layer(
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    24-12-03 17:29:19 | E | return func(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/app/llm/quant/smooth.py", line 55, in smooth_llm_layer
    24-12-03 17:29:19 | E | smooth_cache[cache_key] = smooth_attention(
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    24-12-03 17:29:19 | E | return func(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/smooth.py", line 1088, in smooth_attention
    24-12-03 17:29:19 | E | ).calibrate(
    24-12-03 17:29:19 | E | ^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/smooth.py", line 837, in calibrate
    24-12-03 17:29:19 | E | return super().calibrate(
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/search.py", line 662, in calibrate
    24-12-03 17:29:19 | E | result = self._calibrate_opts(
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/calib/search.py", line 1033, in _calibrate_opts
    24-12-03 17:29:19 | E | y = eval_module(*ipt.args, **ipt.kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/home/wei.zhao/work/deepcompressor/deepcompressor/nn/struct/base.py", line 61, in call
    24-12-03 17:29:19 | E | return self.module(*args, **kwds)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    24-12-03 17:29:19 | E | return self._call_impl(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    24-12-03 17:29:19 | E | return forward_call(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    24-12-03 17:29:19 | E | output = module._old_forward(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py", line 636, in forward
    24-12-03 17:29:19 | E | query_states = self.q_proj(hidden_states)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    24-12-03 17:29:19 | E | return self._call_impl(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    24-12-03 17:29:19 | E | return forward_call(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    24-12-03 17:29:19 | E | output = module._old_forward(*args, **kwargs)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py", line 117, in forward
    24-12-03 17:29:19 | E | return F.linear(input, self.weight, self.bias)
    24-12-03 17:29:19 | E | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    24-12-03 17:29:19 | E | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacity of 79.21 GiB of which 515.19 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 1.65 GiB is allocated by PyTorch, and 851.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    24-12-03 17:29:19 | E |
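The final error message itself points at one possible mitigation: only ~1.65 GiB is allocated by PyTorch while ~852 MiB sits reserved but unallocated, which suggests allocator fragmentation rather than the model simply not fitting. A minimal sketch of applying the workaround the message recommends, before relaunching the quantization run (the `ptq.py` path is taken from the traceback above; any flags it needs are not shown here and would depend on your config):

```shell
# Ask PyTorch's CUDA caching allocator to use expandable segments,
# which reduces fragmentation of reserved-but-unallocated memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Then relaunch the same command that produced the OOM, e.g.:
# python deepcompressor/app/llm/ptq.py <your-config-args>
```

This does not reduce peak memory demand; if per-group calibration genuinely needs more memory than per-channel (it keeps finer-grained statistics), lowering the number or length of calibration samples in the config may also be necessary.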