
Gibberish Output from 4-bit EXL2 quantization #15

Closed
fgdfgfthgr-fox opened this issue Sep 13, 2023 · 9 comments

@fgdfgfthgr-fox

Hi there,
After a few hours of waiting, I successfully quantized the llama2-13b base model to EXL2, with an average of 4 bits per weight.
However, when I tried to run inference using the webui, I encountered this:
(screenshot: Capture 1)
I am using a Radeon VII GPU (AMD GPU, with ROCm 5.6).
Here is the terminal output during quantization:
quant outputs.txt
Here are the job.json and measurement.json files from the output folder after quantization:
convert_output.zip
The calibration data file used was wikitext-2-v1.

@fgdfgfthgr-fox
Author

(textgen) fgdfgfthgr@fgdfgfthgr-MS-7C95:/mnt/7018F20D48B6C548/exllamav2$ python test_inference.py -m '/mnt/7018F20D48B6C548/text-generation-webui/models/exl2_llama2_13b-4bit' -p "Once upon a time,"
Successfully preprocessed all matching files.
 -- Model: /mnt/7018F20D48B6C548/text-generation-webui/models/exl2_llama2_13b-4bit
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating (greedy sampling)...

Once upon a time,ttttt...t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t1t1t1t.t111111101010101010101010101010101010101010101010101010101010101010101

Prompt processed in 0.13 seconds, 5 tokens, 39.24 tokens/second
Response generated in 4.28 seconds, 128 tokens, 29.91 tokens/second

Similar issue when using the example inference script.

@turboderp
Member

turboderp commented Sep 13, 2023

I'm not sure what's going on here. The perplexity after the measurement step is way too high, and it's essentially produced by the full-precision model on a small sample of the calibration data. I can't imagine that would fail like this without Torch being completely broken in general on your system. (To be clear, I'm assuming Torch is not completely broken so it's something else.)

Which suggests there's a problem with loading the calibration data. And I've never seen that deprecation warning before, which is produced right where the text is read from the Parquet file. If somehow what it gets from the file is garbled, that might explain it. Which split did you use (test/train/val)? And what exact versions of pandas and fastparquet?
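
If you want to rule that out quickly, something along these lines should show whether the Parquet file decodes back to readable text. This is just a sketch; the file path and the "text" column name are assumptions based on the usual wikitext-2-v1 layout, so adjust them to your local copy.

import pandas as pd

# Sketch: confirm the calibration Parquet file reads back as plain prose.
# Path and column name are assumptions; adjust to your local file.
df = pd.read_parquet("wikitext-2-v1/0000.parquet", engine="fastparquet")
print(df.columns)                                          # expect a "text" column
print("\n".join(df["text"].astype(str).head(50))[:2000])   # should read as normal text, not garbage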

@fgdfgfthgr-fox
Author

Which split did you use (test/train/val)? And what exact versions of pandas and fastparquet?

I used the 0000.parquet from the training split. It's 6 MB in size.
pandas version: 2.1.0
fastparquet: 2023.8.0
pytorch: 2.2.0.dev20230912+rocm5.6

@fgdfgfthgr-fox
Author

Just checked whether my PyTorch is broken or not:

Exllama 1 with a GPTQ model: works fine
Exllama 2 with a GPTQ model: gibberish

Both running in the same conda environment.
So maybe the problem isn't in the quant?
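
A bare Torch-on-GPU check, independent of exllama, would be something like this sketch (a half-precision matmul compared against a CPU reference):

import torch

# Sketch: sanity-check Torch/ROCm itself, independent of exllama.
# The fp16 GPU result should match the fp32 CPU reference to within
# rounding, with no NaNs or garbage values.
print(torch.__version__, torch.cuda.is_available())   # ROCm builds report True here
a = torch.randn(512, 512, dtype=torch.float16, device="cuda")
b = torch.randn(512, 512, dtype=torch.float16, device="cuda")
gpu = (a @ b).float().cpu()
ref = a.float().cpu() @ b.float().cpu()
print((gpu - ref).abs().max())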
Also, I noticed something strange when loading the model (exllama 2, using either ooba's webui or the example inference script). When I try to load the model for the first time, it fails and outputs the following:

2023-09-14 23:28:13 INFO:Loading gptq-llama2-13b-32g...
Traceback (most recent call last):
  File "/mnt/7018F20D48B6C548/text-generation-webui/modules/ui_model_menu.py", line 194, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "/mnt/7018F20D48B6C548/text-generation-webui/modules/models.py", line 77, in load_model
    output = load_func_map[loader](model_name)
  File "/mnt/7018F20D48B6C548/text-generation-webui/modules/models.py", line 335, in ExLlamav2_loader
    from modules.exllamav2 import Exllamav2Model
  File "/mnt/7018F20D48B6C548/text-generation-webui/modules/exllamav2.py", line 5, in <module>
    from exllamav2 import (
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/exllamav2/__init__.py", line 3, in <module>
    from exllamav2.model import ExLlamaV2
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/exllamav2/model.py", line 12, in <module>
    from exllamav2.linear import ExLlamaV2Linear
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/exllamav2/linear.py", line 4, in <module>
    from exllamav2 import ext
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/exllamav2/ext.py", line 121, in <module>
    exllamav2_ext = load \
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1691, in _jit_compile
    hipify_result = hipify_python.hipify(
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py", line 1106, in hipify
    path.is_file()
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/pathlib.py", line 1322, in is_file
    return S_ISREG(self.stat().st_mode)
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/pathlib.py", line 1097, in stat
    return self._accessor.stat(self, follow_symlinks=follow_symlinks)
PermissionError: [Errno 13] Permission denied: '/proc/1/cwd'

But then if I just click reload, it loads in just a few seconds without any error. Weird, considering my model file is on an HDD and loading should usually take quite a while.

@turboderp
Member

I'm starting to think it might be ROCm related after all. I tried switching to all the same library versions as you, and using the same dataset I was able to reproduce the deprecation warning. It seems they've made some changes to pandas recently, and I've updated the code to get rid of the warning, but the data was still being loaded and tokenized correctly regardless.

My next guess would be that there's something unusual about the model you were converting, but given that it's also failing to run inference on GPTQ models, it must be a ROCm-related issue after all.

As for the second error... well, it might load the model very quickly the second time if it got cached by the OS the first time around, so that's not too strange on its own. But the permission error is weird; I've never seen that before. It looks like more evidence that there's something about the new extension that isn't playing well with ROCm. Maybe @ardfork or someone else with more ROCm experience has seen this sort of error before?

I guess one thing always worth trying is deleting the extension cache from ~/.cache/torch_extensions.
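
For example (a sketch; ~/.cache/torch_extensions is the default location, and the TORCH_EXTENSIONS_DIR environment variable overrides it if set):

import shutil
from pathlib import Path

# Sketch: remove the JIT-built extension cache so exllamav2_ext gets
# rebuilt from scratch on the next load.
cache = Path.home() / ".cache" / "torch_extensions"
if cache.exists():
    shutil.rmtree(cache)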

@ardfork
Contributor

ardfork commented Sep 14, 2023

Also, I noticed something strange when loading the model (exllama 2, using either ooba's webui or the example inference script). When I try to load the model the first time, it will fail and output the following

This error happens on 0.0.0 (before my patch): because extra_include_paths was not correctly set, it tries to add every file on your computer to be hipified, and it only stops because it doesn't have permission on some file.
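
Roughly speaking, the relevant piece is what gets passed to the JIT load call. As I understand it, on ROCm torch.utils.cpp_extension.load() hands extra_include_paths to the hipify step, so each entry should be an explicit path inside the extension's own source tree. A sketch only, with illustrative file names (not the real exllamav2 ext.py):

import os
from torch.utils.cpp_extension import load

# Sketch with hypothetical file names: keep extra_include_paths pointed at
# the extension's own headers so hipify doesn't walk outside the project.
ext_dir = os.path.dirname(os.path.abspath(__file__))
exllamav2_ext = load(
    name="exllamav2_ext",
    sources=[os.path.join(ext_dir, "ext.cpp")],              # hypothetical source file
    extra_include_paths=[os.path.join(ext_dir, "include")],  # hypothetical include dir
    verbose=True,
)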

But then if I just click reload, it will load in just a few seconds without any error.

As for why it works afterwards, I don't have the answer; I'm probably missing some information, or it's some weird ooba behavior.

I'm starting to think it might be ROCm related after all.

I don't have the time, the will, or the GPU power to try to reproduce this issue. But since inference and perplexity work correctly, there shouldn't be a reason for quantization to fail.

@ardfork
Contributor

ardfork commented Sep 14, 2023

Aren't you using the same GPU as #33? Issue might be the same.

@fgdfgfthgr-fox
Author

Aren't you using the same GPU as #33? Issue might be the same.

Yeah... very likely the same issue.

@turboderp
Member

Closing this as it appears to be stale. If there are still issues on ROCm, please reopen this or submit a new issue.
