
Gibberish Output from 4-bit EXL2 quantization #15

Closed
fgdfgfthgr-fox opened this issue Sep 13, 2023 · 9 comments

@fgdfgfthgr-fox

Hi there,
After a few hours of waiting, I successfully quantized the llama2-13b base model to EXL2, with an average of 4 bits per weight.
However, when I tried to run inference using the webui, I encountered this:
(screenshot: Capture 1)
I am using a Radeon VII GPU (AMD GPU, with ROCm 5.6).
Here is the terminal output during quantization:
quant outputs.txt
Here are the job.json and measurement.json files from the output folder after quantization:
convert_output.zip
The calibration data file used was wikitext-2-v1.

@fgdfgfthgr-fox
Author

(textgen) fgdfgfthgr@fgdfgfthgr-MS-7C95:/mnt/7018F20D48B6C548/exllamav2$ python test_inference.py -m '/mnt/7018F20D48B6C548/text-generation-webui/models/exl2_llama2_13b-4bit' -p "Once upon a time,"
Successfully preprocessed all matching files.
 -- Model: /mnt/7018F20D48B6C548/text-generation-webui/models/exl2_llama2_13b-4bit
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating (greedy sampling)...

Once upon a time,ttttt...t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t.t1t1t1t.t111111101010101010101010101010101010101010101010101010101010101010101

Prompt processed in 0.13 seconds, 5 tokens, 39.24 tokens/second
Response generated in 4.28 seconds, 128 tokens, 29.91 tokens/second

Similar issue when using the example inference script.

@turboderp
Member

turboderp commented Sep 13, 2023

I'm not sure what's going on here. The perplexity after the measurement step is way too high, and it's essentially produced by the full-precision model on a small sample of the calibration data. I can't imagine that would fail like this without Torch being completely broken in general on your system. (To be clear, I'm assuming Torch is not completely broken so it's something else.)

Which suggests there's a problem with loading the calibration data. And I've never seen that deprecation warning before, which is produced right where the text is read from the Parquet file. If somehow what it gets from the file is garbled, that might explain it. Which split did you use (test/train/val)? And what exact versions of pandas and fastparquet?
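
If you want to rule that out quickly, something along these lines should show whether the Parquet file decodes back to readable text. This is just a sketch; the file path and the "text" column name are assumptions based on the usual wikitext-2-v1 layout, so adjust them to your local copy.

import pandas as pd

# Sketch: confirm the calibration Parquet file reads back as plain prose.
# Path and column name are assumptions; adjust to your local file.
df = pd.read_parquet("wikitext-2-v1/0000.parquet", engine="fastparquet")
print(df.columns)                                          # expect a "text" column
print("\n".join(df["text"].astype(str).head(50))[:2000])   # should read as normal text, not garbage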

@fgdfgfthgr-fox
Author

Which split did you use (test/train/val)? And what exact versions of pandas and fastparquet?

I used the 0000.parquet from the training split. It's 6 MB in size.
pandas version: 2.1.0
fastparquet: 2023.8.0
pytorch: 2.2.0.dev20230912+rocm5.6

@fgdfgfthgr-fox
Author

Just checked whether my PyTorch is broken or not:

Exllama 1 with a GPTQ model: works fine
Exllama 2 with a GPTQ model: gibberish

Both running in the same conda environment.
So maybe the problem isn't in the quant?
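
A bare Torch-on-GPU check, independent of exllama, would be something like this sketch (a half-precision matmul compared against a CPU reference):

import torch

# Sketch: sanity-check Torch/ROCm itself, independent of exllama.
# The fp16 GPU result should match the fp32 CPU reference to within
# rounding, with no NaNs or garbage values.
print(torch.__version__, torch.cuda.is_available())   # ROCm builds report True here
a = torch.randn(512, 512, dtype=torch.float16, device="cuda")
b = torch.randn(512, 512, dtype=torch.float16, device="cuda")
gpu = (a @ b).float().cpu()
ref = a.float().cpu() @ b.float().cpu()
print((gpu - ref).abs().max())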
Also, I noticed something strange when loading the model (exllama 2, using either ooba's webui or the example inference script). When I try to load the model for the first time, it fails and outputs the following:

2023-09-14 23:28:13 INFO:Loading gptq-llama2-13b-32g...
Traceback (most recent call last):
  File "/mnt/7018F20D48B6C548/text-generation-webui/modules/ui_model_menu.py", line 194, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "/mnt/7018F20D48B6C548/text-generation-webui/modules/models.py", line 77, in load_model
    output = load_func_map[loader](model_name)
  File "/mnt/7018F20D48B6C548/text-generation-webui/modules/models.py", line 335, in ExLlamav2_loader
    from modules.exllamav2 import Exllamav2Model
  File "/mnt/7018F20D48B6C548/text-generation-webui/modules/exllamav2.py", line 5, in <module>
    from exllamav2 import (
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/exllamav2/__init__.py", line 3, in <module>
    from exllamav2.model import ExLlamaV2
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/exllamav2/model.py", line 12, in <module>
    from exllamav2.linear import ExLlamaV2Linear
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/exllamav2/linear.py", line 4, in <module>
    from exllamav2 import ext
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/exllamav2/ext.py", line 121, in <module>
    exllamav2_ext = load \
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1691, in _jit_compile
    hipify_result = hipify_python.hipify(
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py", line 1106, in hipify
    path.is_file()
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/pathlib.py", line 1322, in is_file
    return S_ISREG(self.stat().st_mode)
  File "/home/fgdfgfthgr/anaconda3/envs/textgen/lib/python3.10/pathlib.py", line 1097, in stat
    return self._accessor.stat(self, follow_symlinks=follow_symlinks)
PermissionError: [Errno 13] Permission denied: '/proc/1/cwd'

But then if I just click reload, it loads in just a few seconds without any error. Weird, considering my model file is on an HDD and loading should usually take quite a while.

@turboderp
Member

I'm starting to think it might be ROCm related after all. I tried switching to all the same library versions as you, and using the same dataset I was able to reproduce the deprecation warning. It seems they've made some changes to pandas recently, and I've updated the code to get rid of the warning, but the data was still being loaded and tokenized correctly regardless.

My next guess would be that there's something unusual about the model you were converting, but given that it's also failing to run inference on GPTQ models, it must be a ROCm-related issue after all.

As for the second error... well, it might load the model very quickly the second time if it got cached by the OS the first time around, so that's not too strange on its own. But the permission error is weird; I've never seen that before. It looks like more evidence that there's something about the new extension that isn't playing well with ROCm. Maybe @ardfork or someone else with more ROCm experience has seen this sort of error before?

I guess one thing always worth trying is deleting the extension cache from ~/.cache/torch_extensions.
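
For example (a sketch; ~/.cache/torch_extensions is the default location, and the TORCH_EXTENSIONS_DIR environment variable overrides it if set):

import shutil
from pathlib import Path

# Sketch: remove the JIT-built extension cache so exllamav2_ext gets
# rebuilt from scratch on the next load.
cache = Path.home() / ".cache" / "torch_extensions"
if cache.exists():
    shutil.rmtree(cache)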

@ardfork
Contributor

ardfork commented Sep 14, 2023

Also, I noticed something strange when loading the model (exllama 2, using either ooba's webui or the example inference script). When I try to load the model the first time, it will fail and output the following

This error happens on 0.0.0 (before my patch): because extra_include_paths was not correctly set, it tries to add every file on your computer to be hipified, and it only stops because it doesn't have permission on some file.
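
Roughly speaking, the relevant piece is what gets passed to the JIT load call. As I understand it, on ROCm torch.utils.cpp_extension.load() hands extra_include_paths to the hipify step, so each entry should be an explicit path inside the extension's own source tree. A sketch only, with illustrative file names (not the real exllamav2 ext.py):

import os
from torch.utils.cpp_extension import load

# Sketch with hypothetical file names: keep extra_include_paths pointed at
# the extension's own headers so hipify doesn't walk outside the project.
ext_dir = os.path.dirname(os.path.abspath(__file__))
exllamav2_ext = load(
    name="exllamav2_ext",
    sources=[os.path.join(ext_dir, "ext.cpp")],              # hypothetical source file
    extra_include_paths=[os.path.join(ext_dir, "include")],  # hypothetical include dir
    verbose=True,
)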

But then if I just click reload, it will load in just a few seconds without any error.

As for why it works afterwards, I don't have the answer; I'm probably missing some information, or it's some weird ooba behavior.

I'm starting to think it might be ROCm related after all.

I don't have the time, the will, or the GPU power to try to reproduce this issue. But since inference and perplexity work correctly, there shouldn't be a reason for quantization to fail.

@ardfork
Contributor

ardfork commented Sep 14, 2023

Aren't you using the same GPU as #33? Issue might be the same.

@fgdfgfthgr-fox
Author

Aren't you using the same GPU as #33? Issue might be the same.

Yeah... very likely the same issue.

@turboderp
Member

Closing this as it appears to be stale. If there are still issues on ROCm, please reopen this or submit a new issue.
