quantkit

A tool for downloading and converting HuggingFace models without drama.

Install

If you're on a machine with an NVIDIA/CUDA GPU and want AWQ/GPTQ support:

pip3 install llm-quantkit[cuda]

Otherwise, the default install works.

pip3 install llm-quantkit

Requirements

If you need a device specific torch, install it first.

This project depends on torch, awq, exl2, gptq, and hqq libraries.
Some of these dependencies do not support Python 3.12 yet.
Supported Pythons: 3.8, 3.9, 3.10, and 3.11

Usage

Usage: quantkit [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  download    Download model from huggingface.
  safetensor  Download and/or convert a pytorch model to safetensor format.
  awq         Download and/or convert a model to AWQ format.
  exl2        Download and/or convert a model to EXL2 format.
  gguf        Download and/or convert a model to GGUF format.
  gptq        Download and/or convert a model to GPTQ format.
  hqq         Download and/or convert a model to HQQ format.
  compressor  Download and/or convert a model with llm-compressor.

The first argument after command should be an HF repo id (mistralai/Mistral-7B-v0.1) or a local directory with model files in it already.

The download command defaults to downloading into the HF cache and producing symlinks in the output dir, but there is a --no-cache option which places the model files in the output directory.

AWQ defaults to 4 bits, group size 128, zero-point True.
GPTQ defaults are 4 bits, group size 128, activation-order False.
EXL2 defaults to 8 head bits but there is no default bitrate.
GGUF defaults to no imatrix but there is no default quant-type.
HQQ defaults to 4 bits, group size 64, zero_point=True.
llm-compressor defaults to w8a8 int8.

Examples

Download a model from HF and don't use HF cache:

quantkit download teknium/Hermes-Trismegistus-Mistral-7B --no-cache

Only download the safetensors version of a model (useful for models that have torch and safetensor):

quantkit download mistralai/Mistral-7B-v0.1 --no-cache --safetensors-only -out mistral7b

Download from specific revision of a huggingface repo:

uantkit download turboderp/TinyLlama-1B-32k-exl2 --branch 6.0bpw --no-cache -out TinyLlama-1B-32k-exl2-b6

Download and convert a model to safetensor, deleting the original pytorch bins:

quantkit safetensor migtissera/Tess-10.7B-v1.5b --delete-original

Download and convert a model to GGUF (Q5_K):

quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-Q5_K.gguf Q5_K

Download and convert a model to GGUF using an imatrix, offloading 200 layers:

quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-IQ4_XS.gguf IQ4_XS --built-in-imatrix -ngl 200

Download and convert a model to AWQ:

quantkit awq mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-AWQ

Convert a model to GPTQ (4 bits / group-size 32):

quantkit gptq mistral7b -out Mistral-7B-v0.1-GPTQ -b 4 --group-size 32

Convert a model to exllamav2:

quantkit exl2 mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-exl2-b8-h8 -b 8 -hb 8

Convert a model to HQQ:

quantkit hqq mistralai/Mistral-7B-v0.1 -out Mistral-7B-HQQ-w4-gs64

Hardware Requirements

Here's what has worked for me in testing. Drop a PR or Issue with updates for what is possible on various size cards.
GGUF conversion doesn't need a GPU except for iMatrix and Exllamav2 requires that the largest layer fits on single GPU.

Model Size	Quant	VRAM	Successful
7B	AWQ	24GB	✅
7B	EXL2	24GB	✅
7B	GGUF	24GB	✅
7B	GPTQ	24GB	✅
7B	HQQ	24GB	✅
13B	AWQ	24GB	✅
13B	EXL2	24GB	✅
13B	GGUF	24GB	✅
13B	GPTQ	24GB	❌
13B	HQQ	24GB	?
34B	AWQ	24GB	❌
34B	EXL2	24GB	✅
34B	GGUF	24GB	✅
34B	GPTQ	24GB	❌
34B	HQQ	24GB	?
70B	AWQ	24GB	❌
70B	EXL2	24GB	✅
70B	GGUF	24GB	✅
70B	GPTQ	24GB	❌
70B	HQQ	24GB	?

Notes

Still in beta. Llama.cpp offloading is probably not going to work on your platform unless you uninstall llama-cpp-conv and reinstall it with the proper build flags. Look at the llama-cpp-python documentation and follow the relevant command but replace llama-cpp-python with llama-cpp-conv.

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
.github/workflows		.github/workflows
quantkit		quantkit
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

quantkit

Install

Requirements

Usage

Examples

Hardware Requirements

Notes

About

Releases

Packages

Contributors 2

Languages

License

xhedit/quantkit

Folders and files

Latest commit

History

Repository files navigation

quantkit

Install

Requirements

Usage

Examples

Hardware Requirements

Notes

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages