Merge branch 'TimDettmers:main' into galore
matthewdouglas authored Apr 5, 2024
2 parents eceed12 + 0c64a0d commit 16cc220
Showing 14 changed files with 170 additions and 24 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -357,6 +357,10 @@ Bug fixes:
- Addressed a race condition in kEstimateQuantiles, enhancing the reliability of quantile estimation in concurrent environments (@pnunna93, #1061).
- Fixed various minor issues, including typos in code comments and documentation, to improve code clarity and prevent potential confusion (@Brian Vaughan, #1063).

#### Backwards Compatibility
- After upgrading from `v0.42` to `v0.43`, when using 4bit quantization, models may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the code. For anyone interested in the details, [see this comment](https://github.com/TimDettmers/bitsandbytes/discussions/1094#discussioncomment-8984069).
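
As context, differences of this kind would surface in the 4-bit quantize/dequantize round-trip; here is a minimal sketch (illustration only, assuming a CUDA device):

```py
import torch
import bitsandbytes.functional as F

A = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

# 4-bit (NF4) quantize/dequantize round-trip; this is where version-to-version
# differences of roughly the 2nd decimal place would show up.
q, quant_state = F.quantize_4bit(A, quant_type="nf4")
A_dq = F.dequantize_4bit(q, quant_state, quant_type="nf4")
print((A - A_dq).abs().mean())
```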


#### Internal and Build System Enhancements:
- Implemented several enhancements to the internal and build systems, including adjustments to the CI workflows, portability improvements, and build artifact management. These changes contribute to a more robust and flexible development process, ensuring the library's ongoing quality and maintainability (@rickardp, @akx, @wkpark, @matthewdouglas; #949, #1053, #1045, #1037).

2 changes: 2 additions & 0 deletions README.md
@@ -1,5 +1,7 @@
# `bitsandbytes`

[![Downloads](https://static.pepy.tech/badge/bitsandbytes)](https://pepy.tech/project/bitsandbytes) [![Downloads](https://static.pepy.tech/badge/bitsandbytes/month)](https://pepy.tech/project/bitsandbytes) [![Downloads](https://static.pepy.tech/badge/bitsandbytes/week)](https://pepy.tech/project/bitsandbytes)

The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.

The library includes quantization primitives for 8-bit & 4-bit operations through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit`, and 8-bit optimizers through the `bitsandbytes.optim` module.
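
For illustration, a minimal sketch of how these modules are typically used (assuming a CUDA device and hypothetical layer sizes; not a complete training setup):

```py
import torch
import bitsandbytes as bnb

in_features, out_features = 1024, 4096  # hypothetical sizes

# 8-bit (LLM.int8()) and 4-bit (NF4) linear layers; weights are quantized when moved to the GPU.
linear_8bit = bnb.nn.Linear8bitLt(in_features, out_features, has_fp16_weights=False).cuda()
linear_4bit = bnb.nn.Linear4bit(in_features, out_features, quant_type="nf4").cuda()

# 8-bit optimizer as a drop-in replacement for torch.optim.Adam.
optimizer = bnb.optim.Adam8bit(linear_8bit.parameters(), lr=1e-4)

x = torch.randn(8, in_features, device="cuda", dtype=torch.float16)
out = linear_8bit(x)
```
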
2 changes: 1 addition & 1 deletion bitsandbytes/diagnostics/cuda.py
@@ -59,7 +59,7 @@ def find_cuda_libraries_in_path_list(paths_list_candidate: str) -> Iterable[Path
            for pth in dir.glob(lib_pattern):
                if pth.is_file():
                    yield pth
        except PermissionError:
        except (OSError, PermissionError):
            pass


7 changes: 4 additions & 3 deletions bitsandbytes/functional.py
@@ -1087,11 +1087,12 @@ def get_4bit_type(typename, device=None, blocksize=64):
    if data is None:
        raise NotImplementedError(f"Typename {typename} not supported")

    data = Tensor(data)
    data /= data.abs().max()
    data = torch.tensor(data, device=device)
    data.div_(data.abs().max())

    assert data.numel() == 16

    return data.to(device)
    return data


def quantize_fp4(
2 changes: 1 addition & 1 deletion csrc/ops.cu
@@ -58,7 +58,7 @@ template <typename T, int STOCHASTIC, int DATA_TYPE> void quantizeBlockwise(floa
  num_blocks = n % blocksize == 0 ? num_blocks : num_blocks + 1;

  if(blocksize == 4096)
    kQuantizeBlockwise<T, 4096, 4, STOCHASTIC, 0><<<num_blocks, 1024>>>(code, A, absmax, out, rand, rand_offset, n);
    kQuantizeBlockwise<T, 4096, 4, STOCHASTIC, DATA_TYPE><<<num_blocks, 1024>>>(code, A, absmax, out, rand, rand_offset, n);
  else if(blocksize == 2048)
    kQuantizeBlockwise<T, 2048, 4, 0, DATA_TYPE><<<num_blocks, 512>>>(code, A, absmax, out, rand, rand_offset, n);
  else if(blocksize == 1024)
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -12,6 +12,8 @@
    title: 8-bit optimizers
  - local: algorithms
    title: Algorithms
  - local: fsdp_qlora
    title: FSDP-QLoRA
  - local: integrations
    title: Integrations
  - local: errors
106 changes: 106 additions & 0 deletions docs/source/fsdp_qlora.md
@@ -0,0 +1,106 @@
# FSDP-QLoRA

FSDP-QLoRA combines data parallelism (FSDP enables sharding model parameters, optimizer states, and gradients across GPUs), 4-bit quantization, and LoRA to train LLMs up to 70B parameters on a dual 24GB GPU system. This technique was released by [Answer.AI](https://www.answer.ai/posts/2024-03-06-fsdp-qlora) in collaboration with bitsandbytes to make training LLMs more efficient and accessible for everyone.

This guide provides a brief overview of how bitsandbytes supports storing quantized weights to enable FSDP-QLoRA, and how to run training with the Hugging Face libraries.

> [!TIP]
> Other changes required for bitsandbytes to support FSDP-QLoRA, such as reconstructing the weights from the quantization metadata and preventing already quantized weights from being quantized again when they're moved from CPU to GPU, are documented in this [Pull Request](https://github.com/TimDettmers/bitsandbytes/pull/970) and described in the [Enabling 70B Finetuning on Consumer GPUs](https://www.answer.ai/posts/2024-03-14-fsdp-qlora-deep-dive) blog post. We highly recommend reading these resources for a better understanding of FSDP-QLoRA!

## Quantized data storage

FSDP only supports sharding float data types, which can be problematic because quantized weights are typically stored as integer data types (uint8). bitsandbytes doesn't have this problem because it uses `StoreChar` to read and write quantized weights regardless of the data type used for storage. This makes it simple to add a `quant_storage` parameter to the [`~nn.Linear4bit`] and [`~nn.Params4bit`] classes and set it to `torch.uint8` to maintain backward compatibility with the codebase.

```py
import torch
import bitsandbytes as bnb

model = bnb.nn.Linear4bit(
    input_features,
    output_features,
    quant_type="fp4",
    quant_storage=torch.uint8,
)
```

With the `quant_storage` parameter, you can select any of the FSDP-supported data types to shard [`~nn.Linear4bit`] with, such as bfloat16, float16, or float32.
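
For instance, a minimal sketch (reusing the hypothetical `input_features` and `output_features` placeholders from the example above) that stores the quantized weights as bfloat16 so FSDP can shard them:

```py
model = bnb.nn.Linear4bit(
    input_features,
    output_features,
    quant_type="nf4",
    quant_storage=torch.bfloat16,
)
```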

## Training

bitsandbytes is deeply integrated with the Hugging Face ecosystem, making it easy to use with libraries like [Transformers](https://hf.co/docs/transformers), [PEFT](https://hf.co/docs/peft), and [TRL](https://hf.co/docs/trl).

Before you begin, make sure you have the latest libraries installed.

```bash
pip install -U bitsandbytes accelerate transformers peft trl
```

> [!TIP]
> PEFT provides a configuration file ([fsdp_config_qlora.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/fsdp_config_qlora.yaml)), launch command ([run_peft_qlora_fsdp.sh](https://github.com/huggingface/peft/blob/main/examples/sft/run_peft_qlora_fsdp.sh)), and training script ([train.py](https://github.com/huggingface/peft/blob/main/examples/sft/train.py)) for FSDP-QLoRA. To learn more, check out the [Use PEFT QLoRA and FSDP for finetuning large models on multiple GPUs](https://huggingface.co/docs/peft/main/en/accelerate/fsdp#use-peft-qlora-and-fsdp-for-finetuning-large-models-on-multiple-gpus) documentation.

The important change that enables FSDP-QLoRA training is the `bnb_4bit_quant_storage` parameter in the [`~transformers.BitsAndBytesConfig`] class. This allows you to set the storage data type of the quantized weights to a float data type.

```py
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)
```

Pass the [`~transformers.BitsAndBytesConfig`] to a model to set it up for FSDP-QLoRA. You should set the `torch_dtype` parameter to match `bnb_4bit_quant_storage` so that the [`~nn.Linear4bit`] layers are wrapped identically to the `Linear` layers. If the storage types do not match, then each [`~nn.Linear4bit`] layer is wrapped individually.

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b",
quantization_config=bnb_config,
torch_dtype=torch.bfloat16,
)
```

Configure the [`~peft.LoraConfig`] class for QLoRA training by setting `target_modules="all-linear"`.

```py
from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)
```

Now you can pass everything to the [`~trl.SFTTrainer`] for training.

```py
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)
trainer.train()
```

## Resources

To learn more about FSDP and QLoRA, check out the following resources:

- The [AnswerDotAI/fsdp_qlora](https://github.com/AnswerDotAI/fsdp_qlora) repository.
- The introductory [You can now train a 70b language model at home](https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html) blog post by Answer.AI.
- For an introduction to FSDP, read the [Introducing PyTorch Fully Sharded Data Parallel (FSDP) API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api) blog post.
- For more details about QLoRA, take a look at the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
2 changes: 1 addition & 1 deletion docs/source/installation.mdx
@@ -84,7 +84,7 @@ Then locally install the CUDA version you need with this script from bitsandbyte
```bash
wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/install_cuda.sh
# Syntax cuda_install CUDA_VERSION INSTALL_PREFIX EXPORT_TO_BASH
# CUDA_VERSION in {110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121, 122, 123}
# CUDA_VERSION in {110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121, 122, 123, 124}
# EXPORT_TO_BASH in {0, 1} with 0=False and 1=True

# For example, the following installs CUDA 11.7 to ~/local/cuda-11.7 and exports the path to your .bashrc
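# (Illustrative sketch; the original example command is truncated in this excerpt.)
bash install_cuda.sh 117 ~/local 1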
1 change: 1 addition & 0 deletions install_cuda.py
@@ -17,6 +17,7 @@
"121": "https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run",
"122": "https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run",
"123": "https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run",
"124": "https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run",
}


7 changes: 5 additions & 2 deletions install_cuda.sh
@@ -11,7 +11,7 @@ URL120=https://developer.download.nvidia.com/compute/cuda/12.0.1/local_installer
URL121=https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
URL122=https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run
URL123=https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run

URL124=https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run

CUDA_VERSION=$1
BASE_PATH=$2
@@ -57,8 +57,11 @@ if [[ -n "$CUDA_VERSION" ]]; then
elif [[ "$CUDA_VERSION" -eq "123" ]]; then
URL=$URL123
FOLDER=cuda-12.3
elif [[ "$CUDA_VERSION" -eq "124" ]]; then
URL=$URL124
FOLDER=cuda-12.4
else
echo "argument error: No cuda version passed as input. Choose among versions 92 to 123"
echo "argument error: No cuda version passed as input. Choose among versions 110 to 124"
fi
else
echo "argument error: No cuda version passed as input. Choose among versions 92 to 123"
8 changes: 4 additions & 4 deletions requirements-ci.txt
@@ -1,6 +1,6 @@
# Requirements used for GitHub actions
pytest==7.2.2
einops==0.6.0
lion-pytorch==0.0.6
pytest==8.1.1
einops==0.7.0
lion-pytorch==0.1.4
scipy==1.10.1; python_version < "3.9"
scipy==1.11.4; python_version >= "3.9"
scipy==1.12.0; python_version >= "3.9"
14 changes: 7 additions & 7 deletions requirements-dev.txt
@@ -1,9 +1,9 @@
# Requirements used for local development
setuptools>=63
pytest~=7.2.2
einops~=0.6.0
wheel~=0.40.0
lion-pytorch~=0.0.6
scipy~=1.11.4
pandas~=2.2.0
matplotlib~=3.8.2
pytest~=8.1.1
einops~=0.7.0
wheel~=0.43.0
lion-pytorch~=0.1.4
scipy~=1.12.0
pandas~=2.2.1
matplotlib~=3.8.3
9 changes: 9 additions & 0 deletions tests/conftest.py
@@ -1,3 +1,5 @@
import gc

import pytest
import torch

@@ -20,6 +22,13 @@ def pytest_runtest_call(item):
        raise


@pytest.hookimpl(trylast=True)
def pytest_runtest_teardown(item, nextitem):
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


@pytest.fixture(scope="session")
def requires_cuda() -> bool:
    cuda_available = torch.cuda.is_available()
28 changes: 23 additions & 5 deletions tests/test_functional.py
@@ -1928,7 +1928,9 @@ def test_bench_dequantization():


@pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16], ids=describe_dtype)
def test_fp4_quant(dtype):
@pytest.mark.parametrize("quant_type", ["fp4", "nf4"])
@pytest.mark.parametrize("blocksize", [64, 128, 256, 512, 1024, 2048, 4096])
def test_4bit_quant(dtype, quant_type, blocksize):
vals = list(product([0, 1], repeat=4))

code = {}
Expand All @@ -1953,17 +1955,33 @@ def test_fp4_quant(dtype):
code[idx] = result

A1 = torch.randn(1024, 1024, device="cuda", dtype=dtype)
qa, SA = F.quantize_fp4(A1, blocksize=64)
A2 = F.dequantize_fp4(qa, SA)
qa, SA = F.quantize_4bit(A1, blocksize=blocksize, quant_type=quant_type)
A2 = F.dequantize_4bit(qa, SA, blocksize=blocksize, quant_type=quant_type)

err = (A1 - A2).abs().float()
relerr = (err / (A1.abs().float() + 1e-8)).mean()
idx = err > 1.0
err = err.mean()

assert A2.dtype == dtype
assert err.item() < 0.1
assert relerr.item() < 0.28

# With larger block sizes, we can expect this to blow up.
# At blocksize>=1024, don't even bother looking at relerr.
if blocksize <= 64:
assert err.item() < 0.1
assert relerr.item() < 0.28
elif blocksize <= 256:
assert err.item() < 0.11
assert relerr.item() < 0.30
elif blocksize <= 512:
assert err.item() < 0.12
assert relerr.item() < 0.31
elif quant_type == "fp4":
# 1024 => 0.48, 2048 => 0.52, 4096 => 0.56
assert err.item() < 0.08 + math.log2(blocksize) * 4e-2
else:
# 1024 => 0.8, 2048 => 0.88, 4096 => 0.96
assert err.item() < math.log2(blocksize) * 8e-2


@pytest.mark.parametrize("quant_type", ["fp4", "nf4"])