
Commit

drafting + refactoring new docs
Titus-von-Koeller committed Feb 1, 2024
1 parent 84b5fc0 commit 725d29a
Showing 12 changed files with 231 additions and 170 deletions.
3 changes: 0 additions & 3 deletions README.md
@@ -4,10 +4,7 @@ The bitsandbytes is a lightweight wrapper around CUDA custom functions, in parti



Resources:
- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) -- [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)

- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)

## TL;DR
**Requirements**
16 changes: 16 additions & 0 deletions docs/source/_toctree.yml
@@ -6,6 +6,8 @@
title: Quickstart
- local: installation
title: Installation
- local: moduletree
title: Module Tree
- title: Features & Integrations
sections:
- local: quantization
@@ -14,3 +16,17 @@
title: Optimizers
- local: integrations
title: Integrations
- local: qlora
title: QLoRA
- title: Support & Learning
sections:
- local: resources
title: Papers, related resources & how to cite
- local: faqs
title: FAQs (Frequently Asked Questions)
- title: Contributors Guidelines
sections:
- local: contributing
title: Contributing
# - local: code_of_conduct
# title: Code of Conduct
6 changes: 6 additions & 0 deletions docs/source/contributing.mdx
@@ -0,0 +1,6 @@
# Contributors guidelines
... still under construction ... (feel free to propose materials; `bitsandbytes` is a community project)

## Documentation
- [guideline for documentation syntax](https://github.com/huggingface/doc-builder#readme)
- images shall be uploaded via PR in the `bitsandbytes/` directory [here](https://huggingface.co/datasets/huggingface/documentation-images)
7 changes: 7 additions & 0 deletions docs/source/faqs.mdx
@@ -0,0 +1,7 @@
# FAQs

Please submit your questions in [this GitHub Discussion thread](https://github.com/TimDettmers/bitsandbytes/discussions/1013) if you feel they are likely to affect many other users and aren't yet sufficiently covered in the documentation.

We'll pick the most generally applicable ones and post the Q&As here or integrate them into the general documentation (doc PRs are also very welcome).

# ... under construction ...
3 changes: 3 additions & 0 deletions docs/source/integrations.mdx
@@ -1,5 +1,8 @@
# Transformers
... TODO: to be filled out ...

# PEFT
... TODO: to be filled out ...

# Trainer for the optimizers
... TODO: to be filled out ...
58 changes: 2 additions & 56 deletions docs/source/introduction.mdx
@@ -1,39 +1,11 @@
TODO: Many parts of this doc will still be redistributed among the new doc structure.

# `bitsandbytes`

The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

There are ongoing efforts to support further hardware backends, e.g. Intel CPU + GPU, AMD GPU, and Apple Silicon. Windows support is on its way as well.

# Resources:
- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) -- [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)

- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)

## TL;DR
**Requirements**
Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.

(Deprecated: CUDA 10.0 is deprecated; only CUDA >= 11.0 will be supported with release 0.39.0.)

**Installation**:

``pip install bitsandbytes``

In some cases you may need to compile from source. If this happens, please consider submitting a bug report with the output of `python -m bitsandbytes`. What follows are short instructions that might work out of the box if `nvcc` is installed. If they do not work, see further below.

Compilation quickstart:
```bash
git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes

# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120}
# make argument in {cuda110, cuda11x, cuda12x}
# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes
CUDA_VERSION=117 make cuda11x
python setup.py install
```

**Using Int8 inference with HuggingFace Transformers**

```python
from transformers import AutoModelForCausalLM
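# A hedged sketch of how an Int8 load typically looks (the checkpoint name is
# illustrative; `load_in_8bit=True` needs a CUDA-enabled bitsandbytes install and
# `device_map="auto"` needs the accelerate package):
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    device_map="auto",
    load_in_8bit=True,
)
```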
@@ -89,9 +61,6 @@ The bitsandbytes library is currently only supported on Linux distributions. Win

The requirements can best be fulfilled by installing PyTorch via Anaconda. You can install PyTorch by following the ["Get Started"](https://pytorch.org/get-started/locally/) instructions on the official website.

To install run:

``pip install bitsandbytes``

## Using bitsandbytes

@@ -166,26 +135,3 @@ For more detailed instruction, please follow the [compile_from_source.md](compil
The majority of bitsandbytes is licensed under MIT; however, portions of the project are available under separate license terms: PyTorch is licensed under the BSD license.

We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.

## How to cite us
If you found this library and found LLM.int8() useful, please consider citing our work:

```bibtex
@article{dettmers2022llmint8,
title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:2208.07339},
year={2022}
}
```

For 8-bit optimizers or quantization routines, please consider citing the following work:

```bibtex
@article{dettmers2022optimizers,
title={8-bit Optimizers via Block-wise Quantization},
author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
journal={9th International Conference on Learning Representations, ICLR},
year={2022}
}
```
5 changes: 5 additions & 0 deletions docs/source/moduletree.mdx
@@ -0,0 +1,5 @@
# Module tree overview

- **bitsandbytes.functional**: Contains quantization functions and stateless 8-bit optimizer update functions.
- **bitsandbytes.nn.modules**: Contains the stable embedding layer with automatic 32-bit optimizer overrides (important for NLP stability).
- **bitsandbytes.optim**: Contains 8-bit optimizers.
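
A minimal sketch of where these namespaces show up in practice (shapes and hyperparameters are illustrative, and `quantize_blockwise` assumes a CUDA-enabled install):

```python
import torch
import bitsandbytes as bnb
import bitsandbytes.functional as F

emb = bnb.nn.StableEmbedding(1000, 64)                 # bitsandbytes.nn.modules
adam = bnb.optim.Adam8bit(emb.parameters(), lr=1e-3)   # bitsandbytes.optim
q, quant_state = F.quantize_blockwise(torch.randn(4096).cuda())  # bitsandbytes.functional
```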
166 changes: 96 additions & 70 deletions docs/source/optimizers.mdx
@@ -1,103 +1,129 @@
# Introduction: 8-bit optimizers

With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:

- Faster (e.g. 4x faster than regular Adam)
- 75% less memory, same performance (see the back-of-the-envelope arithmetic below)
- No hyperparameter tuning needed

8-bit optimizers are mostly useful to finetune large models that did not fit into memory before. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams at Facebook. This optimizer saves a ton of memory at no cost to accuracy.

See here the biggest models (TODO: add link).

We feature 8-bit Adam/AdamW, SGD with momentum, LARS, LAMB, and RMSProp.

It only requires a two-line code change to get started:
```python
import bitsandbytes as bnb

# before: adam = torch.optim.Adam(...)
adam = bnb.optim.Adam8bit(...)

# recommended for NLP models
# before: torch.nn.Embedding(...)
bnb.nn.StableEmbedding(...)
```
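
To make the memory bullet above concrete, here is the back-of-the-envelope arithmetic for Adam (a sketch; actual savings differ slightly because one absmax scale is stored per quantization block):

```python
params = 1_000_000_000      # a 1B-parameter model
adam_states = 2             # first and second moment per parameter

bytes_fp32 = params * adam_states * 4   # 8.0 GB of optimizer state with 32-bit Adam
bytes_int8 = params * adam_states * 1   # 2.0 GB with 8-bit Adam
print(1 - bytes_int8 / bytes_fp32)      # 0.75 -> the "75% less memory" figure
```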

The arguments passed to `bnb.optim.Adam8bit` are the same as for standard Adam. For NLP models we also recommend using the `StableEmbedding` layers, which improve results and help with stable 8-bit optimization.

## Overview of expected gradients

TODO: add pics here, no idea how to do that.

Want to add both pics in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/bitsandbytes
# Research Background

Stateful optimizers maintain gradient statistics over time, e.g. the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent, but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. `bitsandbytes` optimizers use 8-bit statistics, while maintaining the performance levels of 32-bit optimizer states.

To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
1) **Block-wise quantization** divides input tensors into smaller blocks that are quantized independently, isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization.
2) **Dynamic quantization** quantizes both small and large values with high precision.
3) A **stable embedding layer** improves stability during optimization for models with word embeddings.

With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.

We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers.

For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861).
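
To make the block-wise idea concrete, here is a minimal sketch in plain PyTorch of absmax block-wise quantization of an optimizer state to 8 bits and back. It is only an illustration under simplifying assumptions (the helper names and block size are made up, and it uses linear rather than dynamic quantization); the real bitsandbytes kernels additionally fuse dequantize-update-requantize into a single GPU kernel:

```python
import torch

def blockwise_quantize(state: torch.Tensor, block_size: int = 256):
    """Illustrative absmax block-wise quantization to int8 (not the bitsandbytes kernel)."""
    flat = state.flatten().float()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))            # pad so blocks divide evenly
    blocks = flat.view(-1, block_size)
    absmax = blocks.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8)
    q = torch.round(blocks / absmax * 127).to(torch.int8)     # one scale per block isolates outliers
    return q, absmax

def blockwise_dequantize(q, absmax, numel, shape):
    blocks = q.float() / 127 * absmax                         # back to 32-bit for the optimizer update
    return blocks.flatten()[:numel].view(shape)

# dequantize -> update -> requantize, as described above
state = torch.randn(1000)                                     # e.g. Adam's second moment
q, absmax = blockwise_quantize(state)
state32 = blockwise_dequantize(q, absmax, state.numel(), state.shape)
print((state - state32).abs().max())                          # small per-block quantization error
```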

## Stable Embedding Layer

The Stable Embedding Layer enhances the standard word embedding layer for improved training stability in NLP tasks. It addresses the challenge of non-uniform input distributions and mitigates extreme gradient variations, ensuring a smoother training process.

### Features:

- **Initialization**: Utilizes Xavier uniform initialization to maintain consistent variance, reducing the likelihood of large gradients.
- **Normalization**: Incorporates layer normalization before adding positional embeddings, aiding output stability.
- **Optimizer States**: Employs 32-bit optimizer states exclusively for this layer to enhance stability, while the rest of the model may use standard 16-bit precision.

### Benefits:

- Designed to support more aggressive quantization strategies without compromising training stability.
- Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.
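
A minimal usage sketch (the constructor mirrors `torch.nn.Embedding`; vocabulary size, embedding dimension and batch shape are illustrative):

```python
import torch
import bitsandbytes as bnb

# before: emb = torch.nn.Embedding(num_embeddings=50_000, embedding_dim=768)
emb = bnb.nn.StableEmbedding(num_embeddings=50_000, embedding_dim=768)

token_ids = torch.randint(0, 50_000, (4, 128))  # (batch, sequence_length)
hidden = emb(token_ids)                         # same forward semantics as nn.Embedding, plus layer norm
print(hidden.shape)                             # torch.Size([4, 128, 768])
```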

# Usage

Some more examples of how you can replace your old optimizer with the 8-bit optimizer:

```python
import bitsandbytes as bnb

# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent

# use 32-bit Adam with 5th percentile clipping
adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
                      optim_bits=32, percentile_clipping=5)
```

# How to override config hyperparameters for particular weights/parameters

If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With it, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:

1) Register the parameters while they are still on the CPU.
2) Override the config with the new desired hyperparameters (anytime, anywhere).

For global overrides in many different places in your code you can do:

```python
import torch
import bitsandbytes as bnb

mng = bnb.optim.GlobalOptimManager.get_instance()

model = MyModel()
mng.register_parameters(model.parameters())  # 1. register parameters while still on CPU

model = model.cuda()
# use 8-bit optimizer states for all parameters
adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)

# 2a. override: the parameter model.fc1.weight now uses 32-bit Adam
mng.override_config(model.fc1.weight, 'optim_bits', 32)

# 2b. override: the two special layers use
# sparse optimization + different learning rate + different Adam betas
mng.override_config([model.special.weight, model.also_special.weight],
                    key_value_dict={'is_sparse': True, 'lr': 1e-5, 'betas': (0.9, 0.98)})
```
Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`.

For overrides of particular layers, we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
```python
import torch
from bitsandbytes.optim import GlobalOptimManager

class MyModule(torch.nn.Module):
    def __init__(self, din, dout):
        super(MyModule, self).__init__()
        self.linear = torch.nn.Linear(din, dout)
        # optimization will happen in 32-bit and
        # the learning rate will be set to 0.0001, independent of the main learning rate
        config = {'optim_bits': 32, 'lr': 0.0001}
        GlobalOptimManager.get_instance().register_module_override(self, 'weight', config)
```
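
As a follow-up sketch (hedged: `MyModule` is the class defined just above and the dimensions are illustrative), the registered override is applied to that parameter's optimizer state once a bitsandbytes optimizer is set up over the model's parameters:

```python
import bitsandbytes as bnb

module = MyModule(1024, 1024).cuda()
adam = bnb.optim.Adam(module.parameters(), lr=0.001, optim_bits=8)
# the parameter registered above gets 32-bit optimizer state and lr=0.0001;
# all other parameters use 8-bit state with the main learning rate
```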

# API Docs

... under construction ...

Here we'll provide auto-generated API docs soon. Please feel free to contribute doc-strings for the respective optimizers, as `bitsandbytes` is a community effort.
1 change: 1 addition & 0 deletions docs/source/qlora.mdx
@@ -0,0 +1 @@
# ... under construction ... (contributions welcome)
6 changes: 5 additions & 1 deletion docs/source/quantization.mdx
@@ -1 +1,5 @@
Linear8bitLt & Linear4bit
# Linear8bitLt
... TODO: to be filled out ...

# Linear4bit
... TODO: to be filled out ...
