final polish (except integrations)
Titus-von-Koeller committed Feb 4, 2024
1 parent e00cbc9 commit 8a67759
Showing 6 changed files with 51 additions and 52 deletions.
12 changes: 5 additions & 7 deletions README.md
@@ -1,19 +1,17 @@
# `bitsandbytes`

The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 + 4-bit quantization functions.
The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.

The library includes quantization primitives for 8-bit & 4-bit operations, through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit` and 8bit optimizers through `bitsandbytes.optim` module.
The library includes quantization primitives for 8-bit & 4-bit operations through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit`, and 8-bit optimizers through the `bitsandbytes.optim` module.
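As a quick illustration of those entry points (a sketch only, not part of the README; the constructor arguments shown are assumptions, a CUDA-enabled install is assumed, and the documentation linked below is the authoritative reference):

```py
import torch
import bitsandbytes as bnb

# 8-bit / 4-bit linear layers as drop-in replacements for torch.nn.Linear
linear_8bit = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False)
linear_4bit = bnb.nn.Linear4bit(1024, 1024, compute_dtype=torch.float16)

# 8-bit optimizer as a drop-in replacement for torch.optim.Adam
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
```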

There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is quite far along and is on its way as well.

**Please head to the official documentation page:**

**[https://huggingface.co/docs/bitsandbytes/main](https://huggingface.co/docs/bitsandbytes/main)**

## License


# License

The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms, as the parts adapted from Pytorch are licensed under the BSD license.
The majority of bitsandbytes is licensed under MIT; however, small portions of the project are available under separate license terms, as the parts adapted from PyTorch are licensed under the BSD license.

We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.
2 changes: 1 addition & 1 deletion docs/source/algorithms.mdx
@@ -1,7 +1,7 @@
# Other algorithms
_WIP: Still incomplete... Community contributions would be very welcome!_

This is an overview of the functional API in `bitsandbytes` that we think would also be useful as standalone entities.
This is an overview of the `bnb.functional` API in `bitsandbytes`, with functions that we think would also be useful as standalone entities.

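For instance, the block-wise quantization primitives can be called directly on a tensor. A sketch, assuming a CUDA-enabled install and that `quantize_blockwise`/`dequantize_blockwise` keep their current return values — check the API reference before relying on it:

```py
import torch
import bitsandbytes.functional as F

x = torch.randn(4096, device="cuda")

# quantize to 8-bit with block-wise absmax scaling, then reconstruct
x_q, quant_state = F.quantize_blockwise(x)
x_dq = F.dequantize_blockwise(x_q, quant_state)

print((x - x_dq).abs().max())  # small quantization error
```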
## Using Int8 Matrix Multiplication

7 changes: 2 additions & 5 deletions docs/source/errors.mdx
@@ -4,14 +4,11 @@

This problem arises when the CUDA version loaded by bitsandbytes is not supported by your GPU, or if your PyTorch CUDA version does not match.

To solve this problem you need to debug ``$LD_LIBRARY_PATH``, ``$CUDA_HOME``, ``$PATH``. You can print these via ``echo $PATH``. You should look for multiple paths to different CUDA versions. This can include versions in your anaconda path, for example ``$HOME/anaconda3/lib``. You can check those versions via ``ls -l $HOME/anaconda3/lib/*cuda*`` or equivalent paths. Look at the CUDA versions of files in these paths. Does it match with ``nvidia-smi``?
To solve this problem you need to debug ``$LD_LIBRARY_PATH``, ``$CUDA_HOME`` as well as ``$PATH``. You can print these via ``echo $PATH``. You should look for multiple paths to different CUDA versions. This can include versions in your anaconda path, for example ``$HOME/anaconda3/lib``. You can check those versions via ``ls -l $HOME/anaconda3/lib/*cuda*`` or equivalent paths. Look at the CUDA versions of files in these paths. Does it match with ``nvidia-smi``?
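If it is easier, the same environment variables (plus the CUDA version PyTorch was built against) can be inspected from Python — an illustrative snippet, nothing bitsandbytes-specific:

```py
import os
import torch

print("PyTorch built against CUDA:", torch.version.cuda)
print("CUDA_HOME:", os.environ.get("CUDA_HOME"))
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH"))
print("PATH:", os.environ.get("PATH"))
```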

If you are feeling lucky, you can also try to compile the library from source. This can still be problematic if your PATH variables have multiple CUDA versions. As such, it is recommended to figure out path conflicts before you proceed with compilation.

__If you encounter any other error not listed here, please create an issue. This will help resolve your problem and will help out others in the future.__


## fatbinwrap
## `fatbinwrap`

This error occurs if there is a mismatch between the CUDA versions of the C++ library and the CUDA part. Make sure you have the right CUDA version in your `$PATH` and `$LD_LIBRARY_PATH` variables. In the conda base environment you can find the library under:

8 changes: 5 additions & 3 deletions docs/source/installation.mdx
@@ -29,14 +29,14 @@ python setup.py install

with `XXX` being your CUDA version; for <12.0, call `make cuda 11x`. Note that support for non-CUDA GPUs (e.g. AMD, Intel) is also coming soon.

For a more detailed guide, head to the [dedicated page on the topic](./compiling)
For a more detailed compilation guide, head to the [dedicated page on the topic](./compiling)

</hfoption>
<hfoption id="Windows">

## Windows

Currently for Windows users, you need to build bitsandbytes from source
Currently for Windows users, you need to build bitsandbytes from source:

```bash
git clone https://github.com/TimDettmers/bitsandbytes.git && cd bitsandbytes/
@@ -47,12 +47,14 @@ python -m build --wheel

Big thanks to [wkpark](https://github.com/wkpark), [Jamezo97](https://github.com/Jamezo97), [rickardp](https://github.com/rickardp), [akx](https://github.com/akx) for their amazing contributions to make bitsandbytes compatible with Windows.

For a more detailed compilation guide, head to the [dedicated page on the topic](./compiling)

</hfoption>
<hfoption id="MacOS">

## MacOS

Mac support is still a work in progress. Please make sure to check out the latest bitsandbytes issues to get notified about the progress with respect to MacOS integration.
Mac support is still a work in progress. Please make sure to check out the [Apple Silicon implementation coordination issue](https://github.com/TimDettmers/bitsandbytes/issues/1020) to get notified about the discussions and progress with respect to MacOS integration.

</hfoption>

69 changes: 34 additions & 35 deletions docs/source/optimizers.mdx
@@ -1,19 +1,19 @@
# Introduction: 8-bit optimizers

With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:
With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers, with the following properties:

- Faster (e.g. 4x faster than regular Adam)
- 75% less memory, same performance
- No hyperparameter tuning needed

8-bit optimizers are mostly useful to finetune large models that did not fit into memory before. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams at Facebook. This optimizer saves a ton of memory at no cost in accuracy.

Our 8-bit optimizers have three components:
Generally, our 8-bit optimizers have three components:
1. **block-wise quantization** isolates outliers and distributes the error more equally over all bits,
2. **dynamic quantization** quantizes both small and large values with high precision,
3. a **stable embedding layer** improves stability during optimization for models with word embeddings.

With these components, performing an optimizer update with 8-bit states is straightforward and for GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers. [Further details below](#research-background)
With these components, performing an optimizer update with 8-bit states is straightforward and, for GPUs, this makes 8-bit optimizers much faster than regular 32-bit optimizers. [Further details below](#research-background).

We feature 8-bit `Adagrad`, `Adam`, `AdamW`, `LAMB`, `LARS`, `Lion`, `RMSprop` and `SGD` (momentum).

@@ -24,27 +24,40 @@ We feature 8-bit `Adagrad`, `Adam`, `AdamW`, `LAMB`, `LARS`, `Lion`, `RMSprop` a
## Usage

It only requires a two-line code change to get started.
```py
```diff
import bitsandbytes as bnb

# before: adam = torch.optim.Adam(...)
adam = bnb.optim.Adam8bit(...)
- adam = torch.optim.Adam(...)
+ adam = bnb.optim.Adam8bit(...)

# recommended for NLP models
# before: torch.nn.Embedding(...)
bnb.nn.StableEmbedding(...)
- torch.nn.Embedding(...)
+ bnb.nn.StableEmbedding(...)
```

The arguments passed are the same as standard Adam. For NLP models we recommend also to use the StableEmbedding layers which improves results and helps with stable 8-bit optimization.
The arguments passed are the same as for standard Adam. For NLP models, we also recommend using the StableEmbedding layer, which improves results and helps with stable 8-bit optimization.

Note that by default all parameter tensors with less than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done since such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:

```py
# parameter tensors with less than 16384 values are optimized in 32-bit
# it is recommended to use multiplies of 4096
# Parameter tensors with fewer than 16384 values are optimized in 32-bit;
# it is recommended to use multiples of 4096:
adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
```

Some more examples of how you can replace your old optimizer with the 8-bit optimizer:

```diff
import bitsandbytes as bnb

- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
+ adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer

# use 32-bit Adam with 5th percentile clipping
+ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32, percentile_clipping=5)
- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
```

## Overview of supported 8-bit optimizers

Currently, `bitsandbytes` supports the following optimizers:
@@ -58,9 +71,9 @@ Currently, `bitsandbytes` supports the following optimizers:
- `RMSprop`, `RMSprop8bit`, `RMSprop32bit`
- `SGD`, `SGD8bit`, `SGD32bit`

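Picking a variant is just a matter of the class suffix — a quick sketch (parameter values are arbitrary; momentum is set for SGD so there is an optimizer state to quantize):

```py
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(256, 256).cuda()

adamw_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-3)            # 8-bit optimizer states
sgd_8bit = bnb.optim.SGD8bit(model.parameters(), lr=1e-2, momentum=0.9)  # momentum gives SGD a state to quantize
```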
Additionally, for cases in which you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`, which is explained [below](#optim_manager).
Additionally, for cases in which you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`, [as explained in greater detail below](#optim_manager).

Find the API docs [here](#optim_api_docs). (still under construction)
Find the API docs [here](#optim_api_docs) (still under construction).

## Overview of expected gains

@@ -81,12 +94,12 @@ Stateful optimizers maintain gradient statistics over time, e.g. the exponential
To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:

1. **Block-wise quantization** divides input tensors into smaller blocks that are independently quantized, therein isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization.
2. **dynamic quantization**, which quantizes both small and large values with high precision,
2. **Dynamic quantization**, which quantizes both small and large values with high precision and
3. a **stable embedding layer**, which improves stability during optimization for models with word embeddings.

With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.

We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers.
We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers much faster than regular 32-bit optimizers.

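To make the round trip concrete, here is a toy block-wise absmax quantizer in plain PyTorch (illustrative only — the real implementation uses a dynamic 8-bit data type and fused CUDA kernels rather than this linear int8 mapping):

```py
import torch

def blockwise_quantize(x: torch.Tensor, block_size: int = 4096):
    # flatten, pad to a multiple of the block size, and quantize each
    # block independently with its own absmax scale
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)
    absmax = blocks.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8)
    q = torch.round(blocks / absmax * 127).to(torch.int8)
    return q, absmax

def blockwise_dequantize(q: torch.Tensor, absmax: torch.Tensor, numel: int):
    return (q.to(torch.float32) / 127 * absmax).flatten()[:numel]

state = torch.randn(10_000)
q, absmax = blockwise_quantize(state)
recovered = blockwise_dequantize(q, absmax, state.numel())
print((state - recovered).abs().max())  # small quantization error
```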
For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861).

@@ -105,7 +118,7 @@ The Stable Embedding Layer enhances the standard word embedding layer for improv
- Designed to support more aggressive quantization strategies without compromising training stability.
- Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.

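A usage sketch (the constructor mirrors `torch.nn.Embedding`; treat the argument names as assumptions):

```py
import torch
import bitsandbytes as bnb

# drop-in replacement for torch.nn.Embedding in NLP models
emb = bnb.nn.StableEmbedding(num_embeddings=30_000, embedding_dim=768)
token_ids = torch.randint(0, 30_000, (2, 16))
hidden = emb(token_ids)  # shape (2, 16, 768)
```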
## Paged Optimizers
## Paged optimizers

Paged optimizers are built on top of the [unified memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) feature of CUDA. This feature is not supported by PyTorch, so we added it to `bitsandbytes`.

@@ -119,27 +132,13 @@ Compared to CPU offloading, this has the advantage that there is zero overhead i

[Find more details in this discussion](https://github.com/TimDettmers/bitsandbytes/issues/962).
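As a usage sketch, a paged optimizer is selected just like any other `bitsandbytes` optimizer (the class name below is an assumption based on the current release; adjust to whatever paged variants your version exposes):

```py
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()

# optimizer states live in CUDA unified memory and are evicted to CPU RAM
# only when the GPU would otherwise run out of memory
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-4)
```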

## Usage

Some more examples of how you can replace your old optimizer with the 8-bit optimizer:

```diff
import bitsandbytes as bnb

- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
+ adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer

# use 32-bit Adam with 5th percentile clipping
+ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32, percentile_clipping=5)
- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
```

### How to override config hyperparameters for particular weights/parameters[[optim_manager]]
## `GlobalOptimManager`: How to override config hyperparameters for particular weights/parameters[[optim_manager]]

If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:

1. Register the parameter while they are still on the CPU,
2. override the config with the new desired hyperparameters (anytime, anywhere)
1. Register the parameters while they are still on the CPU.
2. Override the config with the new desired hyperparameters (anytime, anywhere).

For global overrides in many different places in your code, you can do:

@@ -164,9 +163,9 @@ mng.override_config(model.fc1.weight, 'optim_bits', 32)
mng.override_config([model.special.weight, model.also_special.weight],
                    key_value_dict={'is_sparse': True, 'lr': 1e-5, 'betas': (0.9, 0.98)})
```
Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`
Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`.
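A self-contained sketch of the full global-override flow, for reference (method names follow the current API as far as I know; treat exact signatures as assumptions):

```py
import torch
import bitsandbytes as bnb

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(1000, 64)
        self.fc1 = torch.nn.Linear(64, 64)

    def forward(self, x):
        return self.fc1(self.emb(x))

model = Net()

# 1. register the parameters while the model is still on the CPU
mng = bnb.optim.GlobalOptimManager.get_instance()
mng.register_parameters(model.parameters())

model = model.cuda()
adam = bnb.optim.Adam(model.parameters(), lr=1e-3, optim_bits=8)  # 8-bit states by default

# 2. override the config for particular parameters, e.g. keep fc1 in 32-bit
mng.override_config(model.fc1.weight, 'optim_bits', 32)
```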

For overrides for particular layers we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
For overrides for particular layers, we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
```py
class MyModule(torch.nn.Module):
    def __init__(self, din, dout):
5 changes: 4 additions & 1 deletion docs/source/quickstart.mdx
@@ -4,9 +4,12 @@

... work in progress ...

## Minimal example
(Community contributions would be very welcome!)

## Minimal examples

The following code illustrates the steps above.

```py
code examples will soon follow
```
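Until then, a minimal sketch drawn from the optimizer documentation in this repo (assumes a CUDA-enabled install) would look roughly like this:

```py
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(512, 512).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)  # drop-in for torch.optim.Adam

x = torch.randn(8, 512, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```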
