
Commit

drafting + refactoring new docs
Titus-von-Koeller committed Feb 1, 2024
1 parent 84b5fc0 commit 725d29a
Showing 12 changed files with 231 additions and 170 deletions.
3 changes: 0 additions & 3 deletions README.md
@@ -4,10 +4,7 @@ The bitsandbytes is a lightweight wrapper around CUDA custom functions, in parti



Resources:
- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) -- [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)

- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)

## TL;DR
**Requirements**
16 changes: 16 additions & 0 deletions docs/source/_toctree.yml
@@ -6,6 +6,8 @@
title: Quickstart
- local: installation
title: Installation
- local: moduletree
title: Module Tree
- title: Features & Integrations
sections:
- local: quantization
@@ -14,3 +16,17 @@
title: Optimizers
- local: integrations
title: Integrations
- local: qlora
title: QLoRA
- title: Support & Learning
sections:
- local: resources
title: Papers, related resources & how to cite
- local: faqs
title: FAQs (Frequently Asked Questions)
- title: Contributors Guidelines
sections:
- local: contributing
title: Contributing
# - local: code_of_conduct
# title: Code of Conduct
6 changes: 6 additions & 0 deletions docs/source/contributing.mdx
@@ -0,0 +1,6 @@
# Contributors guidelines
... still under construction ... (feel free to propose materials; `bitsandbytes` is a community project)

## Documentation
- [guideline for documentation syntax](https://github.com/huggingface/doc-builder#readme)
- images shall be uploaded via PR in the `bitsandbytes/` directory [here](https://huggingface.co/datasets/huggingface/documentation-images)
7 changes: 7 additions & 0 deletions docs/source/faqs.mdx
@@ -0,0 +1,7 @@
# FAQs

Please submit your questions in [this GitHub Discussion thread](https://github.com/TimDettmers/bitsandbytes/discussions/1013) if you feel they are likely to affect many other users and aren't yet sufficiently covered in the documentation.

We'll pick the most generally applicable ones and post the Q&As here or integrate them into the general documentation (doc PRs are also very welcome).

# ... under construction ...
3 changes: 3 additions & 0 deletions docs/source/integrations.mdx
@@ -1,5 +1,8 @@
# Transformers
... TODO: to be filled out ...

# PEFT
... TODO: to be filled out ...

# Trainer for the optimizers
... TODO: to be filled out ...
58 changes: 2 additions & 56 deletions docs/source/introduction.mdx
@@ -1,39 +1,11 @@
TODO: Many parts of this doc will still be redistributed among the new doc structure.

# `bitsandbytes`

The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

There are ongoing efforts to support further hardware backends, e.g. Intel CPU + GPU, AMD GPU, and Apple Silicon. Windows support is on its way as well.

# Resources:
- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) -- [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)

- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)

## TL;DR
**Requirements**
Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.

(Deprecated: CUDA 10.0 is deprecated; only CUDA >= 11.0 will be supported with release 0.39.0.)

**Installation**:

``pip install bitsandbytes``

In some cases you may need to compile from source. If this happens, please consider submitting a bug report with the output of `python -m bitsandbytes`. What follows are short instructions that might work out of the box if `nvcc` is installed. If they do not work, see further below.

Compilation quickstart:
```bash
git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes

# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120}
# make argument in {cuda110, cuda11x, cuda12x}
# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes
CUDA_VERSION=117 make cuda11x
python setup.py install
```

**Using Int8 inference with HuggingFace Transformers**

```python
from transformers import AutoModelForCausalLM
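# A hedged sketch of how an Int8 load typically looks (the checkpoint name is
# illustrative; `load_in_8bit=True` needs a CUDA-enabled bitsandbytes install and
# `device_map="auto"` needs the accelerate package):
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    device_map="auto",
    load_in_8bit=True,
)
```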
@@ -89,9 +61,6 @@ The bitsandbytes library is currently only supported on Linux distributions. Win

The requirements can best be fulfilled by installing PyTorch via Anaconda. You can install PyTorch by following the ["Get Started"](https://pytorch.org/get-started/locally/) instructions on the official website.

To install run:

``pip install bitsandbytes``

## Using bitsandbytes

@@ -166,26 +135,3 @@ For more detailed instruction, please follow the [compile_from_source.md](compil
The majority of bitsandbytes is licensed under MIT; however, portions of the project are available under separate license terms: PyTorch is licensed under the BSD license.

We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.

## How to cite us
If you found this library and found LLM.int8() useful, please consider citing our work:

```bibtex
@article{dettmers2022llmint8,
title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:2208.07339},
year={2022}
}
```

For 8-bit optimizers or quantization routines, please consider citing the following work:

```bibtex
@article{dettmers2022optimizers,
title={8-bit Optimizers via Block-wise Quantization},
author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
journal={9th International Conference on Learning Representations, ICLR},
year={2022}
}
```
5 changes: 5 additions & 0 deletions docs/source/moduletree.mdx
@@ -0,0 +1,5 @@
# Module tree overview

- **bitsandbytes.functional**: Contains quantization functions and stateless 8-bit optimizer update functions.
- **bitsandbytes.nn.modules**: Contains the stable embedding layer with automatic 32-bit optimizer overrides (important for NLP stability).
- **bitsandbytes.optim**: Contains 8-bit optimizers.
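
A minimal sketch of where these namespaces show up in practice (shapes and hyperparameters are illustrative, and `quantize_blockwise` assumes a CUDA-enabled install):

```python
import torch
import bitsandbytes as bnb
import bitsandbytes.functional as F

emb = bnb.nn.StableEmbedding(1000, 64)                 # bitsandbytes.nn.modules
adam = bnb.optim.Adam8bit(emb.parameters(), lr=1e-3)   # bitsandbytes.optim
q, quant_state = F.quantize_blockwise(torch.randn(4096).cuda())  # bitsandbytes.functional
```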
166 changes: 96 additions & 70 deletions docs/source/optimizers.mdx
@@ -1,103 +1,129 @@
# Introduction: 8-bit optimizers

With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:

- Faster (e.g. 4x faster than regular Adam)
- 75% less memory, same performance (see the back-of-the-envelope arithmetic below)
- No hyperparameter tuning needed

8-bit optimizers are mostly useful to finetune large models that did not fit into memory before. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams at Facebook. This optimizer saves a ton of memory at no cost to accuracy.

See here the biggest models (TODO: add link).

We feature 8-bit Adam/AdamW, SGD with momentum, LARS, LAMB, and RMSProp.

It only requires a two-line code change to get started:
```python
import bitsandbytes as bnb

# before: adam = torch.optim.Adam(...)
adam = bnb.optim.Adam8bit(...)

# recommended for NLP models
# before: torch.nn.Embedding(...)
bnb.nn.StableEmbedding(...)
```
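
To make the memory bullet above concrete, here is the back-of-the-envelope arithmetic for Adam (a sketch; actual savings differ slightly because one absmax scale is stored per quantization block):

```python
params = 1_000_000_000      # a 1B-parameter model
adam_states = 2             # first and second moment per parameter

bytes_fp32 = params * adam_states * 4   # 8.0 GB of optimizer state with 32-bit Adam
bytes_int8 = params * adam_states * 1   # 2.0 GB with 8-bit Adam
print(1 - bytes_int8 / bytes_fp32)      # 0.75 -> the "75% less memory" figure
```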

The arguments passed to `bnb.optim.Adam8bit` are the same as for standard Adam. For NLP models we also recommend using the `StableEmbedding` layers, which improve results and help with stable 8-bit optimization.

## Overview of expected gradients

TODO: add pics here, no idea how to do that.

Want to add both pics in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/bitsandbytes
# Research Background

Stateful optimizers maintain gradient statistics over time, e.g. the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent, but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. `bitsandbytes` optimizers use 8-bit statistics, while maintaining the performance levels of 32-bit optimizer states.

To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
1) **Block-wise quantization** divides input tensors into smaller blocks that are quantized independently, isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization.
2) **Dynamic quantization** quantizes both small and large values with high precision.
3) A **stable embedding layer** improves stability during optimization for models with word embeddings.

With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.

We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers.

For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861).
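
To make the block-wise idea concrete, here is a minimal sketch in plain PyTorch of absmax block-wise quantization of an optimizer state to 8 bits and back. It is only an illustration under simplifying assumptions (the helper names and block size are made up, and it uses linear rather than dynamic quantization); the real bitsandbytes kernels additionally fuse dequantize-update-requantize into a single GPU kernel:

```python
import torch

def blockwise_quantize(state: torch.Tensor, block_size: int = 256):
    """Illustrative absmax block-wise quantization to int8 (not the bitsandbytes kernel)."""
    flat = state.flatten().float()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))            # pad so blocks divide evenly
    blocks = flat.view(-1, block_size)
    absmax = blocks.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8)
    q = torch.round(blocks / absmax * 127).to(torch.int8)     # one scale per block isolates outliers
    return q, absmax

def blockwise_dequantize(q, absmax, numel, shape):
    blocks = q.float() / 127 * absmax                         # back to 32-bit for the optimizer update
    return blocks.flatten()[:numel].view(shape)

# dequantize -> update -> requantize, as described above
state = torch.randn(1000)                                     # e.g. Adam's second moment
q, absmax = blockwise_quantize(state)
state32 = blockwise_dequantize(q, absmax, state.numel(), state.shape)
print((state - state32).abs().max())                          # small per-block quantization error
```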

## Stable Embedding Layer

The Stable Embedding Layer enhances the standard word embedding layer for improved training stability in NLP tasks. It addresses the challenge of non-uniform input distributions and mitigates extreme gradient variations, ensuring a smoother training process.

### Features:

- **Initialization**: Utilizes Xavier uniform initialization to maintain consistent variance, reducing the likelihood of large gradients.
- **Normalization**: Incorporates layer normalization before adding positional embeddings, aiding output stability.
- **Optimizer States**: Employs 32-bit optimizer states exclusively for this layer to enhance stability, while the rest of the model may use standard 16-bit precision.

### Benefits:

- Designed to support more aggressive quantization strategies without compromising training stability.
- Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.
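
A minimal usage sketch (the constructor mirrors `torch.nn.Embedding`; vocabulary size, embedding dimension and batch shape are illustrative):

```python
import torch
import bitsandbytes as bnb

# before: emb = torch.nn.Embedding(num_embeddings=50_000, embedding_dim=768)
emb = bnb.nn.StableEmbedding(num_embeddings=50_000, embedding_dim=768)

token_ids = torch.randint(0, 50_000, (4, 128))  # (batch, sequence_length)
hidden = emb(token_ids)                         # same forward semantics as nn.Embedding, plus layer norm
print(hidden.shape)                             # torch.Size([4, 128, 768])
```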

# Usage

Some more examples of how you can replace your old optimizer with the 8-bit optimizer:

```python
import bitsandbytes as bnb

# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent

# use 32-bit Adam with 5th percentile clipping
adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
                      optim_bits=32, percentile_clipping=5)
```

# How to override config hyperparameters for particular weights/parameters

If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With it, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:

1) Register the parameters while they are still on the CPU.
2) Override the config with the new desired hyperparameters (anytime, anywhere).

For global overrides in many different places in your code you can do:

```python
import torch
import bitsandbytes as bnb

mng = bnb.optim.GlobalOptimManager.get_instance()

model = MyModel()
mng.register_parameters(model.parameters())  # 1. register parameters while still on CPU

model = model.cuda()
# use 8-bit optimizer states for all parameters
adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)

# 2a. override: the parameter model.fc1.weight now uses 32-bit Adam
mng.override_config(model.fc1.weight, 'optim_bits', 32)

# 2b. override: the two special layers use
# sparse optimization + different learning rate + different Adam betas
mng.override_config([model.special.weight, model.also_special.weight],
                    key_value_dict={'is_sparse': True, 'lr': 1e-5, 'betas': (0.9, 0.98)})
```
Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`.

For overrides of particular layers, we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
```python
import torch
from bitsandbytes.optim import GlobalOptimManager

class MyModule(torch.nn.Module):
    def __init__(self, din, dout):
        super(MyModule, self).__init__()
        self.linear = torch.nn.Linear(din, dout)
        # optimization will happen in 32-bit and
        # the learning rate will be set to 0.0001, independent of the main learning rate
        config = {'optim_bits': 32, 'lr': 0.0001}
        GlobalOptimManager.get_instance().register_module_override(self, 'weight', config)
```
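
As a follow-up sketch (hedged: `MyModule` is the class defined just above and the dimensions are illustrative), the registered override is applied to that parameter's optimizer state once a bitsandbytes optimizer is set up over the model's parameters:

```python
import bitsandbytes as bnb

module = MyModule(1024, 1024).cuda()
adam = bnb.optim.Adam(module.parameters(), lr=0.001, optim_bits=8)
# the parameter registered above gets 32-bit optimizer state and lr=0.0001;
# all other parameters use 8-bit state with the main learning rate
```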

# API Docs

... under construction ...

Here we'll provide auto-generated API docs soon. Please feel free to contribute doc-strings for the respective optimizers, as `bitsandbytes` is a community effort.
1 change: 1 addition & 0 deletions docs/source/qlora.mdx
@@ -0,0 +1 @@
# ... under construction ... (contributions welcome)
6 changes: 5 additions & 1 deletion docs/source/quantization.mdx
@@ -1 +1,5 @@
Linear8bitLt & Linear4bit
# Linear8bitLt
... TODO: to be filled out ...

# Linear4bit
... TODO: to be filled out ...
