From 8a67759cd91f707db3aa36b6dc1e5ab2b10dca35 Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:46:48 -0800
Subject: [PATCH] final polish (except integrations)
---
README.md | 12 +++----
docs/source/algorithms.mdx | 2 +-
docs/source/errors.mdx | 7 ++--
docs/source/installation.mdx | 8 +++--
docs/source/optimizers.mdx | 69 ++++++++++++++++++------------------
docs/source/quickstart.mdx | 5 ++-
6 files changed, 51 insertions(+), 52 deletions(-)
diff --git a/README.md b/README.md
index a9fb7f4e5..43eadf5a3 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,17 @@
# `bitsandbytes`
-The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 + 4-bit quantization functions.
+The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.
-The library includes quantization primitives for 8-bit & 4-bit operations, through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit` and 8bit optimizers through `bitsandbytes.optim` module.
+The library includes quantization primitives for 8-bit & 4-bit operations through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit`, and 8-bit optimizers through the `bitsandbytes.optim` module.
-There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
+There are ongoing efforts to support further hardware backends, e.g. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is quite far along as well.
**Please head to the official documentation page:**
**[https://huggingface.co/docs/bitsandbytes/main](https://huggingface.co/docs/bitsandbytes/main)**
+## License
-
-# License
-
-The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms, as the parts adapted from Pytorch are licensed under the BSD license.
+The majority of bitsandbytes is licensed under MIT; however, small portions of the project are available under separate license terms, as the parts adapted from PyTorch are licensed under the BSD license.
We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.
diff --git a/docs/source/algorithms.mdx b/docs/source/algorithms.mdx
index 53619bed2..d9db5cb04 100644
--- a/docs/source/algorithms.mdx
+++ b/docs/source/algorithms.mdx
@@ -1,7 +1,7 @@
# Other algorithms
_WIP: Still incomplete... Community contributions would be greatly welcome!_
-This is an overview of the functional API in `bitsandbytes` that we think would also be useful as standalone entities.
+This is an overview of the `bnb.functional` API in `bitsandbytes`, covering functions that we think would also be useful as standalone entities.
## Using Int8 Matrix Multiplication
diff --git a/docs/source/errors.mdx b/docs/source/errors.mdx
index 68fb7f938..293017173 100644
--- a/docs/source/errors.mdx
+++ b/docs/source/errors.mdx
@@ -4,14 +4,11 @@
This problem arises when the CUDA version loaded by bitsandbytes is not supported by your GPU, or if your PyTorch CUDA version does not match.
-To solve this problem you need to debug ``$LD_LIBRARY_PATH``, ``$CUDA_HOME``, ``$PATH``. You can print these via ``echo $PATH``. You should look for multiple paths to different CUDA versions. This can include versions in your anaconda path, for example ``$HOME/anaconda3/lib``. You can check those versions via ``ls -l $HOME/anaconda3/lib/*cuda*`` or equivalent paths. Look at the CUDA versions of files in these paths. Does it match with ``nvidia-smi``?
+To solve this problem, you need to debug ``$LD_LIBRARY_PATH``, ``$CUDA_HOME`` as well as ``$PATH``. You can print each of these via ``echo``, e.g. ``echo $PATH``. You should look for multiple paths to different CUDA versions. This can include versions in your anaconda path, for example ``$HOME/anaconda3/lib``. You can check those versions via ``ls -l $HOME/anaconda3/lib/*cuda*`` or the equivalent path on your system. Look at the CUDA versions of files in these paths. Do they match with ``nvidia-smi``?
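+
+If it helps, the following small Python sketch (not part of bitsandbytes, just an illustration) prints the CUDA version PyTorch was built with, your GPU's compute capability, and the relevant environment variables, so you can compare them against ``nvidia-smi``:
+
+```py
+import os
+import torch
+
+# CUDA version that PyTorch was compiled against
+print("torch CUDA version:", torch.version.cuda)
+if torch.cuda.is_available():
+    # Compute capability of your GPU, e.g. (8, 0) for an A100
+    print("GPU compute capability:", torch.cuda.get_device_capability())
+
+# Environment variables that determine which CUDA libraries are picked up
+for var in ("PATH", "LD_LIBRARY_PATH", "CUDA_HOME"):
+    print(var, "=", os.environ.get(var, "<not set>"))
+```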
If you are feeling lucky, you can also try to compile the library from source. This can still be problematic if your PATH variables have multiple CUDA versions. As such, it is recommended to figure out path conflicts before you proceed with compilation.
-__If you encounter any other error not listed here please create an issue. This will help resolve your problem and will help out others in the future.
-
-
-## fatbinwrap
+## `fatbinwrap`
This error occurs if there is a mismatch between CUDA versions in the C++ library and the CUDA part. Make sure you have the right CUDA in your `$PATH` and `$LD_LIBRARY_PATH` variables. In the conda base environment you can find the library under:
diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx
index fc559471d..ecdcdeb28 100644
--- a/docs/source/installation.mdx
+++ b/docs/source/installation.mdx
@@ -29,14 +29,14 @@ python setup.py install
with `XXX` being your CUDA version; for <12.0, call `make cuda 11x`. Note that support for non-CUDA GPUs (e.g. AMD, Intel) is also coming soon.
-For a more detailed guide, head to the [dedicated page on the topic](./compiling)
+For a more detailed compilation guide, head to the [dedicated page on the topic](./compiling).
## Windows
-Currently for Windows users, you need to build bitsandbytes from source
+Currently, Windows users need to build bitsandbytes from source:
```bash
git clone https://github.com/TimDettmers/bitsandbytes.git && cd bitsandbytes/
@@ -47,12 +47,14 @@ python -m build --wheel
Big thanks to [wkpark](https://github.com/wkpark), [Jamezo97](https://github.com/Jamezo97), [rickardp](https://github.com/rickardp), [akx](https://github.com/akx) for their amazing contributions to make bitsandbytes compatible with Windows.
+For a more detailed compilation guide, head to the [dedicated page on the topic](./compiling).
+
## MacOS
-Mac support is still a work in progress. Please make sure to check out the latest bitsandbytes issues to get notified about the progress with respect to MacOS integration.
+Mac support is still a work in progress. Please make sure to check out the [Apple Silicon implementation coordination issue](https://github.com/TimDettmers/bitsandbytes/issues/1020) to get notified about the discussions and progress with respect to MacOS integration.
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index d4597dd89..18d20de1d 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -1,6 +1,6 @@
# Introduction: 8-bit optimizers
-With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:
+With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers, with the following properties:
- Faster (e.g. 4x faster than regular Adam)
- 75% less memory, same performance
@@ -8,12 +8,12 @@ With 8-bit optimizers, larger models can be finetuned with the same GPU memory c
8-bit optimizers are mostly useful to finetune large models that did not fit into memory before. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams at Facebook. This optimizer saves a ton of memory with no loss in accuracy.
-Our 8-bit optimizers have three components:
+Generally, our 8-bit optimizers have three components:
1. **block-wise quantization** isolates outliers and distributes the error more equally over all bits,
2. **dynamic quantization** quantizes both small and large values with high precision,
3. a **stable embedding layer** improves stability during optimization for models with word embeddings.
-With these components, performing an optimizer update with 8-bit states is straightforward and for GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers. [Further details below](#research-background)
+With these components, performing an optimizer update with 8-bit states is straightforward, and for GPUs this makes 8-bit optimizers significantly faster than regular 32-bit optimizers. [Further details below](#research-background).
We feature 8-bit `Adagrad`, `Adam`, `AdamW`, `LAMB`, `LARS`, `Lion`, `RMSprop` and `SGD` (momentum).
@@ -24,27 +24,40 @@ We feature 8-bit `Adagrad`, `Adam`, `AdamW`, `LAMB`, `LARS`, `Lion`, `RMSprop` a
## Usage
It only requires a two-line code change to get started.
-```py
+```diff
import bitsandbytes as bnb
-# before: adam = torch.optim.Adam(...)
-adam = bnb.optim.Adam8bit(...)
+- adam = torch.optim.Adam(...)
++ adam = bnb.optim.Adam8bit(...)
# recommended for NLP models
-# before: torch.nn.Embedding(...)
-bnb.nn.StableEmbedding(...)
+- torch.nn.Embedding(...)
++ bnb.nn.StableEmbedding(...)
```
-The arguments passed are the same as standard Adam. For NLP models we recommend also to use the StableEmbedding layers which improves results and helps with stable 8-bit optimization.
+The arguments passed are the same as for standard Adam. For NLP models, we recommend also using the `StableEmbedding` layer, which improves results and helps with stable 8-bit optimization.
Note that by default all parameter tensors with less than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done since such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:
```py
-# parameter tensors with less than 16384 values are optimized in 32-bit
-# it is recommended to use multiplies of 4096
+# Parameter tensors with fewer than 16384 values are optimized in 32-bit;
+# it is recommended to use multiples of 4096:
adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
```
+Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
+
+```diff
+import bitsandbytes as bnb
+
+- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
++ adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
+
+# use 32-bit Adam with 5th percentile clipping
+- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
++ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32, percentile_clipping=5)
+```
+
## Overview of supported 8-bit optimizers
Currently, `bitsandbytes` supports the following optimizers:
@@ -58,9 +71,9 @@ Currently, `bitsandbytes` supports the following optimizers:
- `RMSprop`, `RMSprop8bit`, `RMSprop32bit`
- `SGD`, `SGD8bit`, `SGD32bit`
-Additionally, for cases in which you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`, which is explained [below](#optim_manager).
+Additionally, for cases in which you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`, [as explained in greater detail below](#optim_manager).
-Find the API docs [here](#optim_api_docs). (still under construction)
+Find the API docs [here](#optim_api_docs) (still under construction).
## Overview of expected gains
@@ -81,12 +94,12 @@ Stateful optimizers maintain gradient statistics over time, e.g. the exponential
To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
1. **Block-wise quantization** divides input tensors into smaller blocks that are independently quantized, therein isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization.
-2. **dynamic quantization**, which quantizes both small and large values with high precision,
+2. **Dynamic quantization**, which quantizes both small and large values with high precision, and
3. a **stable embedding layer** improves stability during optimization for models with word embeddings.
With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.
-We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers.
+We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers much faster than regular 32-bit optimizers.
For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861).
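+
+To get a feel for the block-wise quantization primitive itself, here is a rough sketch using `bnb.functional` (illustrative only; the optimizers perform these steps inside fused CUDA kernels rather than from Python):
+
+```py
+import torch
+import bitsandbytes.functional as F
+
+x = torch.randn(4096, device="cuda")
+
+# Quantize to 8-bit in independent blocks; returns the 8-bit tensor plus the
+# per-block quantization state needed to restore the values
+x_q, quant_state = F.quantize_blockwise(x)
+
+# Dequantize back to floating point, as done before each optimizer update
+x_deq = F.dequantize_blockwise(x_q, quant_state)
+print((x - x_deq).abs().max())  # small block-wise quantization error
+```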
@@ -105,7 +118,7 @@ The Stable Embedding Layer enhances the standard word embedding layer for improv
- Designed to support more aggressive quantization strategies without compromising training stability.
- Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.
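+
+As a usage illustration (a minimal sketch with placeholder vocabulary size and embedding dimension), the layer is constructed just like `torch.nn.Embedding`:
+
+```py
+import torch
+import bitsandbytes as bnb
+
+# Drop-in replacement for torch.nn.Embedding with the same constructor arguments
+emb = bnb.nn.StableEmbedding(num_embeddings=32000, embedding_dim=768)
+
+tokens = torch.randint(0, 32000, (2, 128))
+hidden = emb(tokens)  # embedding lookup followed by layer normalization
+```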
-## Paged Optimizers
+## Paged optimizers
Paged optimizers are built on top of the [unified memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) feature of CUDA. This feature is not supported by PyTorch and we added it to `bitsandbytes`.
@@ -119,27 +132,13 @@ Compared to CPU offloading, this has the advantage that there is zero overhead i
[Find more details in this discussion](https://github.com/TimDettmers/bitsandbytes/issues/962).
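+
+As an illustration, switching to a paged optimizer is just a change of the optimizer class (a sketch assuming the paged variants such as `PagedAdamW8bit` that ship with recent releases):
+
+```py
+import torch
+import bitsandbytes as bnb
+
+model = torch.nn.Linear(1024, 1024).cuda()
+
+# Paged 8-bit AdamW: optimizer states live in CUDA unified memory and are
+# paged to CPU RAM only when the GPU would otherwise run out of memory
+optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-3)
+
+loss = model(torch.randn(16, 1024, device="cuda")).sum()
+loss.backward()
+optimizer.step()
+optimizer.zero_grad()
+```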
-## Usage
-
-Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
-
-```diff
-import bitsandbytes as bnb
-
-- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
-+ adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
-
-# use 32-bit Adam with 5th percentile clipping
-+ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32, percentile_clipping=5)
-- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
-```
-### How to override config hyperparameters for particular weights/parameters[[optim_manager]]
+## `GlobalOptimManager`: How to override config hyperparameters for particular weights/parameters[[optim_manager]]
If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:
-1. Register the parameter while they are still on the CPU,
-2. override the config with the new desired hyperparameters (anytime, anywhere)
+1. Register the parameters while they are still on the CPU.
+2. Override the config with the new desired hyperparameters (anytime, anywhere).
For global overrides in many different places in your code you can do:
@@ -164,9 +163,9 @@ mng.override_config(model.fc1.weight, 'optim_bits', 32)
mng.override_config([model.special.weight, model.also_special.weight],
key_value_dict={'is_sparse': True, 'lr': 1e-5, 'betas': (0.9, 0.98)})
```
-Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`
+Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`.
-For overrides for particular layers we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
+For overrides for particular layers, we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
```py
class MyModule(torch.nn.Module):
  def __init__(self, din, dout):
diff --git a/docs/source/quickstart.mdx b/docs/source/quickstart.mdx
index 3a560ff6b..ed92c896b 100644
--- a/docs/source/quickstart.mdx
+++ b/docs/source/quickstart.mdx
@@ -4,9 +4,12 @@
... work in progress ...
-## Minimal example
+(Community contributions would be very welcome!)
+
+## Minimal examples
The following code illustrates the steps above.
```py
+# code examples will soon follow
```
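+
+In the meantime, here is a minimal sketch of what such an example could look like (illustrative only; it assumes a CUDA GPU and uses a toy model):
+
+```py
+import torch
+import bitsandbytes as bnb
+
+# Toy model with a stable embedding layer (recommended for NLP models)
+model = torch.nn.Sequential(
+    bnb.nn.StableEmbedding(1024, 64),
+    torch.nn.Linear(64, 64),
+).cuda()
+
+# Drop-in replacement for torch.optim.Adam
+optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
+
+tokens = torch.randint(0, 1024, (8, 16), device="cuda")
+loss = model(tokens).sum()
+loss.backward()
+optimizer.step()
+optimizer.zero_grad()
+```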