From 725d29af6c4118ba0ea7557bc960a2a1ea0c0f5f Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Thu, 1 Feb 2024 15:36:50 -0800
Subject: [PATCH] drafting + refactoring new docs

---
 README.md                    |   3 -
 docs/source/_toctree.yml     |  16 ++++
 docs/source/contributing.mdx |   6 ++
 docs/source/faqs.mdx         |   7 ++
 docs/source/integrations.mdx |   3 +
 docs/source/introduction.mdx |  58 +-----------
 docs/source/moduletree.mdx   |   5 ++
 docs/source/optimizers.mdx   | 166 ++++++++++++++++++++---------------
 docs/source/qlora.mdx        |   1 +
 docs/source/quantization.mdx |   6 +-
 docs/source/resources.mdx    |  90 +++++++++++++++++++
 howto_config_override.md     |  40 ---------
 12 files changed, 231 insertions(+), 170 deletions(-)
 create mode 100644 docs/source/contributing.mdx
 create mode 100644 docs/source/faqs.mdx
 create mode 100644 docs/source/moduletree.mdx
 create mode 100644 docs/source/qlora.mdx
 create mode 100644 docs/source/resources.mdx
 delete mode 100644 howto_config_override.md

diff --git a/README.md b/README.md
index 61dede8c1..35a03dbcb 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,7 @@ The bitsandbytes is a lightweight wrapper around CUDA custom functions, in parti
-Resources:
-- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861)
-- [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE)
-- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)
-- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339)
-- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration)
-- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)
 ## TL;DR
 **Requirements**
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 8f63a6339..b1a957c6c 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -6,6 +6,8 @@ title: Quickstart
   - local: installation
     title: Installation
+  - local: moduletree
+    title: Module Tree
 - title: Features & Integrations
   sections:
   - local: quantization
@@ -14,3 +16,17 @@ title: Optimizers
   - local: integrations
     title: Integrations
+  - local: qlora
+    title: QLoRA
+- title: Support & Learning
+  sections:
+  - local: resources
+    title: Papers, related resources & how to cite
+  - local: faqs
+    title: FAQs (Frequently Asked Questions)
+- title: Contributors Guidelines
+  sections:
+  - local: contributing
+    title: Contributing
+  # - local: code_of_conduct
+  #   title: Code of Conduct
diff --git a/docs/source/contributing.mdx b/docs/source/contributing.mdx
new file mode 100644
index 000000000..45bb72ce9
--- /dev/null
+++ b/docs/source/contributing.mdx
@@ -0,0 +1,6 @@
+# Contributors guidelines
+... still under construction ... (feel free to propose materials, `bitsandbytes` is a community project)
+
+## Documentation
+- [guideline for documentation syntax](https://github.com/huggingface/doc-builder#readme)
+- images should be uploaded via PR to the `bitsandbytes/` directory [here](https://huggingface.co/datasets/huggingface/documentation-images)
diff --git a/docs/source/faqs.mdx b/docs/source/faqs.mdx
new file mode 100644
index 000000000..b9549e9d8
--- /dev/null
+++ b/docs/source/faqs.mdx
@@ -0,0 +1,7 @@
+# FAQs
+
+Please submit your questions in [this GitHub Discussion thread](https://github.com/TimDettmers/bitsandbytes/discussions/1013) if you feel that they are likely to affect many other users and that they haven't been sufficiently covered in the documentation.
+ +We'll pick the most generally applicable ones and post the QAs here or integrate them into the general documentation (also feel free to submit doc PRs, please). + +# ... under construction ... diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx index a12dd31ef..25deb839b 100644 --- a/docs/source/integrations.mdx +++ b/docs/source/integrations.mdx @@ -1,5 +1,8 @@ # Transformers +... TODO: to be filled out ... # PEFT +... TODO: to be filled out ... # Trainer for the optimizers +... TODO: to be filled out ... diff --git a/docs/source/introduction.mdx b/docs/source/introduction.mdx index 7506992bc..b7bf499b9 100644 --- a/docs/source/introduction.mdx +++ b/docs/source/introduction.mdx @@ -1,39 +1,11 @@ +TODO: Many parts of this doc will still be redistributed among the new doc structure. + # `bitsandbytes` The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions. There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well. -# Resources: -- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) -- [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/) - -- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/) - -## TL;DR -**Requirements** -Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0. - -(Deprecated: CUDA 10.0 is deprecated and only CUDA >= 11.0) will be supported with release 0.39.0) - -**Installation**: - -``pip install bitsandbytes`` - -In some cases it can happen that you need to compile from source. If this happens please consider submitting a bug report with `python -m bitsandbytes` information. What now follows is some short instructions which might work out of the box if `nvcc` is installed. If these do not work see further below. - -Compilation quickstart: -```bash -git clone https://github.com/timdettmers/bitsandbytes.git -cd bitsandbytes - -# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 120} -# make argument in {cuda110, cuda11x, cuda12x} -# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes -CUDA_VERSION=117 make cuda11x -python setup.py install -``` - -**Using Int8 inference with HuggingFace Transformers** ```python from transformers import AutoModelForCausalLM @@ -89,9 +61,6 @@ The bitsandbytes library is currently only supported on Linux distributions. Win The requirements can best be fulfilled by installing pytorch via anaconda. You can install PyTorch by following the ["Get Started"](https://pytorch.org/get-started/locally/) instructions on the official website. -To install run: - -``pip install bitsandbytes`` ## Using bitsandbytes @@ -166,26 +135,3 @@ For more detailed instruction, please follow the [compile_from_source.md](compil The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license. We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization. 
-
-## How to cite us
-If you found this library and found LLM.int8() useful, please consider citing our work:
-
-```bibtex
-@article{dettmers2022llmint8,
-  title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
-  author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
-  journal={arXiv preprint arXiv:2208.07339},
-  year={2022}
-}
-```
-
-For 8-bit optimizers or quantization routines, please consider citing the following work:
-
-```bibtex
-@article{dettmers2022optimizers,
-  title={8-bit Optimizers via Block-wise Quantization},
-  author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
-  journal={9th International Conference on Learning Representations, ICLR},
-  year={2022}
-}
-```
diff --git a/docs/source/moduletree.mdx b/docs/source/moduletree.mdx
new file mode 100644
index 000000000..2bd10a4a6
--- /dev/null
+++ b/docs/source/moduletree.mdx
@@ -0,0 +1,5 @@
+# Module tree overview
+
+- **bitsandbytes.functional**: Contains quantization functions and stateless 8-bit optimizer update functions.
+- **bitsandbytes.nn.modules**: Contains the stable embedding layer with automatic 32-bit optimizer overrides (important for NLP stability).
+- **bitsandbytes.optim**: Contains 8-bit optimizers.
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index 1ac80b593..a71478adc 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -1,103 +1,129 @@
-Here we provide a short description and usage examples for each optimizer in `bitsandbytes.optim. We'll start by explaining the core optimizer class `Optimizer8bit`, followed by the specific implementations `Adagrad`, `Adagrad8bit` and `Adagrad32bit`.
+# Introduction: 8-bit optimizers
+With 8-bit optimizers, larger models can be finetuned within the same GPU memory budget as standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:
-Each of these optimizers can be utilized depending on the specific requirements of the task at hand, such as memory constraints, computational efficiency and the need for precision.
+- Faster (e.g. 4x faster than regular Adam)
+- 75% less memory, same performance
+- No hyperparameter tuning needed
-# Optimizer base class
+8-bit optimizers are most useful for finetuning large models that previously did not fit into memory. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams at Facebook. This optimizer saves a ton of memory with no accuracy hit.
-## `Optimizer8bit`
+TODO: link an overview of the biggest models that can be finetuned with 8-bit optimizers.
-The `Optimizer8bit` class serves as a base class for all 8-bit optimizers, providing common functionalities required for quantized optimization. The class is designed to support both 32-bit and 8-bit computations, where 8-bit optimizations can significantly reduce memory footprint and increase computation speed.
+We feature 8-bit Adam/AdamW, SGD momentum, LARS, LAMB, and RMSProp.
-### Usage:
+It only requires a two-line code change to get started.
+```python
+import bitsandbytes as bnb
-```python
-import torch
-from bitsandbytes.optim import Optimizer8bit
-
-model = YourModel()
-params = model.parameters()
-
-# Initialize the optimizer with your model's parameters
-optimizer = Optimizer8bit(params, defaults={
-    'lr': 0.001,
-    'betas': (0.9, 0.999),
-    'eps': 1e-08,
-    'weight_decay': 0
-}, optim_bits=8)  # Use optim_bits=32 for 32-bit optimization
-
-# In your training loop
-optimizer.zero_grad()
-loss = compute_loss()  # Implement your loss computation
-loss.backward()
-optimizer.step()
+
+# before: adam = torch.optim.Adam(...)
+adam = bnb.optim.Adam8bit(...)
+
+# recommended for NLP models
+# before: torch.nn.Embedding(...)
+bnb.nn.StableEmbedding(...)
 ```
-# Adagrad implementations
+The arguments passed are the same as for standard Adam. For NLP models we also recommend using the `StableEmbedding` layer, which improves results and helps with stable 8-bit optimization.
-## `Adagrad`
+
+## Overview of expected gradients
-The `Adagrad` class is an implementation of the Adagrad optimizer, which adapts the learning rate for each parameter based on the historical gradient information. This version allows for both 32-bit and 8-bit representations, with specific classes for each.
+
+TODO: add pics here, no idea how to do that
-### `Adagrad` Usage:
+Want to add both pics in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/bitsandbytes
-```python
-import torch
-from bitsandbytes.optim import Adagrad
+
+# Research Background
-model = YourModel()
-params = model.parameters()
+
+Stateful optimizers maintain gradient statistics over time, e.g. the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. `bitsandbytes` optimizers use 8-bit statistics, while maintaining the performance levels of 32-bit optimizer states.
-# Initialize the optimizer with your model's parameters
-optimizer = Adagrad(params, lr=0.01)
+
+To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
+1) **Block-wise quantization**, which divides input tensors into smaller blocks that are independently quantized, thereby isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization.
+2) **Dynamic quantization**, which quantizes both small and large values with high precision.
+3) A **stable embedding layer**, which improves stability during optimization for models with word embeddings.
-# In your training loop
-optimizer.zero_grad()
-loss = compute_loss()  # Implement your loss computation
-loss.backward()
-optimizer.step()
-```
+
+With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.
-## `Adagrad8bit`
+
+We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers.
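+The sketch below is purely illustrative and is not the library's implementation: the real update runs as a fused CUDA kernel and uses a dynamic, non-linear quantization map rather than the simple linear absmax scheme shown here. It only makes the block-wise round trip described above concrete; the helper names `blockwise_quantize` and `blockwise_dequantize` are hypothetical.
+
+```python
+import torch
+
+def blockwise_quantize(state: torch.Tensor, block_size: int = 2048):
+    """Quantize a 32-bit state tensor to int8 with one absmax scale per block."""
+    flat = state.flatten().float()
+    pad = (-flat.numel()) % block_size                     # pad to a whole number of blocks
+    flat = torch.nn.functional.pad(flat, (0, pad))
+    blocks = flat.view(-1, block_size)
+    absmax = blocks.abs().max(dim=1, keepdim=True).values.clamp_min(1e-12)
+    q = torch.clamp((blocks / absmax * 127).round(), -127, 127).to(torch.int8)
+    return q, absmax
+
+def blockwise_dequantize(q: torch.Tensor, absmax: torch.Tensor, numel: int, shape):
+    """Recover an approximate 32-bit state tensor from the int8 blocks."""
+    return (q.float() / 127 * absmax).flatten()[:numel].view(shape)
+
+# conceptual optimizer step: dequantize -> update in 32-bit -> quantize back for storage
+numel, shape = 10_000, (10_000,)
+exp_avg_q, absmax = blockwise_quantize(torch.zeros(shape))  # stored 8-bit first moment
+grad = torch.randn(shape)
+exp_avg = blockwise_dequantize(exp_avg_q, absmax, numel, shape)
+exp_avg = 0.9 * exp_avg + 0.1 * grad                        # Adam-style first-moment update
+exp_avg_q, absmax = blockwise_quantize(exp_avg)             # back to 8-bit storage
+```
+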
-The `Adagrad8bit` class is specifically tailored for 8-bit optimization, inheriting from `Optimizer1State`. It is designed for models where memory efficiency is crucial and it operates with reduced precision to save memory and increase computation speed.
+For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861).
-### `Adagrad8bit` Usage:
+
+## Stable Embedding Layer
-```python
-import torch
-from bitsandbytes.optim import Adagrad8bit
+
+The Stable Embedding Layer enhances the standard word embedding layer for improved training stability in NLP tasks. It addresses the challenge of non-uniform input distributions and mitigates extreme gradient variations, ensuring smoother training.
+
+### Features:
+
+- **Initialization**: Utilizes Xavier uniform initialization to maintain consistent variance, reducing the likelihood of large gradients.
+- **Normalization**: Incorporates layer normalization before adding positional embeddings, aiding in output stability.
+- **Optimizer States**: Employs 32-bit optimizer states exclusively for this layer to enhance stability, while the rest of the model may use standard 16-bit precision.
+
+### Benefits:
+
+- Designed to support more aggressive quantization strategies without compromising training stability.
+- Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.
-model = YourModel()
+
+# Usage
-# Initialize the optimizer with your model's parameters
-optimizer = Adagrad8bit(params, lr=0.01)
+
+Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
-# In your training loop
-optimizer.zero_grad()
-loss = compute_loss()  # Implement your loss computation
-loss.backward()
-optimizer.step()
 ```
+import bitsandbytes as bnb
-## Adagrad32bit
+
+# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995))  # comment out old optimizer
+adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995))  # add bnb optimizer
+adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8)  # equivalent
-The `Adagrad32bit` class is similar to `Adagrad` but ensures that all computations are carried out with 32-bit precision. This class is preferable when numerical precision is more critical than memory efficiency.
+
+# use 32-bit Adam with 5th percentile clipping
+adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
+                      optim_bits=32, percentile_clipping=5)
+```
+
+# How to override config hyperparameters for particular weights/parameters
+
+If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:
+
+1) Register the parameters while they are still on the CPU,
+2) override the config with the new desired hyperparameters (anytime, anywhere).
-### Adagrad32bit Usage:
+
+For global overrides in many different places in your code you can do:
 ```python
 import torch
-from bitsandbytes.optim import Adagrad32bit
+import bitsandbytes as bnb
-model = YourModel()
-params = model.parameters()
+
+mng = bnb.optim.GlobalOptimManager.get_instance()
-# Initialize the optimizer with your model's parameters
-optimizer = Adagrad32bit(params, lr=0.01)
+
+model = MyModel()
+mng.register_parameters(model.parameters())  # 1. register parameters while still on CPU
-# In your training loop
-optimizer.zero_grad()
-loss = compute_loss()  # Implement your loss computation
-loss.backward()
-optimizer.step()
+
+model = model.cuda()
+# use 8-bit optimizer states for all parameters
+adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)
+
+# 2a. override: the parameter model.fc1.weight now uses 32-bit Adam
+mng.override_config(model.fc1.weight, 'optim_bits', 32)
+
+# 2b. override: the two special layers use
+# sparse optimization + different learning rate + different Adam betas
+mng.override_config([model.special.weight, model.also_special.weight],
+                    key_value_dict={'is_sparse': True, 'lr': 1e-5, 'betas': (0.9, 0.98)})
 ```
+
+Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`
+
+For overrides of particular layers, we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
+```python
+class MyModule(torch.nn.Module):
+    def __init__(self, din, dout):
+        super(MyModule, self).__init__()
+        self.linear = torch.nn.Linear(din, dout)
+        # optimization will happen in 32-bit and
+        # learning rate will be set to 0.0001 independent of the main learning rate
+        config = {'optim_bits': 32, 'lr': 0.0001}
+        bnb.optim.GlobalOptimManager.get_instance().register_module_override(self, 'weight', config)
+
+```
+
+# API Docs
+
+... under construction ...
+
+Here we'll provide auto-generated API docs soon. Please feel free to contribute docstrings for the respective optimizers, as `bitsandbytes` is a community effort.
diff --git a/docs/source/qlora.mdx b/docs/source/qlora.mdx
new file mode 100644
index 000000000..3eb24a5e9
--- /dev/null
+++ b/docs/source/qlora.mdx
@@ -0,0 +1 @@
+# ... under construction ... (contributions welcome)
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index b4bb9d17d..c020df642 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -1 +1,5 @@
-Linear8bitLt & Linear4bit
+# Linear8bitLt
+... TODO: to be filled out ...
+
+# Linear4bit
+... TODO: to be filled out ...
diff --git a/docs/source/resources.mdx b/docs/source/resources.mdx
new file mode 100644
index 000000000..cafaf189b
--- /dev/null
+++ b/docs/source/resources.mdx
@@ -0,0 +1,90 @@
+# Papers, related resources & how to cite
+
+The academic work below is ordered in reverse chronological order.
+
+## [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Jun 2023)](https://arxiv.org/abs/2306.03078)
+Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
+
+- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1666076553665744896)
+
+```
+@article{dettmers2023spqr,
+  title={SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression},
+  author={Dettmers, Tim and Svirschevski, Ruslan and Egiazarian, Vage and Kuznedelev, Denis and Frantar, Elias and Ashkboos, Saleh and Borzunov, Alexander and Hoefler, Torsten and Alistarh, Dan},
+  journal={arXiv preprint arXiv:2306.03078},
+  year={2023}
+}
+```
+
+## [QLoRA: Efficient Finetuning of Quantized LLMs (May 2023)](https://arxiv.org/abs/2305.14314)
+Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
+
+- [Video](https://www.youtube.com/watch?v=y9PHWGOa8HA&ab_channel=LondonMachineLearningMeetup)
+- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1661379354507476994)
+
+```
+@article{dettmers2023qlora,
+  title={Qlora: Efficient finetuning of quantized llms},
+  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
+  journal={arXiv preprint arXiv:2305.14314},
+  year={2023}
+}
+```
+
+## [The case for 4-bit precision: k-bit Inference Scaling Laws (Dec 2022)](https://arxiv.org/abs/2212.09720)
+Authors: Tim Dettmers, Luke Zettlemoyer
+
+- [Video](https://www.youtube.com/watch?v=odlQa6AE1gY&ab_channel=TheInsideView)
+- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1605209171758284805)
+
+```
+@inproceedings{dettmers2023case,
+  title={The case for 4-bit precision: k-bit inference scaling laws},
+  author={Dettmers, Tim and Zettlemoyer, Luke},
+  booktitle={International Conference on Machine Learning},
+  pages={7750--7774},
+  year={2023},
+  organization={PMLR}
+}
+```
+
+## [LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Nov 2022)](https://arxiv.org/abs/2208.07339)
+Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer
+
+- [LLM.int8() Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration)
+- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)
+- [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)
+- [Poster](https://twitter.com/Tim_Dettmers/status/1598351301942951937)
+
+```
+@article{dettmers2022llm,
+  title={Llm.
int8 (): 8-bit matrix multiplication for transformers at scale}, + author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke}, + journal={arXiv preprint arXiv:2208.07339}, + year={2022} +} +``` + +## [8-bit Optimizers via Block-wise Quantization (Oct 2021)](https://arxiv.org/abs/2110.02861) +Authors: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer + +- [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) +- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1446472128979562499) + +``` +@article{DBLP:journals/corr/abs-2110-02861, + author = {Tim Dettmers and + Mike Lewis and + Sam Shleifer and + Luke Zettlemoyer}, + title = {8-bit Optimizers via Block-wise Quantization}, + journal = {CoRR}, + volume = {abs/2110.02861}, + year = {2021}, + url = {https://arxiv.org/abs/2110.02861}, + eprinttype = {arXiv}, + eprint = {2110.02861}, + timestamp = {Thu, 21 Oct 2021 16:20:08 +0200}, + biburl = {https://dblp.org/rec/journals/corr/abs-2110-02861.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} +``` diff --git a/howto_config_override.md b/howto_config_override.md deleted file mode 100644 index 55b24e3ab..000000000 --- a/howto_config_override.md +++ /dev/null @@ -1,40 +0,0 @@ -# How to override config hyperparameters for particular weights/parameters - -If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things: (1) register the parameter while they are still on the CPU, (2) override the config with the new desired hyperparameters (anytime, anywhere). See our [guide](howto_config_override.md) for more details - -For global overrides in many different places in your code you can do: -```python -import torch -import bitsandbytes as bnb - -mng = bnb.optim.GlobalOptimManager.get_instance() - -model = MyModel() -mng.register_parameters(model.parameters()) # 1. register parameters while still on CPU - -model = model.cuda() -# use 8-bit optimizer states for all parameters -adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8) - -# 2a. override: the parameter model.fc1.weight now uses 32-bit Adam -mng.override_config(model.fc1.weight, 'optim_bits', 32) - -# 2b. override: the two special layers use -# sparse optimization + different learning rate + different Adam betas -mng.override_config([model.special.weight, model.also_special.weight], - key_value_dict ={'is_sparse': True, 'lr': 1e-5, 'betas'=(0.9, 0.98)}) -``` -Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm` - -For overrides for particular layers we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager: -```python -class MyModule(torch.nn.Module): - def __init__(din, dout): - super(MyModule, self).__init__() - self.linear = torch.nn.Linear(din, dout) - # optimization will happen in 32-bit and - # learning rate will be set to 0.0001 independent of the main learning rate - config = {'optim_bits': 32, 'lr' : 0.0001} - GlobalOptimManager.get_instance().register_module_override(self, 'weight', config) - -```