Add support for Task Arithmetics (#698)
This PR adds support for various task arithmetic options for LoRA. Until now, our library supported averaging only by linearly combining different adapters. This may be insufficient, especially for LoRA; hence, several publications have proposed other ways to perform task arithmetic. This PR:
- makes it easier to implement different weighting methods
- adds 2 additional merging methods for LoRA
- adds a method to merge heads
- provides docs & a notebook

Co-authored-by: calpt <[email protected]>
1 parent b6dda33 · commit 8ddbcc8 · 28 changed files with 1,384 additions and 149 deletions.
# Merging Adapters

The `adapters` library allows new adapters to be created by combining the parameters of multiple trained adapters, i.e. by merging several existing adapters into a new one. This enables efficient domain, language, and task transfer. Adapter merging is a form of task arithmetic ([Ilharco et al., 2023](https://arxiv.org/abs/2212.04089); [Zhang et al., 2023](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html)) and hence allows strengthening or unlearning specific skills, where unlearning is achieved by using negative weights.

The `average_adapter()` method provides this merging functionality:

```python
model.add_adapter("bottleneck_1", config="seq_bn")
model.add_adapter("bottleneck_2", config="seq_bn")
model.add_adapter("bottleneck_3", config="seq_bn")

model.average_adapter(
    adapter_name="avg",
    adapter_list=["bottleneck_1", "bottleneck_2", "bottleneck_3"],
    weights=[-1, 1.2, 0.8],
)
```

In this example, the parameters of the three added bottleneck adapters are merged (with weights `-1`, `1.2`, and `0.8`, respectively) to create a new adapter `avg`.
Note that for this to succeed, all averaged adapters must use the same adapter configuration. Compared to the [output averaging](adapter_composition.md#output-averaging) composition block, merging adapter parameters has the advantage of adding no inference-time overhead relative to using a single adapter.

All [adapter methods](model_overview.md#table-of-adapter-methods) support linear merging, in which the weights of the trained adapters are linearly combined: given *N* adapters, let $\Phi_i$ denote the parameters of the *i*-th adapter and $\lambda_i$ the corresponding weight that determines how strongly this adapter contributes. The merged adapter parameters $\Phi_{merged}$ are calculated as:

$$
\Phi_{merged} = \sum_{i=1}^{N} \lambda_i \Phi_i
$$
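
To make the arithmetic concrete, here is a minimal NumPy sketch of linear merging on toy parameter vectors. The tensors and weights are purely illustrative and are not the library's internal representation:

```python
import numpy as np

# Toy stand-ins for the (flattened) parameters of three trained adapters
phi = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
weights = [-1.0, 1.2, 0.8]  # lambda_i; the negative weight "unlearns" adapter 1

# Linear merging: Phi_merged = sum_i lambda_i * Phi_i
phi_merged = sum(w * p for w, p in zip(weights, phi))
print(phi_merged)  # [-0.4  1.4]
```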

The `average_adapter` method only merges the weights of the adapters but does not create a new head. To average the weights of heads, use the `average_head` method. Since heads are usually linear layers, the `average_head` method uses linear merging:

```python
model.add_masked_lm_head("head_1")
model.add_masked_lm_head("head_2")

model.average_head(head_name="avg_head", head_list=["head_1", "head_2"], weights=[0.2, 0.8])
```

#### Merging LoRA Adapters

LoRA introduces the low-rank matrices $A$ and $B$ with $\Delta W = BA$. Since the $B$ and $A$ matrices strongly depend on each other, there are several ways to merge the weights of LoRA adapters. You can choose the combination method by passing the `combine_strategy` parameter to the `average_adapter` method:

1. `combine_strategy = "linear"`: Linear combination (default). This has been proposed for LoRA by [Chronopoulou et al. (2023)](https://arxiv.org/abs/2311.09344). With $\Phi = \{A, B\}$:

   $$
   \Phi_{merged} = \sum_{i=1}^{N} \lambda_i \Phi_i
   $$

2. `combine_strategy = "lora_linear_only_negate_b"`: Following [Zhang et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html), this method applies negative weights only to the $B$ matrices, while the $A$ matrices are always combined with the absolute values of the weights:

   $$
   \begin{aligned}
   A_{merged} &= \sum_{i=1}^{N} |\lambda_i| A_i \\
   B_{merged} &= \sum_{i=1}^{N} \lambda_i B_i
   \end{aligned}
   $$
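
   As a quick, hypothetical NumPy illustration of the sign handling in this strategy (toy shapes, not the library's implementation):

   ```python
   import numpy as np

   rng = np.random.default_rng(0)
   # Toy LoRA factors for two adapters: delta_W_i = B_i @ A_i (rank 2)
   A = [rng.standard_normal((2, 6)) for _ in range(2)]
   B = [rng.standard_normal((4, 2)) for _ in range(2)]
   weights = [1.0, -0.5]  # lambda_i; the second adapter is "unlearned"

   # A matrices are combined with |lambda|, B matrices keep the sign of lambda
   A_merged = sum(abs(w) * a for w, a in zip(weights, A))
   B_merged = sum(w * b for w, b in zip(weights, B))
   delta_W_merged = B_merged @ A_merged  # merged low-rank update, shape (4, 6)
   ```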

3. `combine_strategy = "lora_delta_w_svd"`: This method merges the $\Delta W_i$ of the adapters and then performs a singular value decomposition (SVD) to recover the new *A* and *B* LoRA matrices (see the sketch after this list):
   1. For every adapter *i*, calculate $\Delta W_i = B_i \cdot A_i$
   2. Combine the updates: $\Delta W_{new} = \sum_{i=1}^{N} \lambda_i \cdot \Delta W_i$
   3. Perform SVD on $\Delta W_{new}$ and truncate to the desired rank to obtain $A_{new}$ and $B_{new}$
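
The following is a minimal NumPy sketch of this delta-W SVD idea under toy shapes; only the `svd_rank` name mirrors the library parameter shown below, everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy LoRA factors for three adapters: delta_W_i = B_i @ A_i
A = [rng.standard_normal((2, 6)) for _ in range(3)]  # rank-2 factors
B = [rng.standard_normal((4, 2)) for _ in range(3)]
weights = [1, -1, 1]
svd_rank = 2  # rank r of the merged LoRA adapter

# Steps 1 + 2: form and linearly combine the full low-rank updates
delta_W_new = sum(w * (b @ a) for w, a, b in zip(weights, A, B))

# Step 3: truncated SVD, refactored back into B_new @ A_new
U, S, Vt = np.linalg.svd(delta_W_new, full_matrices=False)
B_new = U[:, :svd_rank] * S[:svd_rank]  # shape (4, svd_rank)
A_new = Vt[:svd_rank, :]                # shape (svd_rank, 6)

# B_new @ A_new is the best rank-r approximation of delta_W_new
err = np.linalg.norm(delta_W_new - B_new @ A_new)
print(f"rank-{svd_rank} reconstruction error: {err:.3f}")
```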

`lora_delta_w_svd` is not supported by DeBERTa and GPT-2. Example usage of these LoRA-specific merging strategies:

```python
model.add_adapter("lora_1", config="lora")
model.add_adapter("lora_2", config="lora")
model.add_adapter("lora_3", config="lora")

model.average_adapter(
    adapter_name="lora_avg",
    adapter_list=["lora_1", "lora_2", "lora_3"],
    weights=[1, -1, 1],
    combine_strategy="lora_delta_w_svd",
    svd_rank=8,
)
# Note: "lora_delta_w_svd" requires the "svd_rank" parameter, which determines
# the rank r of the resulting LoRA adapter after SVD.
```

For both output and parameter averaging, the passed weights are normalized by default. To disable normalization, pass `normalize_weights=False`.
For more detailed examples and explanations, refer to our [Task Arithmetic notebook](https://github.com/adapter-hub/adapters/tree/main/notebooks/task_arithmetics_in_adapter.ipynb).
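
Assuming normalization rescales the passed weights to sum to one (an illustrative reading of the default behavior, not a guaranteed spec), its effect can be pictured as:

```python
weights = [1, 1, 2]
# Assumed normalization: divide each weight by the total
normalized = [w / sum(weights) for w in weights]
print(normalized)  # [0.25, 0.25, 0.5]
```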

```{eval-rst}
.. tip::
    Adding more adapter merging methods is easy: simply modify the ``average_adapter`` method. Most adapter methods use the default implementation, which only supports linear merging, in `model_mixin.py <https://github.com/adapter-hub/adapters/blob/main/src/adapters/model_mixin.py>`_. Others, like LoRA, overwrite this method to add new merging strategies such as ``lora_delta_w_svd``; have a look at `lora.py <https://github.com/adapter-hub/adapters/blob/main/src/adapters/methods/lora.py>`_.
```