Add support for Task Arithmetics (#698)
This PR adds support for various task arithmetic options for LoRA. Until
now, our library supported averaging only by linearly combining
different adapters. This may be insufficient, especially for LoRA;
hence, several publications have proposed other ways to perform task
arithmetic.

This PR:
- makes it easier to implement different weighting methods
- adds 2 additional merging methods for LoRA
- adds a method to merge heads
- provides documentation & a notebook

---------

Co-authored-by: calpt <[email protected]>
lenglaender and calpt authored Aug 2, 2024
1 parent b6dda33 commit 8ddbcc8
Showing 28 changed files with 1,384 additions and 149 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
examples/checkpoint-*

# Initially taken from Github's Python gitignore file

# Byte-compiled / optimized / DLL files
10 changes: 8 additions & 2 deletions README.md
@@ -36,9 +36,12 @@ A Unified Library for Parameter-Efficient and Modular Transfer Learning
[![GitHub](https://img.shields.io/github/license/adapter-hub/adapters.svg?color=blue)](https://github.com/adapter-hub/adapters/blob/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/adapters)](https://pypi.org/project/adapters/)

_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference.

_Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a myriad of features like full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters) or the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), allowing advanced research in parameter-efficient transfer learning for NLP tasks.

> **Note**: The _Adapters_ library has replaced the [`adapter-transformers`](https://github.com/adapter-hub/adapter-transformers-legacy) package. All previously trained adapters are compatible with the new library. For transitioning, please read: https://docs.adapterhub.ml/transitioning.html.

## Installation

@@ -57,6 +60,7 @@ cd adapters
pip install .
```


## Quick Tour

#### Load pre-trained adapters:
@@ -157,6 +161,8 @@ Currently, adapters integrates all architectures and methods listed below:
| Prompt Tuning | [Lester et al. (2021)](https://aclanthology.org/2021.emnlp-main.243/) | [Docs](https://docs.adapterhub.ml/methods.html#prompt-tuning) |
| QLoRA | [Dettmers et al. (2023)](https://arxiv.org/pdf/2305.14314.pdf) | [Notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb) |
| ReFT | [Wu et al. (2024)](https://arxiv.org/pdf/2404.03592) | [Docs](https://docs.adapterhub.ml/methods.html#reft) |
| Adapter Task Arithmetics | [Chronopoulou et al. (2023)](https://arxiv.org/abs/2311.09344) and [Zhang et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html) | [Docs](https://docs.adapterhub.ml/merging_adapters.html), [Notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/06_Task_Arithmetics.ipynb)|


## Supported Models

69 changes: 23 additions & 46 deletions docs/adapter_composition.md
@@ -40,17 +40,17 @@ The basic building blocks of the more advanced setups are objects derived from `AdapterCompositionBlock`,
each representing a different possibility to combine single adapters.
The following table gives an overview on the supported composition blocks and their support by different adapter methods.

| Block                                        | Bottleneck<br> Adapters | Prefix<br> Tuning | Compacter | LoRA  | (IA)³ | Prompt Tuning |
| -------------------------------------------- | ----------------------- | ----------------- | --------- | ----- | ----- | ------------- |
| [`Stack`](#stack)                            | ✅                       | ✅                 | ✅         | ✅(*)  | ✅(*)  |               |
| [`Fuse`](#fuse)                              | ✅                       |                   | ✅         |       |       |               |
| [`Split`](#split)                            | ✅                       |                   | ✅         |       |       |               |
| [`BatchSplit`](#batchsplit)                  | ✅                       | ✅                 | ✅         | ✅(*)  | ✅(*)  |               |
| [`Parallel`](#parallel)                      | ✅                       | ✅                 | ✅         | ✅(*)  | ✅(*)  |               |
| [Output averaging](#output-averaging)        | ✅                       |                   | ✅         | ✅(*)  | ✅(*)  |               |
| [Parameter averaging](#parameter-averaging)  | ✅                       | ✅                 | ✅         | ✅     | ✅     |               |

(*) except for Deberta and GPT-2.

Next, we present all composition blocks in more detail.

@@ -236,16 +236,12 @@ print("STS-B adapter output:", output1[0].item())
print("MRPC adapter output:", bool(torch.argmax(output2[0]).item()))
```

## Output averaging

Recent work on adapters has explored methods to ensemble models for better generalization.
This includes averaging output representations of adapters ([Wang et al., 2021](https://aclanthology.org/2021.findings-emnlp.63)) as well as averaging adapter parameters ([Wang et al., 2022](https://aclanthology.org/2022.emnlp-main.388/), [Chronopoulou et al., 2023](https://aclanthology.org/2023.findings-eacl.153.pdf)). _Adapters_ provides built-in support for both types of inference-time averaging methods. The output averaging composition block is described below and merging adapter parameters is explained in the [Merging Adapters](merging_adapters.md) documentation page.

Output averaging allows the dynamic aggregation of output representations of multiple adapters in a model forward pass via weighted averaging. This is realized via the `Average` composition block, which works similarly to other composition blocks.
In the example below, the three adapters are averaged with the weights `0.1` for `m`, `0.6` for `n` and `0.3` for `o`.

@@ -260,25 +256,6 @@
```python
model.add_adapter("m")
model.add_adapter("n")
model.add_adapter("o")
model.active_adapters = ac.Average("m", "n", "o", weights=[0.1, 0.6, 0.3])
```


## Nesting composition blocks

@@ -293,13 +270,13 @@ model.active_adapters = ac.Stack("a", ac.Split("b", "c", splits=60))

However, combinations of adapter composition blocks cannot be arbitrarily deep. All currently supported possibilities are visualized in the table below.

| Block | Supported Nesting |
| ------------------------------ | ------------------------------------------------- |
| [`Stack`](#stack) | [str, Fuse, Split, Parallel, BatchSplit, Average] |
| [`Fuse`](#fuse) | [str, Stack] |
| [`Split`](#split) | [str, Split, Stack, BatchSplit, Average] |
| [`Parallel`](#parallel) | [str, Stack, BatchSplit, Average] |
| [`BatchSplit`](#batchsplit) | [str, Stack, Split, BatchSplit, Average] |
| [`Average`](#output-averaging) | [str, Stack, Split, BatchSplit] |

In the table, `str` represents an adapter, e.g. adapter "a" in the nesting example above. Depending on the individual model, some nested compositions might not be possible.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -49,6 +49,7 @@ Currently, we support the PyTorch versions of all models as listed on the `Model
:caption: Advanced

adapter_composition
merging_adapters
prediction_heads
embeddings
extending
77 changes: 77 additions & 0 deletions docs/merging_adapters.md
@@ -0,0 +1,77 @@
# Merging Adapters

The _Adapters_ library allows new adapters to be created by combining the parameters of multiple trained adapters, i.e. merging multiple existing adapters into a new one. This enables efficient domain, language and task transfer. Adapter merging is a form of task arithmetic ([Ilharco et al., 2023](https://arxiv.org/abs/2212.04089); [Zhang et al., 2023](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html)) and hence allows strengthening or unlearning specific skills; unlearning is achieved by using negative weights.

The `average_adapter()` method provides this merging functionality:

```python
model.add_adapter("bottleneck_1", "seq_bn")
model.add_adapter("bottleneck_2", "seq_bn")
model.add_adapter("bottleneck_3", "seq_bn")

model.average_adapter(adapter_name="avg", adapter_list=["bottleneck_1", "bottleneck_2", "bottleneck_3"], weights=[-1, 1.2, 0.8])
```
In this example, the parameters of the three added bottleneck adapters are merged (with weights `-1`, `1.2` and `0.8`, respectively) to create a new adapter `avg`.
Note that for this to succeed, all averaged adapters must use the same adapter configuration. Compared to the [output averaging](adapter_composition.md#output-averaging) composition block, merging parameters of adapters has the advantage of not inducing any additional inference time relative to using a single adapter.

All [adapter methods](model_overview.md#table-of-adapter-methods) support linear merging. In linear merging, the weights of the trained adapters are linearly combined: given *N* adapters, let $\Phi_i$ denote the parameters of the *i*-th adapter and $\lambda_i$ the corresponding weight that determines how strongly this adapter contributes. The merged adapter parameters $\Phi_{merged}$ are calculated as:

$$
\Phi_{merged} = \sum_{i=1}^{N} \lambda_i \Phi_i
$$
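To make this computation concrete, here is a minimal sketch of linear merging over plain parameter dictionaries, assuming each adapter's parameters are given as a `dict` of tensors sharing the same keys (the helper `merge_linear` is hypothetical and not part of the library API):

```python
import torch

def merge_linear(adapter_params: list[dict[str, torch.Tensor]],
                 weights: list[float],
                 normalize: bool = True) -> dict[str, torch.Tensor]:
    # Optionally normalize the weights so they sum to 1, mirroring the
    # library's default normalize_weights=True behavior.
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    # Weighted sum of each parameter tensor across all adapters.
    return {
        name: sum(w * params[name] for w, params in zip(weights, adapter_params))
        for name in adapter_params[0]
    }
```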

The `average_adapter` method only merges the weights of the adapters but does not create a new head. To average the weights of heads, use the `average_head` method. Since heads are usually linear layers, the `average_head` method uses linear merging:

```python
model.add_masked_lm_head("head_1")
model.add_masked_lm_head("head_2")

model.average_head(head_name="avg_head", head_list=["head_1", "head_2"], weights=[0.2, 0.8])
```

#### Merging LoRA Adapters
LoRA introduces the matrices $A$ and $B$ with $\Delta W = BA$. Since the $B$ and $A$ matrices strongly depend on each other, there are several ways to merge the weights of LoRA adapters. You can choose the combination method by passing the `combine_strategy` parameter to the `average_adapter` method:

1. `combine_strategy = "linear"`: Linear combination (default). This has been proposed for LoRA by [Chronopoulou et al. (2023)](https://arxiv.org/abs/2311.09344). With $\Phi = \{A, B\}$:

$$
\Phi_{merged} = \sum_{i=1}^{N} \lambda_i \Phi_i
$$

2. `combine_strategy = "lora_linear_only_negate_b"`: Following [Zhang et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html), this method applies signed weights only to the $B$ matrices, while the $A$ matrices are combined with the absolute values of the weights:

$$
\begin{aligned}
A_{merged} &= \sum_{i=1}^{N} |\lambda_i| A_i \\
B_{merged} &= \sum_{i=1}^{N} \lambda_i B_i
\end{aligned}
$$

3. `combine_strategy = "lora_delta_w_svd"`: This method merges the $\Delta W_i$ of each adapter and then performs a singular value decomposition (SVD) to obtain the new $A$ and $B$ LoRA matrices:
    1. For every adapter *i*, calculate: $\Delta W_i = B_i \cdot A_i$
    2. Combine: $\Delta W_{new} = \sum_{i=1}^{N} \lambda_i \cdot \Delta W_i$
    3. Perform SVD on $\Delta W_{new}$ to obtain $A_{new}$ and $B_{new}$

`lora_delta_w_svd` is not supported by Deberta and GPT-2. Example usage of these LoRA-specific merging strategies:

```python
model.add_adapter("lora_1", "seq_bn")
model.add_adapter("lora_2", "seq_bn")
model.add_adapter("lora_3", "seq_bn")

model.average_adapter(
adapter_name="lora_avg",
adapter_list=["lora_1", "lora_2", "lora_3"],
weights=[1, -1, 1],
combine_strategy="lora_delta_w_svd",
svd_rank=8
)
# Note that "lora_delta_w_svd" requires the "svd_rank" parameter, which determines the r (rank) of the resulting LoRA adapter after SVD
```
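For intuition, the following is a minimal standalone sketch of the two LoRA-specific strategies on plain tensors, with $B_i \in \mathbb{R}^{d \times r}$ and $A_i \in \mathbb{R}^{r \times d}$; the helper names are hypothetical and this is not the library's internal implementation:

```python
import torch

def merge_only_negate_b(As, Bs, weights):
    # "lora_linear_only_negate_b": absolute weights for the A matrices,
    # signed weights for the B matrices.
    A_merged = sum(abs(w) * A for w, A in zip(weights, As))
    B_merged = sum(w * B for w, B in zip(weights, Bs))
    return A_merged, B_merged

def merge_delta_w_svd(As, Bs, weights, svd_rank):
    # "lora_delta_w_svd": combine the full delta-W matrices, then
    # re-factorize the result into rank-limited A and B via SVD.
    delta_w = sum(w * (B @ A) for w, A, B in zip(weights, As, Bs))
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    B_new = U[:, :svd_rank] * S[:svd_rank]  # shape (d, svd_rank)
    A_new = Vh[:svd_rank, :]                # shape (svd_rank, d)
    return A_new, B_new
```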

For both output and parameter averaging, passed weights are normalized by default. To disable normalization, pass `normalize_weights=False`.
For more detailed examples and explanations, refer to our [Task Arithmetic notebook](https://github.com/adapter-hub/adapters/tree/main/notebooks/task_arithmetics_in_adapter.ipynb).


```{eval-rst}
.. tip::
    Adding more adapter merging methods is easy: simply modify the ``average_adapter`` method. Most adapter methods use the default implementation in `model_mixin.py <https://github.com/adapter-hub/adapters/blob/main/src/adapters/model_mixin.py>`_, which only supports linear merging. Others, like LoRA, override this method to add merging strategies such as ``lora_delta_w_svd``; have a look at `lora.py <https://github.com/adapter-hub/adapters/blob/main/src/adapters/methods/lora.py>`_.
```
