Merge pull request #72 from SeanLee97/feature/espresso
Feature/espresso
SeanLee97 authored May 21, 2024
2 parents 75e3ce9 + 7ad543d commit 08984f3
Showing 20 changed files with 1,139 additions and 513 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
Expand Down
292 changes: 137 additions & 155 deletions README.md

Large diffs are not rendered by default.

28 changes: 3 additions & 25 deletions README_2DMSE.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,38 +2,16 @@

> Paper: https://arxiv.org/abs/2402.14776
# Usage
"🪆 2D Matryoshka Sentence Embeddings" has been renamed to ☕️ "ESE: Espresso Sentence Embeddings".

**⚠️ This document is a work in progress!**


Example:

```bash
WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0 angle-trainer \
--model_name_or_path WhereIsAI/UAE-Large-V1 \
--train_name_or_path data.jsonl --save_dir ckpts/custom-UAE-2dmse \
--w2 20.0 --w1 1. --w3 1. --angle_tau 20.0 --learning_rate 1e-5 --maxlen 128 \
--workers 16 \
--pooling_strategy all \
--epochs 1 \
--batch_size 16 \
--apply_tdmse 1 \
--fixed_teacher_name_or_path WhereIsAI/UAE-Large-V1 \
--logging_steps 1000 \
--warmup_steps 100 \
--is_llm 0 \
--save_steps 1000 --seed -1 --gradient_accumulation_steps 6 --fp16 1
```

The `--apply_tdmse 1` is required.
Please find the documentation in [☕️ Espresso](README_ESE.md)


# Citation

```bibtex
@article{li20242d,
title={2D Matryoshka Sentence Embeddings},
title={ESE: Espresso Sentence Embeddings},
author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
journal={arXiv preprint arXiv:2402.14776},
year={2024}
Expand Down
49 changes: 49 additions & 0 deletions README_ESE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Espresso Sentence Embeddings (previously known as 2DMSE)

> Paper: https://arxiv.org/abs/2402.14776
## Abstract

High-quality sentence embeddings are fundamental in many natural language processing (NLP) tasks, such as semantic textual similarity (STS) and retrieval-augmented generation (RAG).
Nevertheless, most existing methods leverage fixed-length embeddings from full-layer language models, which lack the scalability to accommodate the diverse available resources across various applications.
To address this gap, we propose a novel sentence embedding model, Espresso Sentence Embeddings (ESE), with two learning processes.
First, the **learn-to-express** process encodes more salient representations to lower layers.
Second, the **learn-to-compress** process compacts essential features into the initial dimensions using Principal Component Analysis (PCA).
This way, ESE can scale model depth via the former process and embedding size via the latter.
Extensive experiments on STS and RAG suggest that ESE can effectively produce high-quality embeddings with less model depth and embedding size, enhancing embedding inference efficiency.
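The **learn-to-compress** idea can be illustrated with a small, standalone PCA sketch in plain numpy (this is not the authors' implementation; the shapes and data below are made up for the example):

```python
import numpy as np

def pca_compress(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Project embeddings onto their top-k principal components,
    so the most salient variance lands in the first k dimensions."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered matrix; rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Toy example: 8 "sentence embeddings" of size 16 compressed to 4 dims.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
compact = pca_compress(emb, 4)
```

After compression, the leading dimensions carry the most variance, which is what lets ESE-style embeddings be truncated with little quality loss.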

## How to train

To enable Espresso Sentence Embeddings (ESE), please specify `--apply_ese 1` and configure the ESE hyperparameters via `--ese_kl_temperature <float>` and `--ese_compression_size <int>`.

Here is a training example:

```bash
WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 -m angle_emb.angle_trainer \
--model_name_or_path WhereIsAI/UAE-Large-V1 \
--train_name_or_path SeanLee97/nli_for_simcse --save_dir ckpts/UAE-Large-Espresso \
--ibn_w 10.0 --cosine_w 0. --angle_w 1.0 --angle_tau 20.0 --learning_rate 1e-6 --maxlen 75 \
--workers 16 \
--pooling_strategy cls \
--epochs 1 \
--batch_size 128 \
--logging_steps 100 \
--warmup_steps 200 \
--save_steps 1000 \
--fp16 1 \
--gradient_accumulation_steps 4 \
--apply_ese 1 \
--ese_compression_size 128 \
--ese_kl_temperature 1.0
```
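At inference time, embeddings trained this way are typically shrunk by keeping only the first N dimensions and renormalizing. A minimal numpy sketch of that truncation step (model loading is omitted; `768` and `128` are illustrative sizes, with random vectors standing in for real embeddings):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, size: int) -> np.ndarray:
    """Keep only the first `size` dimensions and re-normalize,
    mirroring how ESE packs essential features up front."""
    small = vec[:size]
    return small / np.linalg.norm(small)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two full-size (768-d) sentence embeddings.
rng = np.random.default_rng(42)
full_a = rng.normal(size=768)
full_b = rng.normal(size=768)

# Compare similarity at full size vs. a compressed 128-d size.
sim_full = cosine(full_a, full_b)
sim_128 = cosine(truncate_embedding(full_a, 128),
                 truncate_embedding(full_b, 128))
```

With a trained ESE model, `--ese_compression_size 128` above is what encourages the first 128 dimensions to preserve similarity rankings under this kind of truncation.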

## Citation

```bibtex
@article{li20242d,
title={ESE: Espresso Sentence Embeddings},
author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
journal={arXiv preprint arXiv:2402.14776},
year={2024}
}
```