Commit

feat: update default reranker to RRF (lancedb#1580)
- Both LinearCombination (the current default) and RRF are fast
compared to model-based rerankers; RRF is slightly faster.
- In our tests, RRF has also been slightly more accurate.

This PR:
- Makes RRF the default reranker
- Removes duplicate docs for rerankers
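
For context, reciprocal rank fusion scores each result by summing `1 / (k + rank)` over every ranked list in which it appears, so it only needs ranks and not comparable scores. The following is a minimal standalone sketch of that idea, not LanceDB's `RRFReranker` implementation; the function name `rrf_fuse` and the constant `k = 60` (a commonly used value) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Fuse several ranked lists of ids with reciprocal rank fusion.

    Hypothetical helper for illustration; `k` is an assumed smoothing constant.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; the sort is stable, so ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a", "b", "c"]  # ids ranked by vector search
fts_hits = ["b", "d", "a"]     # ids ranked by full-text search
fused = rrf_fuse([vector_hits, fts_hits])
```

Ids that rank well in both lists (`"b"` and `"a"` here) rise to the top, while ids found by only one search keep a smaller, single-term score.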
AyushExel authored Sep 3, 2024
1 parent fde636c commit 03ef1dc
Showing 3 changed files with 12 additions and 190 deletions.
189 changes: 4 additions & 185 deletions docs/src/hybrid_search/hybrid_search.md
@@ -57,199 +57,18 @@ results = table.search(query_type="hybrid")

```

By default, LanceDB uses `LinearCombinationReranker(weight=0.7)` to combine and rerank the results of semantic and full-text search. You can customize the hyperparameters as needed or write your own custom reranker. Here's how you can use any of the available rerankers:
By default, LanceDB uses `RRFReranker()`, which uses reciprocal rank fusion score, to combine and rerank the results of semantic and full-text search. You can customize the hyperparameters as needed or write your own custom reranker. Here's how you can use any of the available rerankers:


### `rerank()` arguments
* `normalize`: `str`, default `"score"`:
The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.
* `reranker`: `Reranker`, default `LinearCombinationReranker(weight=0.7)`.
* `reranker`: `Reranker`, default `RRF()`.
The reranker to use. If not specified, the default reranker is used.


## Available Rerankers
LanceDB provides a number of re-rankers out of the box. You can use any of these re-rankers by passing them to the `rerank()` method. Here's a list of available re-rankers:
LanceDB provides a number of re-rankers out of the box. You can use any of these re-rankers by passing them to the `rerank()` method.
Go to [Rerankers](../reranking/index.md) to learn more about using the available rerankers and implementing custom rerankers.

### Linear Combination Reranker
This is the default re-ranker used by LanceDB. It combines the results of semantic and full-text search using a linear combination of the scores. The weights for the linear combination can be specified. It defaults to 0.7, i.e, 70% weight for semantic search and 30% weight for full-text search.


```python
from lancedb.rerankers import LinearCombinationReranker

reranker = LinearCombinationReranker(weight=0.3) # Use 0.3 as the weight for vector search

results = table.search("rebel", query_type="hybrid").rerank(reranker=reranker).to_pandas()
```

### Arguments
----------------
* `weight`: `float`, default `0.7`:
The weight to use for the semantic search score. The weight for the full-text search score is `1 - weights`.
* `fill`: `float`, default `1.0`:
The score to give to results that are only in one of the two result sets.This is treated as penalty, so a higher value means a lower score.
TODO: We should just hardcode this-- its pretty confusing as we invert scores to calculate final score
* `return_score` : str, default `"relevance"`
options are "relevance" or "all"
The type of score to return. If "relevance", will return only the `_relevance_score. If "all", will return all scores from the vector and FTS search along with the relevance score.

### Cohere Reranker
This re-ranker uses the [Cohere](https://cohere.ai/) API to combine the results of semantic and full-text search. You can use this re-ranker by passing `CohereReranker()` to the `rerank()` method. Note that you'll need to set the `COHERE_API_KEY` environment variable to use this re-ranker.

```python
from lancedb.rerankers import CohereReranker

reranker = CohereReranker()

results = table.search("vampire weekend", query_type="hybrid").rerank(reranker=reranker).to_pandas()
```

### Arguments
----------------
* `model_name` : str, default `"rerank-english-v2.0"`
The name of the cross encoder model to use. Available cohere models are:
- rerank-english-v2.0
- rerank-multilingual-v2.0
* `column` : str, default `"text"`
The name of the column to use as input to the cross encoder model.
* `top_n` : str, default `None`
The number of results to return. If None, will return all results.

!!! Note
Only returns `_relevance_score`. Does not support `return_score = "all"`.

### Cross Encoder Reranker
This reranker uses the [Sentence Transformers](https://www.sbert.net/) library to combine the results of semantic and full-text search. You can use it by passing `CrossEncoderReranker()` to the `rerank()` method.

```python
from lancedb.rerankers import CrossEncoderReranker

reranker = CrossEncoderReranker()

results = table.search("harmony hall", query_type="hybrid").rerank(reranker=reranker).to_pandas()
```


### Arguments
----------------
* `model` : str, default `"cross-encoder/ms-marco-TinyBERT-L-6"`
The name of the cross encoder model to use. Available cross encoder models can be found [here](https://www.sbert.net/docs/pretrained_cross-encoders.html)
* `column` : str, default `"text"`
The name of the column to use as input to the cross encoder model.
* `device` : str, default `None`
The device to use for the cross encoder model. If None, will use "cuda" if available, otherwise "cpu".

!!! Note
Only returns `_relevance_score`. Does not support `return_score = "all"`.


### ColBERT Reranker
This reranker uses the ColBERT model to combine the results of semantic and full-text search. You can use it by passing `ColbertrReranker()` to the `rerank()` method.

ColBERT reranker model calculates relevance of given docs against the query and don't take existing fts and vector search scores into account, so it currently only supports `return_score="relevance"`. By default, it looks for `text` column to rerank the results. But you can specify the column name to use as input to the cross encoder model as described below.

```python
from lancedb.rerankers import ColbertReranker

reranker = ColbertReranker()

results = table.search("harmony hall", query_type="hybrid").rerank(reranker=reranker).to_pandas()
```

### Arguments
----------------
* `model_name` : `str`, default `"colbert-ir/colbertv2.0"`
The name of the cross encoder model to use.
* `column` : `str`, default `"text"`
The name of the column to use as input to the cross encoder model.
* `return_score` : `str`, default `"relevance"`
options are `"relevance"` or `"all"`. Only `"relevance"` is supported for now.

!!! Note
Only returns `_relevance_score`. Does not support `return_score = "all"`.

### OpenAI Reranker
This reranker uses the OpenAI API to combine the results of semantic and full-text search. You can use it by passing `OpenaiReranker()` to the `rerank()` method.

!!! Note
This prompts chat model to rerank results which is not a dedicated reranker model. This should be treated as experimental.

!!! Tip
- You might run out of token limit so set the search `limits` based on your token limit.
- It is recommended to use gpt-4-turbo-preview, the default model, older models might lead to undesired behaviour

```python
from lancedb.rerankers import OpenaiReranker

reranker = OpenaiReranker()

results = table.search("harmony hall", query_type="hybrid").rerank(reranker=reranker).to_pandas()
```

### Arguments
----------------
* `model_name` : `str`, default `"gpt-4-turbo-preview"`
The name of the cross encoder model to use.
* `column` : `str`, default `"text"`
The name of the column to use as input to the cross encoder model.
* `return_score` : `str`, default `"relevance"`
options are "relevance" or "all". Only "relevance" is supported for now.
* `api_key` : `str`, default `None`
The API key to use. If None, will use the OPENAI_API_KEY environment variable.


## Building Custom Rerankers
You can build your own custom reranker by subclassing the `Reranker` class and implementing the `rerank_hybrid()` method. Here's an example of a custom reranker that combines the results of semantic and full-text search using a linear combination of the scores.

The `Reranker` base interface comes with a `merge_results()` method that can be used to combine the results of semantic and full-text search. This is a vanilla merging algorithm that simply concatenates the results and removes the duplicates without taking the scores into consideration. It only keeps the first copy of the row encountered. This works well in cases that don't require the scores of semantic and full-text search to combine the results. If you want to use the scores or want to support `return_score="all"`, you'll need to implement your own merging algorithm.

```python

from lancedb.rerankers import Reranker
import pyarrow as pa

class MyReranker(Reranker):
    def __init__(self, param1, param2, ..., return_score="relevance"):
        super().__init__(return_score)
        self.param1 = param1
        self.param2 = param2

    def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table):
        # Use the built-in merging function
        combined_result = self.merge_results(vector_results, fts_results)

        # Do something with the combined results
        # ...

        # Return the combined results
        return combined_result

```

### Example of a Custom Reranker
For the sake of simplicity let's build custom reranker that just enchances the Cohere Reranker by accepting a filter query, and accept other CohereReranker params as kwags.

```python

from typing import List, Union
import pandas as pd
from lancedb.rerankers import CohereReranker

class MofidifiedCohereReranker(CohereReranker):
    def __init__(self, filters: Union[str, List[str]], **kwargs):
        super().__init__(**kwargs)
        filters = filters if isinstance(filters, list) else [filters]
        self.filters = filters

    def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table)-> pa.Table:
        combined_result = super().rerank_hybrid(query, vector_results, fts_results)
        df = combined_result.to_pandas()
        for filter in self.filters:
            df = df.query("not text.str.contains(@filter)")

        return pa.Table.from_pandas(df)

```

!!! tip
The `vector_results` and `fts_results` are pyarrow tables. You can convert them to pandas dataframes using `to_pandas()` method and perform any operations you want. After you are done, you can convert the dataframe back to pyarrow table using `pa.Table.from_pandas()` method and return it.
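
The removed section describes the previous default as a weighted blend: `weight` for the semantic score and `1 - weight` for the full-text score. A simplified sketch of that blending is shown here. It assumes both inputs are already normalized to `[0, 1]` with higher meaning more relevant, and it gives results missing from one list a score of `0.0` through a hypothetical `missing` parameter. LanceDB's `LinearCombinationReranker` handles this case with its `fill` penalty and score inversion instead, so this is an illustration of the idea rather than the library's code.

```python
from typing import Dict


def linear_combination(
    vector_scores: Dict[str, float],
    fts_scores: Dict[str, float],
    weight: float = 0.7,
    missing: float = 0.0,
) -> Dict[str, float]:
    """Blend two score maps as weight * vector + (1 - weight) * fts.

    Hypothetical helper; assumes scores are normalized and higher is better.
    """
    combined = {}
    for doc_id in vector_scores.keys() | fts_scores.keys():
        v = vector_scores.get(doc_id, missing)  # absent from vector results
        f = fts_scores.get(doc_id, missing)     # absent from full-text results
        combined[doc_id] = weight * v + (1.0 - weight) * f
    return combined


scores = linear_combination({"a": 1.0, "b": 0.5}, {"b": 1.0, "c": 0.8})
ranked = sorted(scores, key=scores.get, reverse=True)
```

Unlike rank fusion, this approach depends on the two score scales being comparable, which is why the hybrid query builder normalizes scores before reranking.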
2 changes: 2 additions & 0 deletions docs/src/reranking/index.md
@@ -71,6 +71,8 @@ LanceDB comes with some built-in rerankers. Here are some of the rerankers that
- [OpenAI Reranker](./openai.md)
- [Linear Combination Reranker](./linear_combination.md)
- [Jina Reranker](./jina.md)
- [AnswerDotAI Rerankers](./answerdotai.md)
- [Reciprocal Rank Fusion Reranker](./rrf.md)

## Creating Custom Rerankers

11 changes: 6 additions & 5 deletions python/python/lancedb/query.py
@@ -35,7 +35,7 @@
from . import __version__
from .arrow import AsyncRecordBatchReader
from .rerankers.base import Reranker
from .rerankers.linear_combination import LinearCombinationReranker
from .rerankers.rrf import RRFReranker
from .util import safe_import_pandas

if TYPE_CHECKING:
@@ -916,7 +916,8 @@ class LanceHybridQueryBuilder(LanceQueryBuilder):
    """
    A query builder that performs hybrid vector and full text search.
    Results are combined and reranked based on the specified reranker.
    By default, the results are reranked using the LinearCombinationReranker.
    By default, the results are reranked using the RRFReranker, which
    uses reciprocal rank fusion score for reranking.
    To make the vector and fts results comparable, the scores are normalized.
    Instead of normalizing scores, the `normalize` parameter can be set to "rank"
@@ -935,7 +936,7 @@ def __init__(
        self._vector_column = vector_column
        self._fts_columns = fts_columns
        self._norm = "score"
        self._reranker = LinearCombinationReranker(weight=0.7, fill=1.0)
        self._reranker = RRFReranker()
        self._nprobes = None
        self._refine_factor = None

@@ -1066,7 +1067,7 @@ def _normalize_scores(self, results: pa.Table, column: str, invert=False):
    def rerank(
        self,
        normalize="score",
        reranker: Reranker = LinearCombinationReranker(weight=0.7, fill=1.0),
        reranker: Reranker = RRFReranker(),
    ) -> LanceHybridQueryBuilder:
        """
        Rerank the hybrid search results using the specified reranker. The reranker
@@ -1078,7 +1079,7 @@ def rerank(
            The method to normalize the scores. Can be "rank" or "score". If "rank",
            the scores are converted to ranks and then normalized. If "score", the
            scores are normalized directly.
        reranker: Reranker, default LinearCombinationReranker(weight=0.7, fill=1.0)
        reranker: Reranker, default RRFReranker()
            The reranker to use. Must be an instance of Reranker class.
        Returns
        -------
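
The docstring above distinguishes two values of `normalize`: `"score"` normalizes the raw scores directly, while `"rank"` first converts scores to ranks and then normalizes those. The difference can be sketched in plain Python as follows. The helper names `normalize_scores` and `scores_to_ranks`, and the choice of min-max scaling, are assumptions for illustration and are not taken from `query.py`.

```python
from typing import List


def normalize_scores(scores: List[float]) -> List[float]:
    """Min-max scale scores into [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def scores_to_ranks(scores: List[float]) -> List[float]:
    """Replace each score with its 1-based rank in ascending order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = float(rank)
    return ranks


raw = [0.2, 3.0, 0.9]
by_score = normalize_scores(raw)                  # analogous to normalize="score"
by_rank = normalize_scores(scores_to_ranks(raw))  # analogous to normalize="rank"
```

With these inputs, the middle value `0.9` lands at about 0.25 under direct normalization but at 0.5 under rank normalization, which shows how rank-based normalization reduces the influence of an outlying score such as `3.0`.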
