feat: update default reranker to RRF (lancedb#1580)

- Both LinearCombination (the current default) and RRF are pretty fast compared to model-based rerankers. RRF is slightly faster.
- In our tests RRF has also been slightly more accurate.

This PR:
- Makes RRF the default reranker
- Removes duplicate docs for rerankers
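
For readers unfamiliar with reciprocal rank fusion, here is a minimal sketch of the general technique in plain Python. It is not LanceDB's implementation: the function name `rrf_fuse`, the sample ids, and the constant `k=60` (a commonly used value) are assumptions for illustration. Each document scores the sum of `1 / (k + rank)` over every ranked list it appears in, so only rank positions matter and the raw vector and full-text scores never need to be put on a common scale.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal rank fusion.

    `rankings` is a list of lists, each ordered best-first.
    `k` dampens the influence of top ranks; 60 is an assumed, commonly used value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result ids from a vector search and a full-text search.
vector_hits = ["a", "b", "c"]
fts_hits = ["b", "d", "a"]

fused = rrf_fuse([vector_hits, fts_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists ("a" and "b") end up ahead of those found by only one search, which is the behaviour a hybrid query typically wants from its default reranker.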
Epicism · Sep 3, 2024 · 03ef1dc
1 parent fde636c
commit 03ef1dc

Showing 3 changed files with 12 additions and 190 deletions.
diff --git a/docs/src/hybrid_search/hybrid_search.md b/docs/src/hybrid_search/hybrid_search.md
@@ -57,199 +57,18 @@ results = table.search(query_type="hybrid")
 
 ```
 
-By default, LanceDB uses `LinearCombinationReranker(weight=0.7)` to combine and rerank the results of semantic and full-text search. You can customize the hyperparameters as needed or write your own custom reranker. Here's how you can use any of the available rerankers:
+By default, LanceDB uses `RRFReranker()`, which uses reciprocal rank fusion score, to combine and rerank the results of semantic and full-text search. You can customize the hyperparameters as needed or write your own custom reranker. Here's how you can use any of the available rerankers:
 
 
 ### `rerank()` arguments
 * `normalize`: `str`, default `"score"`:
     The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.
-* `reranker`: `Reranker`, default `LinearCombinationReranker(weight=0.7)`.
+* `reranker`: `Reranker`, default `RRF()`.
     The reranker to use. If not specified, the default reranker is used.
 
 
 ## Available Rerankers
-LanceDB provides a number of re-rankers out of the box. You can use any of these re-rankers by passing them to the `rerank()` method. Here's a list of available re-rankers:
+LanceDB provides a number of re-rankers out of the box. You can use any of these re-rankers by passing them to the `rerank()` method. 
+Go to [Rerankers](../reranking/index.md) to learn more about using the available rerankers and implementing custom rerankers.
 
-### Linear Combination Reranker
-This is the default re-ranker used by LanceDB. It combines the results of semantic and full-text search using a linear combination of the scores. The weights for the linear combination can be specified. It defaults to 0.7, i.e, 70% weight for semantic search and 30% weight for full-text search.
 
-
-```python
-from lancedb.rerankers import LinearCombinationReranker
-
-reranker = LinearCombinationReranker(weight=0.3) # Use 0.3 as the weight for vector search
-
-results = table.search("rebel", query_type="hybrid").rerank(reranker=reranker).to_pandas()
-```
-
-### Arguments
-----------------
-* `weight`: `float`, default `0.7`:
-    The weight to use for the semantic search score. The weight for the full-text search score is `1 - weights`.
-* `fill`: `float`, default `1.0`:
-        The score to give to results that are only in one of the two result sets.This is treated as penalty, so a higher value means a lower score.
-        TODO: We should just hardcode this-- its pretty confusing as we invert scores to calculate final score
-* `return_score` : str, default `"relevance"`
-        options are "relevance" or "all"
-        The type of score to return. If "relevance", will return only the `_relevance_score. If "all", will return all scores from the vector and FTS search along with the relevance score.
-
-### Cohere Reranker
-This re-ranker uses the [Cohere](https://cohere.ai/) API to combine the results of semantic and full-text search. You can use this re-ranker by passing `CohereReranker()` to the `rerank()` method. Note that you'll need to set the `COHERE_API_KEY` environment variable to use this re-ranker.
-
-```python
-from lancedb.rerankers import CohereReranker
-
-reranker = CohereReranker()
-
-results = table.search("vampire weekend", query_type="hybrid").rerank(reranker=reranker).to_pandas()
-```
-
-### Arguments
-----------------
-* `model_name` : str, default `"rerank-english-v2.0"`
-        The name of the cross encoder model to use. Available cohere models are:
-        - rerank-english-v2.0
-        - rerank-multilingual-v2.0
-* `column` : str, default `"text"`
-        The name of the column to use as input to the cross encoder model.
-* `top_n` : str, default `None`
-        The number of results to return. If None, will return all results.
-
-!!! Note
-    Only returns `_relevance_score`. Does not support `return_score = "all"`.
-
-### Cross Encoder Reranker
-This reranker uses the [Sentence Transformers](https://www.sbert.net/) library to combine the results of semantic and full-text search. You can use it by passing `CrossEncoderReranker()` to the `rerank()` method.
-
-```python
-from lancedb.rerankers import CrossEncoderReranker
-
-reranker = CrossEncoderReranker()
-
-results = table.search("harmony hall", query_type="hybrid").rerank(reranker=reranker).to_pandas()
-```
-
-
-### Arguments
-----------------
-* `model` : str, default `"cross-encoder/ms-marco-TinyBERT-L-6"`
-        The name of the cross encoder model to use. Available cross encoder models can be found [here](https://www.sbert.net/docs/pretrained_cross-encoders.html)
-* `column` : str, default `"text"`
-        The name of the column to use as input to the cross encoder model.
-* `device` : str, default `None`
-        The device to use for the cross encoder model. If None, will use "cuda" if available, otherwise "cpu".
-
-!!! Note
-    Only returns `_relevance_score`. Does not support `return_score = "all"`.
-
-
-### ColBERT Reranker
-This reranker uses the ColBERT model to combine the results of semantic and full-text search. You can use it by passing `ColbertrReranker()` to the `rerank()` method. 
-
-ColBERT reranker model calculates relevance of given docs against the query and don't take existing fts and vector search scores into account, so it currently only supports `return_score="relevance"`. By default, it looks for `text` column to rerank the results. But you can specify the column name to use as input to the cross encoder model as described below.
-
-```python
-from lancedb.rerankers import ColbertReranker
-
-reranker = ColbertReranker()
-
-results = table.search("harmony hall", query_type="hybrid").rerank(reranker=reranker).to_pandas()
-```
-
-### Arguments
-----------------
-* `model_name` : `str`, default `"colbert-ir/colbertv2.0"`
-        The name of the cross encoder model to use.
-* `column` : `str`, default `"text"`
-        The name of the column to use as input to the cross encoder model.
-* `return_score` : `str`, default `"relevance"`
-        options are `"relevance"` or `"all"`. Only `"relevance"` is supported for now.
-
-!!! Note
-    Only returns `_relevance_score`. Does not support `return_score = "all"`.
-
-### OpenAI Reranker
-This reranker uses the OpenAI API to combine the results of semantic and full-text search. You can use it by passing `OpenaiReranker()` to the `rerank()` method.
-
-!!! Note
-    This prompts chat model to rerank results which is not a dedicated reranker model. This should be treated as experimental.
-
-!!! Tip
-    - You might run out of token limit so set the search `limits` based on your token limit.
-    - It is recommended to use gpt-4-turbo-preview, the default model, older models might lead to undesired behaviour
-
-```python
-from lancedb.rerankers import OpenaiReranker
-
-reranker = OpenaiReranker()
-
-results = table.search("harmony hall", query_type="hybrid").rerank(reranker=reranker).to_pandas()
-```
-
-### Arguments
-----------------
-* `model_name` : `str`, default `"gpt-4-turbo-preview"`
-    The name of the cross encoder model to use.
-* `column` : `str`, default `"text"`
-    The name of the column to use as input to the cross encoder model.
-* `return_score` : `str`, default `"relevance"`
-    options are "relevance" or "all". Only "relevance" is supported for now.
-* `api_key` : `str`, default `None`
-    The API key to use. If None, will use the OPENAI_API_KEY environment variable.
-
-
-## Building Custom Rerankers
-You can build your own custom reranker by subclassing the `Reranker` class and implementing the `rerank_hybrid()` method. Here's an example of a custom reranker that combines the results of semantic and full-text search using a linear combination of the scores.
-
-The `Reranker` base interface comes with a `merge_results()` method that can be used to combine the results of semantic and full-text search. This is a vanilla merging algorithm that simply concatenates the results and removes the duplicates without taking the scores into consideration. It only keeps the first copy of the row encountered. This works well in cases that don't require the scores of semantic and full-text search to combine the results. If you want to use the scores or want to support `return_score="all"`, you'll need to implement your own merging algorithm.
-
-```python
-
-from lancedb.rerankers import Reranker
-import pyarrow as pa
-
-class MyReranker(Reranker):
-    def __init__(self, param1, param2, ..., return_score="relevance"):
-        super().__init__(return_score)
-        self.param1 = param1
-        self.param2 = param2
-
-    def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table):
-        # Use the built-in merging function
-        combined_result = self.merge_results(vector_results, fts_results)
-
-        # Do something with the combined results
-        # ...
-
-        # Return the combined results
-        return combined_result
-
-```
-
-### Example of a Custom Reranker
-For the sake of simplicity let's build custom reranker that just enchances the Cohere Reranker by accepting a filter query, and accept other CohereReranker params as kwags.
-
-```python
-
-from typing import List, Union
-import pandas as pd
-from lancedb.rerankers import CohereReranker
-
-class MofidifiedCohereReranker(CohereReranker):
-    def __init__(self, filters: Union[str, List[str]], **kwargs):
-        super().__init__(**kwargs)
-        filters = filters if isinstance(filters, list) else [filters]
-        self.filters = filters
-
-    def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table)-> pa.Table:
-        combined_result = super().rerank_hybrid(query, vector_results, fts_results)
-        df = combined_result.to_pandas()
-        for filter in self.filters:
-            df = df.query("not text.str.contains(@filter)")
-
-        return pa.Table.from_pandas(df)
-
-```
-
-!!! tip
-    The `vector_results` and `fts_results` are pyarrow tables. You can convert them to pandas dataframes using `to_pandas()` method and perform any operations you want. After you are done, you can convert the dataframe back to pyarrow table using `pa.Table.from_pandas()` method and return it.
diff --git a/docs/src/reranking/index.md b/docs/src/reranking/index.md
@@ -71,6 +71,8 @@ LanceDB comes with some built-in rerankers. Here are some of the rerankers that
 - [OpenAI Reranker](./openai.md)
 - [Linear Combination Reranker](./linear_combination.md)
 - [Jina Reranker](./jina.md)
+- [AnswerDotAI Rerankers](./answerdotai.md)
+- [Reciprocal Rank Fusion Reranker](./rrf.md)
 
 ## Creating Custom Rerankers
 

diff --git a/python/python/lancedb/query.py b/python/python/lancedb/query.py
@@ -35,7 +35,7 @@
 from . import __version__
 from .arrow import AsyncRecordBatchReader
 from .rerankers.base import Reranker
-from .rerankers.linear_combination import LinearCombinationReranker
+from .rerankers.rrf import RRFReranker
 from .util import safe_import_pandas
 
 if TYPE_CHECKING:
@@ -916,7 +916,8 @@ class LanceHybridQueryBuilder(LanceQueryBuilder):
     """
     A query builder that performs hybrid vector and full text search.
     Results are combined and reranked based on the specified reranker.
-    By default, the results are reranked using the LinearCombinationReranker.
+    By default, the results are reranked using the RRFReranker, which
+    uses reciprocal rank fusion score for reranking.
 
     To make the vector and fts results comparable, the scores are normalized.
     Instead of normalizing scores, the `normalize` parameter can be set to "rank"
@@ -935,7 +936,7 @@ def __init__(
         self._vector_column = vector_column
         self._fts_columns = fts_columns
         self._norm = "score"
-        self._reranker = LinearCombinationReranker(weight=0.7, fill=1.0)
+        self._reranker = RRFReranker()
         self._nprobes = None
         self._refine_factor = None
 
@@ -1066,7 +1067,7 @@ def _normalize_scores(self, results: pa.Table, column: str, invert=False):
     def rerank(
         self,
         normalize="score",
-        reranker: Reranker = LinearCombinationReranker(weight=0.7, fill=1.0),
+        reranker: Reranker = RRFReranker(),
     ) -> LanceHybridQueryBuilder:
         """
         Rerank the hybrid search results using the specified reranker. The reranker
@@ -1078,7 +1079,7 @@ def rerank(
             The method to normalize the scores. Can be "rank" or "score". If "rank",
             the scores are converted to ranks and then normalized. If "score", the
             scores are normalized directly.
-        reranker: Reranker, default LinearCombinationReranker(weight=0.7, fill=1.0)
+        reranker: Reranker, default RRFReranker()
             The reranker to use. Must be an instance of Reranker class.
         Returns
         -------