v0.2.2

AmenRa · Aug 23, 2023 · e959ebe · e959ebe
1 parent 1c335cc
commit e959ebe
Show file tree

Hide file tree

Showing 20 changed files with 1,616 additions and 24 deletions.
diff --git a/README.md b/README.md
@@ -24,6 +24,10 @@
 </p>
 
 ## 🔥 News
+- [August 23, 2023] `retriv` 0.2.2 is out!  
+This release adds _experimental_ support for multi-field documents and filters.
+Please, refer to [Advanced Retriever](https://github.com/AmenRa/retriv/blob/main/docs/advanced_retriever.md) documentation.
+
 - [February 18, 2023] `retriv` 0.2.0 is out!  
 This release adds support for Dense and Hybrid Retrieval.
 Dense Retrieval leverages the semantic similarity of the queries' and documents' vector representations, which can be computed directly by `retriv` or imported from other sources.
@@ -51,6 +55,8 @@ Click [here](https://github.com/AmenRa/retriv/blob/main/docs/sparse_retriever.md
 Click [here](https://github.com/AmenRa/retriv/blob/main/docs/dense_retriever.md) to learn more.
 - [Hybrid Retriever](https://github.com/AmenRa/retriv/blob/main/docs/hybrid_retriever.md): an hybrid retriever is a retrieval model built on top of a sparse and a dense retriever.
 Click [here](https://github.com/AmenRa/retriv/blob/main/docs/hybrid_retriever.md) to learn more.
+- [Advanced Retriever](https://github.com/AmenRa/retriv/blob/main/docs/advanced_retriever.md): an advanced sparse retriever supporting filters. This is and experimental feature.
+Click [here](https://github.com/AmenRa/retriv/blob/main/docs/advanced_retriever.md) to learn more.
 
 ### Unified Search Interface
 All the supported retrievers share the same search interface:
@@ -101,7 +107,7 @@ se = SearchEngine("new-index").index(collection)
 se.search("witches masses")
 ```
 Output:
-```python
+```json
 [
   {
     "id": "doc_2",

diff --git a/changelog.md b/changelog.md
@@ -4,6 +4,13 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.2] - 2023-08-23
+### Added
+- Added `advanced_retriever.py`.
+
+### Changed
+- Text preprocessing refactoring.
+
 ## [0.2.1] - 2023-05-16
 ### Added
 - Added doc strings to `sparse_retriver.py`.

diff --git a/docs/advanced_retriever.md b/docs/advanced_retriever.md
@@ -0,0 +1,281 @@
+# Advanced Retriever
+
+⚠️ This is an experimental feature.
+
+The Advanced Retriever is a searcher based on lexical matching and search filters.
+It supports [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) and [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) as the [Sparse Retriever](https://github.com/AmenRa/retriv/blob/main/docs/sparse_retriever.md) and provides the same resources for multi-lingual [text pre-processing](https://github.com/AmenRa/retriv/blob/main/docs/text_preprocessing.md). In addition, it supports search filters, i.e., a set of rules that can be used to filter out documents from the search results.
+
+In the following, we show how to build a search engine employing an advanced retriever, index a document collection, and search it.
+
+## Schema
+
+The first step to building an Advanced Retriever is to define the `schema` of document collection.
+The `schema` is a dictionary describing the documents' `fields` and their `data types`.
+Based on the `data types`, search `filters` can be defined and applied to the search results.
+
+[retriv](https://github.com/AmenRa/retriv) supports the following data types:
+- __id:__ field used for the document IDs.
+- __text:__ text field used for lexical matching.
+- __number:__ numeric value.
+- __bool:__ boolean value (True or False).
+- __keyword:__ string or number representing a keyword or a category.
+- __keywords:__ list of keywords.
+
+An example of `schema` for a collection of books is shown below.  
+NB: At the time of writing, [retriv](https://github.com/AmenRa/retriv) supports only one text field per schema.
+Therefore, the `content` field is used for both the title and the abstract of the books.
+
+```json
+schema = {
+  "isbn": "id",
+  "content": "text",
+  "year": "number",
+  "is_english": "bool",
+  "author": "keyword",
+  "genres": "keywords",
+}
+```
+
+## Build
+
+The Advanced Retriever provides several options to tailor its functioning to you preferences, as shown below.
+
+```python
+from retriv.experimental import AdvancedRetriever
+
+ar = AdvancedRetriever(
+  schema=schema,
+  index_name="new-index",
+  model="bm25",
+  min_df=1,
+  tokenizer="whitespace",
+  stemmer="english",
+  stopwords="english",
+  do_lowercasing=True,
+  do_ampersand_normalization=True,
+  do_special_chars_normalization=True,
+  do_acronyms_normalization=True,
+  do_punctuation_removal=True,
+)
+```
+
+- `schema`: the documents' schema.
+- `index_name`: [retriv](https://github.com/AmenRa/retriv) will use `index_name` as the identifier of your index.
+- `model`: defines the retrieval model to use for searching (`bm25` or `tf-idf`).
+- `min_df`: terms that appear in less than `min_df` documents will be ignored.
+If integer, the parameter indicates the absolute count.
+If float, it represents a proportion of documents.
+- `tokenizer`: [tokenizer](https://github.com/AmenRa/retriv/blob/main/docs/text_preprocessing.md) to use during preprocessing. You can pass a custom callable tokenizer or disable tokenization by setting the parameter to `None`.
+- `stemmer`: [stemmer](https://github.com/AmenRa/retriv/blob/main/docs/text_preprocessing.md) to use during preprocessing. You can pass a custom callable stemmer or disable stemming setting the parameter to `None`.
+- `stopwords`: [stopwords](https://github.com/AmenRa/retriv/blob/main/docs/text_preprocessing.md) to remove during preprocessing. You can pass a custom stop-word list or disable stop-words removal by setting the parameter to `None`.
+- `do_lowercasing`: whether to lowercase texts.
+- `do_ampersand_normalization`: whether to convert `&` in `and` during pre-processing.
+- `do_special_chars_normalization`: whether to remove special characters for letters, e.g., `übermensch` → `ubermensch`.
+- `do_acronyms_normalization`: whether to remove full stop symbols from acronyms without splitting them in multiple words, e.g., `P.C.I.` → `PCI`.
+- `do_punctuation_removal`: whether to remove punctuation.
+
+__Note:__ text pre-processing is equally applied to documents during indexing and to queries at search time.
+
+## Index
+
+### Create
+You can index a document collection from JSONl, CSV, or TSV files.
+CSV and TSV files must have a header.
+[retriv](https://github.com/AmenRa/retriv) automatically infers the file kind, so there's no need to specify it.
+Use the `callback` parameter to pass a function for converting your documents in the format defined by your `schema` on the fly.
+Indexes are automatically persisted on disk at the end of the process.
+
+```python
+ar = ar.index_file(
+  path="path/to/collection",  # File kind is automatically inferred
+  show_progress=True,         # Default value
+  callback=lambda doc: {      # Callback defaults to None.
+    "id": doc["id"],
+    "text": doc["title"] + ". " + doc["text"],       
+    ...   
+  )
+```
+
+### Load
+```python
+ar = AdvancedRetriever.load("index-name")
+```
+
+### Delete
+```python
+AdvancedRetriever.delete("index-name")
+```
+
+## Search
+
+### Query & Filters
+
+Advanced Retriever search query can be either a string or a dictionary.
+In the former case, the string is used as the query text and no filters are applied.
+In the latter case, the dictionary defines the query text and the filters to apply to the search results. If the query text is omitted from the dictionary, documents matching the filters will be returned.
+
+[retriv](https://github.com/AmenRa/retriv) supports two way of filtering the search results (`where` and `where_not`) and several type-specific operators.
+
+- `where` means that only the documents matching the filter will be considered during search.
+- `where_not` means that the documents matching the filter will be ignored during search.
+
+Below we describe the effects of the supported operators for each data type and way of filtering.
+
+#### Where
+
+| Field Type | Operator  | Value                      | Meaning                                                                                                                                              |
+| ---------- | --------- | -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
+| number     | `eq`      | number                     | Only the documents whose field value is **equal to** the provided value will be considered during search.                                            |
+| number     | `gt`      | number                     | Only the documents whose field value is **greater than** the provided value will be considered during search.                                        |
+| number     | `gte`     | number                     | Only the documents whose field value is **greater or equal to** the provided value will be considered during search.                                 |
+| number     | `lt`      | number                     | Only the documents whose field value is **less than** the provided value will be considered during search.                                           |
+| number     | `lte`     | number                     | Only the documents whose field value is **less or equal to** the provided value will be considered during search.                                    |
+| number     | `between` | number                     | Only the documents whose field value is **between** the provided values (included) will be considered during search.                                 |
+| bool       |           | True / False               | Only the documents whose field value is **equal to** the provided value will be considered during search.                                            |
+| keyword    |           | any value / list of values | Only the documents whose field value is **equal to** the provided value or **among** the provided values will be considered during search.           |
+| keywords   | `or`      | any value / list of values | Only the documents whose field value is **contains** the provided value or **contains one of** the provided values will be considered during search. |
+| keywords   | `and`     | any value / list of values | Only the documents whose field value **contains all** the provided values will be considered during search.                                          |
+
+Query example:
+```python
+query = {
+    "text": "search terms",
+    "where": {
+        "numeric_field_name": ("gte", 1970),
+        "boolean_field_name": True,
+        "keyword_field_name": "kw_1",
+        "keywords_field_name": ("or", ["kws_23", "kws_666"]),
+    }
+}
+```
+
+Alternatively, you can omit the `where` key and use the following syntax:
+```python
+query = {
+    "text": "search terms",
+    "numeric_field_name": ("gte", 1970),
+    "boolean_field_name": True,
+    "keyword_field_name": "kw_1",
+    "keywords_field_name": ("or", ["kws_23", "kws_666"]),
+}
+```
+
+
+#### Where not
+
+| Field Type | Operator  | Value                      | Meaning                                                                                                                        |
+| ---------- | --------- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| number     | `eq`      | number                     | The documents whose field value is **equal to** the provided value will be ignored.                                            |
+| number     | `gt`      | number                     | The documents whose field value is **greater than** the provided value will be ignored.                                        |
+| number     | `gte`     | number                     | The documents whose field value is **greater or equal to** the provided value will be ignored.                                 |
+| number     | `lt`      | number                     | The documents whose field value is **less than** the provided value will be ignored.                                           |
+| number     | `lte`     | number                     | The documents whose field value is **less or equal to** the provided value will be ignored.                                    |
+| number     | `between` | number                     | The documents whose field value is **between** the provided values (included) will be ignored.                                 |
+| bool       |           | True / False               | The documents whose field value is **equal to** the provided value will be ignored.                                            |
+| keyword    |           | any value / list of values | The documents whose field value is **equal to** the provided value or **among** the provided values will be ignored.           |
+| keywords   | `or`      | any value / list of values | The documents whose field value is **contains** the provided value or **contains one of** the provided values will be ignored. |
+| keywords   | `and`     | any value / list of values | The documents whose field value **contains all** the provided values will be ignored.                                          |
+
+Query example:
+```python
+query = {
+    "text": "search terms",
+    "where": {
+        "numeric_field_name": ("gte", 1970),
+        "boolean_field_name": True,
+        "keyword_field_name": "kw_1",
+        "keywords_field_name": ("or", ["kws_23", "kws_666"]),
+    }
+}
+```
+
+### Search
+
+```python
+ar.search(
+  query: ... 
+  return_docs=True     # Default value.
+  cutoff=100           # Default value.
+  operator="OR"        # Default value.
+  subset_doc_ids=None  # Default value.
+)
+```
+
+- `query`: what to search for and which filters to apply. See the section [Query & Filters](#query--filters) for more details.
+- `return_docs`: whether to return documents or only their IDs.
+- `cutoff`: number of results to return.
+- `operator`: whether to perform conjunctive (`AND`) or disjunctive (`OR`) search.  Conjunctive search retrieves documents that contain **all** the query terms. Disjunctive search retrieves documents that contain **at least one** of the query terms.
+- `subset_doc_ids`: restrict the search to the subset of documents having the provided IDs.
+
+Sample output:
+```json
+[
+  {
+    "id": "doc_2",
+    "text": "Just like witches at black masses",
+    "score": 1.7536403
+  },
+  {
+    "id": "doc_1",
+    "text": "Generals gathered in their masses",
+    "score": 0.6931472
+  }
+]
+```
+
+
+
+
+<!-- ### Multi-Search
+
+Compute results for multiple queries at once.
+
+```python
+ar.msearch(
+  queries=[{"id": "q_1", "text": "witches masses"}, ...],
+  cutoff=100,  # Default value, number of results
+)
+```
+Output:
+```json
+{
+  "q_1": {
+    "doc_2": 1.7536403,
+    "doc_1": 0.6931472
+  },
+  ...
+}
+```
+
+### Batch-Search
+
+Batch-Search is similar to Multi-Search but automatically generates batches of queries to evaluate and allows dynamic writing of the search results to disk in [JSONl](https://jsonlines.org) format.
+[bsearch](#batch-search) is handy for computing results for hundreds of thousands or even millions of queries without hogging your RAM.
+
+```python
+ar.bsearch(
+  queries=[{"id": "q_1", "text": "witches masses"}, ...],
+  cutoff=100,
+  batch_size=1_000,
+  show_progress=True,
+  qrels=None,   # To add relevance information to the saved files 
+  path=None,    # Where to save the results, if you want to save them
+)
+``` -->
+
+<!-- ## AutoTune
+
+Use the AutoTune function to tune [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) parameters w.r.t. your document collection and queries.
+All metrics supported by [ranx](https://github.com/AmenRa/ranx) are supported by the `autotune` function.
+At the of the process, the best parameter configuration is automatically applied to the `AdvancedRetriever` instance and saved to disk.
+aou can inspect the current configuration by printing `ar.hyperparams`.
+
+```python
+ar.autotune(
+  queries=[{ "q_id": "q_1", "text": "...", ... }],  # Train queries
+  qrels=[{ "q_1": { "doc_1": 1, ... }, ... }],      # Train qrels
+  metric="ndcg",  # Default value, metric to maximize
+  n_trials=100,   # Default value, number of trials
+  cutoff=100,     # Default value, number of results
+)
+``` -->
diff --git a/docs/dense_retriever.md b/docs/dense_retriever.md
@@ -84,7 +84,7 @@ dr.search(
 )
 ```
 Output:
-```python
+```json
 [
   {
     "id": "doc_2",
@@ -111,7 +111,7 @@ dr.msearch(
 )
 ```
 Output:
-```python
+```json
 {
   "q_1": {
     "doc_2": 1.7536403,