feat!: migrate FTS from tantivy to lance-index (lancedb#1483)

Lance now supports FTS, so add it to the lancedb Python, TypeScript and Rust SDKs.

For Python, we still use Tantivy-based FTS by default because the Lance FTS index currently lacks some features of Tantivy.

For Python:
- Support creating a Lance-based FTS index
- Support specifying columns for full text search (only available for the Lance-based FTS index)

For TypeScript:
- Change the search method so that it can accept both string and vector
- Support full text search

For Rust:
- Support full text search

The others:
- Update the FTS doc

BREAKING CHANGE:
- For Python, this renames the attached score column of FTS from "score" to "_score". This could be a breaking change for users that rely on the scores.

---------

Signed-off-by: BubbleCal <[email protected]>
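
For callers affected by the breaking change above, a small compatibility shim can read the score under either column name. The sketch below is only an illustration: the helper name `fts_score` is hypothetical (it is not part of LanceDB), and it assumes result rows are plain dicts like those returned by `to_list()`.

```python
from typing import Any, Mapping, Optional


def fts_score(row: Mapping[str, Any]) -> Optional[float]:
    """Return the FTS relevance score from a result row.

    Hypothetical helper, not a LanceDB API: it prefers the new "_score"
    column and falls back to the pre-migration "score" column.
    Returns None when neither column is present.
    """
    if "_score" in row:
        return row["_score"]
    return row.get("score")


# Rows shaped like the `to_list()` output shown in the docs diff.
new_row = {"text": "Frodo was a happy puppy", "_score": 0.6931471824645996}
old_row = {"text": "Frodo was a happy puppy", "score": 0.6931471824645996}

print(fts_score(new_row), fts_score(old_row))
```

Code that reads `row["score"]` directly could be switched to such a helper while both index types are in use, then simplified to `row["_score"]` once the migration is complete.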
Epicism · Aug 8, 2024 · f9d5fa8
1 parent 4db554e
commit f9d5fa8

Showing 34 changed files with 715 additions and 147 deletions.
diff --git a/docs/src/fts.md b/docs/src/fts.md
@@ -1,9 +1,14 @@
 # Full-text search
 
-LanceDB provides support for full-text search via [Tantivy](https://github.com/quickwit-oss/tantivy) (currently Python only), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions. Our goal is to push the FTS integration down to the Rust level in the future, so that it's available for Rust and JavaScript users as well.  Follow along at [this Github issue](https://github.com/lancedb/lance/issues/1195)
+LanceDB provides support for full-text search via Lance (previously via [Tantivy](https://github.com/quickwit-oss/tantivy), Python only), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions.
 
+Currently, the Lance full text search is missing some features that are in the Tantivy full text search. This includes phrase queries, re-ranking, and customizing the tokenizer. Thus, in Python, Tantivy is still the default way to do full text search and many of the instructions below apply just to Tantivy-based indices.
 
-## Installation
+
+## Installation (Only for Tantivy-based FTS)
+
+!!! note
+    You do not need to install the tantivy dependency if you use the native (Lance-based) FTS.
 
 To use full-text search, install the dependency [`tantivy-py`](https://github.com/quickwit-oss/tantivy-py):
 
@@ -14,63 +19,117 @@ pip install tantivy==0.20.1
 
 ## Example
 
-Consider that we have a LanceDB table named `my_table`, whose string column `text` we want to index and query via keyword search.
+Consider that we have a LanceDB table named `my_table`, whose string column `text` we want to index and query via keyword search. The FTS index must be created before you can search via keywords.
 
-```python
-import lancedb
-
-uri = "data/sample-lancedb"
-db = lancedb.connect(uri)
-
-table = db.create_table(
-    "my_table",
-    data=[
-        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
-        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
-    ],
-)
-```
+=== "Python"
 
-## Create FTS index on single column
+    ```python
+    import lancedb
 
-The FTS index must be created before you can search via keywords.
+    uri = "data/sample-lancedb"
+    db = lancedb.connect(uri)
 
-```python
-table.create_fts_index("text")
-```
+    table = db.create_table(
+        "my_table",
+        data=[
+            {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
+            {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
+        ],
+    )
 
-To search an FTS index via keywords, LanceDB's `table.search` accepts a string as input:
+    # pass `use_tantivy=False` to use the Lance FTS index;
+    # `use_tantivy=True` is the default
+    table.create_fts_index("text")
+    table.search("puppy").limit(10).select(["text"]).to_list()
+    # [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
+    # ...
+    ```
 
-```python
-table.search("puppy").limit(10).select(["text"]).to_list()
-```
+=== "TypeScript"
+
+    ```typescript
+    import * as lancedb from "@lancedb/lancedb";
+    const uri = "data/sample-lancedb"
+    const db = await lancedb.connect(uri);
+
+    const data = [
+    { vector: [3.1, 4.1], text: "Frodo was a happy puppy" },
+    { vector: [5.9, 26.5], text: "There are several kittens playing" },
+    ];
+    const tbl = await db.createTable("my_table", data, { mode: "overwrite" });
+    await tbl.createIndex("text", {
+        config: lancedb.Index.fts(),
+    });
+
+    await tbl
+        .search("puppy")
+        .select(["text"])
+        .limit(10)
+        .toArray();
+    ```
 
-This returns the result as a list of dictionaries as follows.
+=== "Rust"
+
+    ```rust
+    let uri = "data/sample-lancedb";
+    let db = connect(uri).execute().await?;
+    let initial_data: Box<dyn RecordBatchReader + Send> = create_some_records()?;
+    let tbl = db
+        .create_table("my_table", initial_data)
+        .execute()
+        .await?;
+    tbl
+        .create_index(&["text"], Index::FTS(FtsIndexBuilder::default()))
+        .execute()
+        .await?;
+
+    tbl
+        .query()
+        .full_text_search(FullTextSearchQuery::new("puppy".to_owned()))
+        .select(lancedb::query::Select::Columns(vec!["text".to_owned()]))
+        .limit(10)
+        .execute()
+        .await?;
+    ```
 
-```python
-[{'text': 'Frodo was a happy puppy', 'score': 0.6931471824645996}]
-```
+By default, the search runs on all indexed columns, which is useful when there are multiple indexed columns.
+For now, this is supported only with Tantivy-based FTS.
+
+Pass `fts_columns="text"` to specify the columns to search; this option is not available for Tantivy-based full text search.
 
 !!! note
     LanceDB automatically searches on the existing FTS index if the input to the search is of type `str`. If you provide a vector as input, LanceDB will search the ANN index instead.
 
 ## Tokenization
 By default the text is tokenized by splitting on punctuation and whitespaces and then removing tokens that are longer than 40 chars. For more language-specific tokenization, provide the argument `tokenizer_name` with the 2-letter language code followed by "_stem". So for English it would be "en_stem".
 
-```python
-table.create_fts_index("text", tokenizer_name="en_stem")
-```
+For now, only the Tantivy-based FTS index supports specifying the tokenizer, so it's only available in Python with `use_tantivy=True`.
+
+=== "use_tantivy=True"
+
+    ```python
+    table.create_fts_index("text", use_tantivy=True, tokenizer_name="en_stem")
+    ```
+
+=== "use_tantivy=False"
 
-The following [languages](https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html) are currently supported.
+    [**Not supported yet**](https://github.com/lancedb/lance/issues/1195)
 
+The following [languages](https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html) are currently supported.
 
 ## Index multiple columns
 
 If you have multiple string columns to index, there's no need to combine them manually -- simply pass them all as a list to `create_fts_index`:
 
-```python
-table.create_fts_index(["text1", "text2"])
-```
+=== "use_tantivy=True"
+
+    ```python
+    table.create_fts_index(["text1", "text2"])
+    ```
+
+=== "use_tantivy=False"
+
+    [**Not supported yet**](https://github.com/lancedb/lance/issues/1195)
 
 Note that the search API call does not change - you can search over all indexed columns at once.
 
@@ -80,19 +139,48 @@ Currently the LanceDB full text search feature supports *post-filtering*, meanin
 applied on top of the full text search results. This can be invoked via the familiar
 `where` syntax:
 
-```python
-table.search("puppy").limit(10).where("meta='foo'").to_list()
-```
+=== "Python"
+
+    ```python
+    table.search("puppy").limit(10).where("meta='foo'").to_list()
+    ```
+
+=== "TypeScript"
+
+    ```typescript
+    await tbl
+    .search("apple")
+    .select(["id", "doc"])
+    .limit(10)
+    .where("meta='foo'")
+    .toArray();
+    ```
+
+=== "Rust"
+
+    ```rust
+    table
+        .query()
+        .full_text_search(FullTextSearchQuery::new(words[0].to_owned()))
+        .select(lancedb::query::Select::Columns(vec!["doc".to_owned()]))
+        .limit(10)
+        .only_if("meta='foo'")
+        .execute()
+        .await?;
+    ```
 
 ## Sorting
 
+!!! warning "Warn"
+    Sorting is available only for Tantivy-based FTS
+
 You can pre-sort the documents by specifying `ordering_field_names` when
 creating the full-text search index. Once pre-sorted, you can then specify
 `ordering_field_name` while searching to return results sorted by the given
-field. For example, 
+field. For example,
 
-```
-table.create_fts_index(["text_field"], ordering_field_names=["sort_by_field"])
+```python
+table.create_fts_index(["text_field"], use_tantivy=True, ordering_field_names=["sort_by_field"])
 
 (table.search("terms", ordering_field_name="sort_by_field")
  .limit(20)
@@ -105,8 +193,8 @@ table.create_fts_index(["text_field"], ordering_field_names=["sort_by_field"])
     error will be raised that looks like `ValueError: The field does not exist: xxx`
 
 !!! note
-    The fields to sort on must be of typed unsigned integer, or else you will see 
-    an error during indexing that looks like 
+    The fields to sort on must be of typed unsigned integer, or else you will see
+    an error during indexing that looks like
     `TypeError: argument 'value': 'float' object cannot be interpreted as an integer`.
 
 !!! note
@@ -116,6 +204,9 @@ table.create_fts_index(["text_field"], ordering_field_names=["sort_by_field"])
 
 ## Phrase queries vs. terms queries
 
+!!! warning "Warn"
+    Phrase queries are available only for Tantivy-based FTS
+
 For full-text search you can specify either a **phrase** query like `"the old man and the sea"`,
 or a **terms** search query like `"(Old AND Man) AND Sea"`. For more details on the terms
 query syntax, see Tantivy's [query parser rules](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html).
@@ -142,15 +233,15 @@ enforce it in one of two ways:
 
 1. Place the double-quoted query inside single quotes. For example, `table.search('"they could have been dogs OR cats"')` is treated as
 a phrase query.
-2. Explicitly declare the `phrase_query()` method. This is useful when you have a phrase query that
+1. Explicitly declare the `phrase_query()` method. This is useful when you have a phrase query that
 itself contains double quotes. For example, `table.search('the cats OR dogs were not really "pets" at all').phrase_query()`
 is treated as a phrase query.
 
 In general, a query that's declared as a phrase query will be wrapped in double quotes during parsing, with nested
 double quotes replaced by single quotes.
 
 
-## Configurations
+## Configurations (Only for Tantivy-based FTS)
 
 By default, LanceDB configures a 1GB heap size limit for creating the index. You can
 reduce this if running on a smaller node, or increase this for faster performance while
@@ -164,6 +255,8 @@ table.create_fts_index(["text1", "text2"], writer_heap_size=heap, replace=True)
 
 ## Current limitations
 
+For Tantivy-based FTS:
+
 1. Currently we do not yet support incremental writes.
    If you add data after FTS index creation, it won't be reflected
    in search results until you do a full reindex.
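
The phrase-query rule stated in `docs/src/fts.md` (a declared phrase query is wrapped in double quotes, with nested double quotes replaced by single quotes) can be illustrated in plain Python. This is a sketch of the stated rule only, not LanceDB's parser code: the function name `to_phrase_query` is hypothetical, and dropping an existing pair of outer double quotes before re-wrapping is an assumption made here so that already-quoted input is left unchanged.

```python
def to_phrase_query(query: str) -> str:
    """Normalize a query string into a double-quoted phrase query.

    Hypothetical illustration of the rule in the docs, not a LanceDB API.
    """
    query = query.strip()
    inner = query[1:-1]
    # Assumption: if the whole query is already one double-quoted phrase,
    # drop the outer quotes first so it is not wrapped twice.
    if len(query) >= 2 and query[0] == '"' and query[-1] == '"' and '"' not in inner:
        query = inner
    # Rule from the docs: nested double quotes become single quotes,
    # then the whole query is wrapped in double quotes.
    return '"' + query.replace('"', "'") + '"'


print(to_phrase_query('the cats OR dogs were not really "pets" at all'))
print(to_phrase_query('"they could have been dogs OR cats"'))
```

Both inputs mirror the examples in the phrase-query section. The first contains nested double quotes; the second is already wrapped in double quotes.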

diff --git a/nodejs/__test__/table.test.ts b/nodejs/__test__/table.test.ts
@@ -785,11 +785,26 @@ describe.each([arrow13, arrow14, arrow15, arrow16, arrow17])(
       ];
       const table = await db.createTable("test", data);
 
-      expect(table.search("hello").toArray()).rejects.toThrow(
+      expect(table.search("hello", "vector").toArray()).rejects.toThrow(
         "No embedding functions are defined in the table",
       );
     });
 
+    test("full text search if no embedding function provided", async () => {
+      const db = await connect(tmpDir.name);
+      const data = [
+        { text: "hello world", vector: [0.1, 0.2, 0.3] },
+        { text: "goodbye world", vector: [0.4, 0.5, 0.6] },
+      ];
+      const table = await db.createTable("test", data);
+      await table.createIndex("text", {
+        config: Index.fts(),
+      });
+
+      const results = await table.search("hello").toArray();
+      expect(results[0].text).toBe(data[0].text);
+    });
+
     test.each([
       [0.4, 0.5, 0.599], // number[]
       Float32Array.of(0.4, 0.5, 0.599), // Float32Array

diff --git a/nodejs/examples/full_text_search.ts b/nodejs/examples/full_text_search.ts
@@ -0,0 +1,52 @@
+// Copyright 2024 Lance Developers.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+import * as lancedb from "@lancedb/lancedb";
+
+const db = await lancedb.connect("data/sample-lancedb");
+
+const words = [
+  "apple",
+  "banana",
+  "cherry",
+  "date",
+  "elderberry",
+  "fig",
+  "grape",
+];
+
+const data = Array.from({ length: 10_000 }, (_, i) => ({
+  vector: Array(1536).fill(i),
+  id: i,
+  item: `item ${i}`,
+  strId: `${i}`,
+  doc: words[i % words.length],
+}));
+
+const tbl = await db.createTable("myVectors", data, { mode: "overwrite" });
+
+await tbl.createIndex("doc", {
+  config: lancedb.Index.fts(),
+});
+
+// --8<-- [start:full_text_search]
+let result = await tbl
+  .search("apple")
+  .select(["id", "doc"])
+  .limit(10)
+  .toArray();
+console.log(result);
+// --8<-- [end:full_text_search]
+
+console.log("Full text search: done");