feat!: migrate FTS from tantivy to lance-index (lancedb#1483)
Lance now supports FTS, so add it into lancedb Python, TypeScript and
Rust SDKs.

For Python, we still use Tantivy-based FTS by default because the Lance
FTS index still lacks some Tantivy features.

For Python:
- Support to create lance based FTS index
- Support to specify columns for full text search (only available for
lance based FTS index)

For TypeScript:
- Change the search method so that it can accept both string and vector
- Support full text search

For Rust
- Support full text search

The others:
- Update the FTS doc

BREAKING CHANGE: 
- for Python, this renames the attached score column of FTS from "score"
to "_score"; this could be a breaking change for users who rely on the
scores

---------

Signed-off-by: BubbleCal <[email protected]>
BubbleCal authored Aug 8, 2024
1 parent 4db554e commit f9d5fa8
Showing 34 changed files with 715 additions and 147 deletions.
187 changes: 140 additions & 47 deletions docs/src/fts.md
@@ -1,9 +1,14 @@
# Full-text search

LanceDB provides support for full-text search via [Tantivy](https://github.com/quickwit-oss/tantivy) (currently Python only), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions. Our goal is to push the FTS integration down to the Rust level in the future, so that it's available for Rust and JavaScript users as well. Follow along at [this Github issue](https://github.com/lancedb/lance/issues/1195)
LanceDB provides support for full-text search via Lance (previously via [Tantivy](https://github.com/quickwit-oss/tantivy), Python only), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions.

Currently, the Lance full text search is missing some features that are in the Tantivy full text search. This includes phrase queries, re-ranking, and customizing the tokenizer. Thus, in Python, Tantivy is still the default way to do full text search and many of the instructions below apply just to Tantivy-based indices.

## Installation

## Installation (Only for Tantivy-based FTS)

!!! note
No need to install the tantivy dependency if using native FTS

To use full-text search, install the dependency [`tantivy-py`](https://github.com/quickwit-oss/tantivy-py):

@@ -14,63 +19,117 @@ pip install tantivy==0.20.1

## Example

Consider that we have a LanceDB table named `my_table`, whose string column `text` we want to index and query via keyword search.
Consider a LanceDB table named `my_table` whose string column `text` we want to index and query via keyword search. The FTS index must be created before you can search via keywords.

```python
import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)

table = db.create_table(
"my_table",
data=[
{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
{"vector": [5.9, 26.5], "text": "There are several kittens playing"},
],
)
```
=== "Python"

## Create FTS index on single column
```python
import lancedb

The FTS index must be created before you can search via keywords.
uri = "data/sample-lancedb"
db = lancedb.connect(uri)

```python
table.create_fts_index("text")
```
table = db.create_table(
    "my_table",
    data=[
        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
    ],
)

To search an FTS index via keywords, LanceDB's `table.search` accepts a string as input:
# pass `use_tantivy=False` to use the Lance FTS index
# (`use_tantivy=True` is the default)
table.create_fts_index("text")
table.search("puppy").limit(10).select(["text"]).to_list()
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
```

```python
table.search("puppy").limit(10).select(["text"]).to_list()
```
=== "TypeScript"

```typescript
import * as lancedb from "@lancedb/lancedb";
const uri = "data/sample-lancedb"
const db = await lancedb.connect(uri);

const data = [
{ vector: [3.1, 4.1], text: "Frodo was a happy puppy" },
{ vector: [5.9, 26.5], text: "There are several kittens playing" },
];
const tbl = await db.createTable("my_table", data, { mode: "overwrite" });
await tbl.createIndex("text", {
config: lancedb.Index.fts(),
});

await tbl
.search("puppy")
.select(["text"])
.limit(10)
.toArray();
```

This returns the result as a list of dictionaries as follows.
=== "Rust"

```rust
let uri = "data/sample-lancedb";
let db = connect(uri).execute().await?;
let initial_data: Box<dyn RecordBatchReader + Send> = create_some_records()?;
let tbl = db
.create_table("my_table", initial_data)
.execute()
.await?;
tbl
.create_index(&["text"], Index::FTS(FtsIndexBuilder::default()))
.execute()
.await?;

tbl
.query()
.full_text_search(FullTextSearchQuery::new("puppy".to_owned()))
.select(lancedb::query::Select::Columns(vec!["text".to_owned()]))
.limit(10)
.execute()
.await?;
```

```python
[{'text': 'Frodo was a happy puppy', 'score': 0.6931471824645996}]
```
By default, the search runs over all indexed columns, which is useful when several columns are indexed.
For now, this is supported only with Tantivy-based FTS.

To restrict the search to specific columns, pass `fts_columns="text"`. This option is not available for Tantivy-based full text search.

!!! note
LanceDB automatically searches on the existing FTS index if the input to the search is of type `str`. If you provide a vector as input, LanceDB will search the ANN index instead.
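
To give a feel for where a BM25 score comes from, here is a minimal pure-Python sketch applied to the two sample rows above. It is not LanceDB's or Tantivy's implementation: the lowercase whitespace tokenizer, the Lucene-style IDF formula, and the constants `k1=1.2` and `b=0.75` are all assumptions chosen for illustration, and `bm25_scores` is a hypothetical helper name.

```python
import math


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula.

    Assumptions (not taken from LanceDB): lowercase whitespace tokenization,
    Lucene-style IDF, and the common defaults k1=1.2, b=0.75.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = [0.0] * n_docs

    for term in query.lower().split():
        # Document frequency: how many documents contain the term.
        df = sum(1 for tokens in tokenized if term in tokens)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for i, tokens in enumerate(tokenized):
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Length normalization relative to the average document length.
            norm = 1 - b + b * len(tokens) / avg_len
            scores[i] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return scores


docs = ["Frodo was a happy puppy", "There are several kittens playing"]
scores = bm25_scores("puppy", docs)
print([round(s, 4) for s in scores])
```

Under these assumptions the first row scores about 0.6931 (the natural log of 2) and the second scores 0. That happens to agree with the `_score` in the sample output above, but the match alone does not establish which constants the library uses.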

## Tokenization
By default, the text is tokenized by splitting on punctuation and whitespace, and then removing tokens that are longer than 40 characters. For language-specific tokenization, pass the `tokenizer_name` argument with the two-letter language code followed by "_stem". For English, that is "en_stem".

```python
table.create_fts_index("text", tokenizer_name="en_stem")
```
For now, only the Tantivy-based FTS index supports specifying the tokenizer, so this option is only available in Python with `use_tantivy=True`.

=== "use_tantivy=True"

```python
table.create_fts_index("text", use_tantivy=True, tokenizer_name="en_stem")
```

=== "use_tantivy=False"

The following [languages](https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html) are currently supported.
[**Not supported yet**](https://github.com/lancedb/lance/issues/1195)

the following [languages](https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html) are currently supported.
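
The default tokenization described at the start of this section (split on punctuation and whitespace, then drop tokens longer than 40 characters) can be sketched in plain Python. This is an illustration only, not the tokenizer LanceDB or Tantivy actually runs: `simple_tokenize` is a hypothetical helper, and details such as case folding and the exact length boundary are assumptions.

```python
import re

MAX_TOKEN_LEN = 40  # limit quoted in the text above


def simple_tokenize(text, max_len=MAX_TOKEN_LEN):
    """Split on anything that is not a letter or digit, then drop long tokens.

    Assumption: tokens of exactly `max_len` characters are kept; the real
    tokenizer may treat the boundary differently.
    """
    tokens = re.findall(r"[^\W_]+", text)
    return [token for token in tokens if len(token) <= max_len]


print(simple_tokenize("Frodo was a happy puppy!"))
print(simple_tokenize("short " + "x" * 41))
```

The second call shows the length filter: the 41-character token is removed and only `short` remains.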

## Index multiple columns

If you have multiple string columns to index, there's no need to combine them manually -- simply pass them all as a list to `create_fts_index`:

```python
table.create_fts_index(["text1", "text2"])
```
=== "use_tantivy=True"

```python
table.create_fts_index(["text1", "text2"])
```

=== "use_tantivy=False"

[**Not supported yet**](https://github.com/lancedb/lance/issues/1195)

Note that the search API call does not change - you can search over all indexed columns at once.

@@ -80,19 +139,48 @@ Currently the LanceDB full text search feature supports *post-filtering*, meanin
applied on top of the full text search results. This can be invoked via the familiar
`where` syntax:

```python
table.search("puppy").limit(10).where("meta='foo'").to_list()
```
=== "Python"

```python
table.search("puppy").limit(10).where("meta='foo'").to_list()
```

=== "TypeScript"

```typescript
await tbl
.search("apple")
.select(["id", "doc"])
.limit(10)
.where("meta='foo'")
.toArray();
```

=== "Rust"

```rust
table
.query()
.full_text_search(FullTextSearchQuery::new(words[0].to_owned()))
.select(lancedb::query::Select::Columns(vec!["doc".to_owned()]))
.limit(10)
.only_if("meta='foo'")
.execute()
.await?;
```
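
As a rough mental model of post-filtering, the sketch below ranks hits first and applies the predicate afterwards. It is plain Python over hypothetical in-memory rows, not the LanceDB API, and it assumes the filter runs after the top `limit` hits are selected. Under that assumption, a post-filtered query can return fewer than `limit` rows.

```python
def post_filter_search(hits, predicate, limit):
    """Keep the `limit` best-scoring hits, then drop those failing the filter.

    Assumption: the filter is applied after ranking and truncation.
    """
    top = sorted(hits, key=lambda hit: hit["_score"], reverse=True)[:limit]
    return [hit for hit in top if predicate(hit)]


# Hypothetical search hits; `meta` mirrors the column used in the filter above.
hits = [
    {"text": "happy puppy", "meta": "foo", "_score": 0.9},
    {"text": "sleepy puppy", "meta": "bar", "_score": 0.7},
    {"text": "puppy school", "meta": "foo", "_score": 0.4},
]

results = post_filter_search(hits, lambda hit: hit["meta"] == "foo", limit=2)
print([hit["text"] for hit in results])
```

Here only one row survives even though two rows match `meta='foo'`, because the second match was ranked outside the top two before the filter ran.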

## Sorting

!!! warning "Warn"
Sorting is available only for Tantivy-based FTS

You can pre-sort the documents by specifying `ordering_field_names` when
creating the full-text search index. Once pre-sorted, you can then specify
`ordering_field_name` while searching to return results sorted by the given
field. For example,
field. For example,

```
table.create_fts_index(["text_field"], ordering_field_names=["sort_by_field"])
```python
table.create_fts_index(["text_field"], use_tantivy=True, ordering_field_names=["sort_by_field"])

(table.search("terms", ordering_field_name="sort_by_field")
.limit(20)
@@ -105,8 +193,8 @@ table.create_fts_index(["text_field"], ordering_field_names=["sort_by_field"])
error will be raised that looks like `ValueError: The field does not exist: xxx`

!!! note
The fields to sort on must be of type unsigned integer, or else you will see
an error during indexing that looks like
The fields to sort on must be of type unsigned integer, or else you will see
an error during indexing that looks like
`TypeError: argument 'value': 'float' object cannot be interpreted as an integer`.
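
Since the ordering field must hold unsigned integers, it can help to check the data before building the index. The helper below is a hypothetical pre-check in plain Python, not part of LanceDB. It assumes rows are dictionaries like those passed to `create_table` and reports every value that is not a non-negative integer.

```python
def find_invalid_ordering_values(rows, field):
    """Return (row_index, value) pairs that are not non-negative integers."""
    invalid = []
    for index, row in enumerate(rows):
        value = row.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            invalid.append((index, value))
    return invalid


rows = [
    {"text_field": "a", "sort_by_field": 3},
    {"text_field": "b", "sort_by_field": 2.5},
    {"text_field": "c", "sort_by_field": -1},
]
print(find_invalid_ordering_values(rows, "sort_by_field"))
```

An empty result suggests the column is safe to use in `ordering_field_names`; any reported rows would need to be converted or removed first.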

!!! note
@@ -116,6 +204,9 @@ table.create_fts_index(["text_field"], ordering_field_names=["sort_by_field"])

## Phrase queries vs. terms queries

!!! warning "Warn"
Phrase queries are available only for Tantivy-based FTS

For full-text search you can specify either a **phrase** query like `"the old man and the sea"`,
or a **terms** search query like `"(Old AND Man) AND Sea"`. For more details on the terms
query syntax, see Tantivy's [query parser rules](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html).
@@ -142,15 +233,15 @@ enforce it in one of two ways:

1. Place the double-quoted query inside single quotes. For example, `table.search('"they could have been dogs OR cats"')` is treated as
a phrase query.
2. Explicitly declare the `phrase_query()` method. This is useful when you have a phrase query that
1. Explicitly declare the `phrase_query()` method. This is useful when you have a phrase query that
itself contains double quotes. For example, `table.search('the cats OR dogs were not really "pets" at all').phrase_query()`
is treated as a phrase query.

In general, a query that's declared as a phrase query will be wrapped in double quotes during parsing, with nested
double quotes replaced by single quotes.
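
The wrapping rule in the previous paragraph can be sketched as a small string transformation. This is an illustration of the described behavior, not LanceDB's parsing code: `as_phrase_query` is a hypothetical name, and stripping an existing pair of outer double quotes before wrapping is an assumption made so that already-quoted input is left unchanged.

```python
def as_phrase_query(query):
    """Wrap a query in double quotes, turning nested double quotes into single quotes.

    Assumption: a query that is already wrapped in double quotes keeps its
    outer pair instead of being wrapped a second time.
    """
    inner = query.strip()
    if len(inner) >= 2 and inner.startswith('"') and inner.endswith('"'):
        inner = inner[1:-1]
    return '"' + inner.replace('"', "'") + '"'


print(as_phrase_query('the cats OR dogs were not really "pets" at all'))
print(as_phrase_query('"they could have been dogs OR cats"'))
```

The first call replaces the nested quotes around `pets` with single quotes and wraps the whole query; the second returns its input unchanged.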


## Configurations
## Configurations (Only for Tantivy-based FTS)

By default, LanceDB configures a 1GB heap size limit for creating the index. You can
reduce this if running on a smaller node, or increase this for faster performance while
@@ -164,6 +255,8 @@ table.create_fts_index(["text1", "text2"], writer_heap_size=heap, replace=True)

## Current limitations

For Tantivy-based FTS:

1. Currently we do not yet support incremental writes.
If you add data after FTS index creation, it won't be reflected
in search results until you do a full reindex.
17 changes: 16 additions & 1 deletion nodejs/__test__/table.test.ts
@@ -785,11 +785,26 @@ describe.each([arrow13, arrow14, arrow15, arrow16, arrow17])(
];
const table = await db.createTable("test", data);

expect(table.search("hello").toArray()).rejects.toThrow(
expect(table.search("hello", "vector").toArray()).rejects.toThrow(
"No embedding functions are defined in the table",
);
});

test("full text search if no embedding function provided", async () => {
const db = await connect(tmpDir.name);
const data = [
{ text: "hello world", vector: [0.1, 0.2, 0.3] },
{ text: "goodbye world", vector: [0.4, 0.5, 0.6] },
];
const table = await db.createTable("test", data);
await table.createIndex("text", {
config: Index.fts(),
});

const results = await table.search("hello").toArray();
expect(results[0].text).toBe(data[0].text);
});

test.each([
[0.4, 0.5, 0.599], // number[]
Float32Array.of(0.4, 0.5, 0.599), // Float32Array
52 changes: 52 additions & 0 deletions nodejs/examples/full_text_search.ts
@@ -0,0 +1,52 @@
// Copyright 2024 Lance Developers.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

import * as lancedb from "@lancedb/lancedb";

const db = await lancedb.connect("data/sample-lancedb");

const words = [
"apple",
"banana",
"cherry",
"date",
"elderberry",
"fig",
"grape",
];

const data = Array.from({ length: 10_000 }, (_, i) => ({
vector: Array(1536).fill(i),
id: i,
item: `item ${i}`,
strId: `${i}`,
doc: words[i % words.length],
}));

const tbl = await db.createTable("myVectors", data, { mode: "overwrite" });

await tbl.createIndex("doc", {
config: lancedb.Index.fts(),
});

// --8<-- [start:full_text_search]
let result = await tbl
.search("apple")
.select(["id", "doc"])
.limit(10)
.toArray();
console.log(result);
// --8<-- [end:full_text_search]

console.log("SQL search: done");