Add tantivy index #161

Dalvany · 2023-06-02T19:52:15Z

This pull request aims to address #19 issue by adding a Tantivy index.
While there is still room for improvement, it might be a first step.

Configuration

Add a section [search] with a field directory that contains where Tantivy should store its files.

[search]
directory = "/tmp/alexandrie/tantivy

Search

As it uses QueryParser you can use full Tantivy query language.

By default search use all fields except name.prefix (see below) and suggester search amongst name, name.full and name.prefix (see below).

Implementation

`fts` module

This module contains all full text search related structures.

Tantivy structure handles all boiler plate to setup an index, search and suggest. It also delegate method to index document, commit documents.

TantivyDocument is a structure that represents a crate and can be converted into a Tantivy's Document

Indices

Crate's name are index multiple times to improve both result relevance of suggester and search.

name : a simple tokenized version of crate's name :
- tokenize on non alphanumeric character using SimpleTokenizer
- apply English stop words
- apply lower-casing to make search case insensitive
name.full : not tokenized, only lower-cased. It's main purpose is to increase relevance when the searched text match exactly a crate name
name.prefix : index word prefix to handle suggester.
- tokenize on non alphanumeric characters
- lower case
- apply a custom filter, edge ngram to index word prefixes.

Other fields that are indexed :

categories are index using the same pipeline as name.full as they should be amongst a precise list
keywords are index using the same pipeline as ̀name` as they are free text
description and readme use the same pipeline as name.

Note that at search time, we should not apply apply the edge ngram filter to reduce noise.

How to index

When Alexandrie starts, it index everything.

Things that still need work

~~Actually running indexer endpoint causes 500 HTTP error when trying to access UI. It comes from a lock on the database since I browse all crates for indexing in a single transaction.~~ Use run method and index at startup instead in an endpoint.
~~Need to change API search endpoint as I only change frontend search~~
~~New crates aren't yet indexed~~

Hirevo · 2023-06-07T17:38:24Z

Hi,

I've seen that you closed the old PR and named this one as draft.
Do you wish for me to convert this PR to an actual draft PR ?

I'm sorry for not having commented on your old PR, I wasn't sure if you were expecting/ready for a review or just general feedback.

As someone who is new to working with search engines, It is very fascinating for me to learn from your code how full text indexing works and how to optimize the relevancy of the results.

I have tried your branch on my local testing setup and the automatic and fast indexing of the crates (~34000 crates) and the very notable improvement of the search results, all without needing much configuration at all blew me away !

Feel free to ping me anytime if you want me to start reviewing some code or other more general feedback !

Dalvany · 2023-06-07T19:42:36Z

Hi @Hirevo I closed the old PR as there was conflicts and find easier to redo thing on a proper branch instead of the forked master.

If you could, yes it would be nice to make it a proper draft as I still have to work on the API search endpoint but I must confess I'm doing a pause right now.
If you wish, make any comments, feedback or ask for any explanation, I'll take care of things when I'll get back working on it :-)

For the performance part, I tested it with crates.io index (only indexing crate's name, no keywords or categories) and it got indexed in 15s (if I remember) when I do it on startup, while when I made a custom endpoint it took something like 10 minutes (I don't know how to explain such a difference, perhaps I didn't ran Alexandrie with release flag and/or it comes from the RWLock on the indexer).

Hirevo · 2023-06-07T20:15:43Z

I must confess I'm doing a pause right now

Understandable, no need to pressure yourself, there is no rush at all. :)

For the performance part, I tested it with crates.io index (only indexing crate's name, no keywords or categories) and it got indexed in 15s (if I remember) when I do it on startup

That's amazing !

My testing setup with the 34 000 crates is from the DB dump from crates.io and I have the rendered READMEs for all of them, but it is quite old, which is why it isn't up to the 116 000+ crates that crates.io has today.

I should get around to bring my setup more up-to-date to potentially also be able to contribute more substantial performance testing (even though I am not even sure if this kind of scale will ever be needed, and I suspect that people won't be constantly restarting their registry with 100 000+ crates).

Dalvany · 2023-06-07T20:27:28Z

I suspect that people won't be constantly restarting their registry with 100 000+ crates)

Yeah, you're right, especially since Alexandrie works great ! Good job by the way !

Dalvany · 2023-06-07T20:34:43Z

I have a remark for e2e testing, I think you should consider looking for integration testing framework. I only used two of them python's behave and cucumber-rs (they work the same way).
I don't know if they're really a good fit here but I think it might worth looking into it.

Dalvany · 2023-06-11T15:42:01Z

Hi @Hirevo
I added search to the api search endpoint. I had to introduce a limit to the number of result per page since you can't construct a Tantivy's TopDocs without. I set it to 15 and reflect that change in the book.

I don't index README as I don't think it's really useful: it will serve few use case. But if you want I can work on that.

Dalvany · 2023-06-11T16:06:43Z

There's a thing you should know, I use a crate I made, tantivy-analysis-contrib, to make the suggester work the way I intended to (doing the "word starts with XXX") as Tantivy provide's a ngram only as tokenizer not as a filter.
A tokenizer is responsible of breaking the whole text into tokens (basically "words"), when a filter is responsible to "transform" each token (lowercasing, generating synonyms, ...etc).

Hirevo · 2023-06-11T17:05:27Z

I have a remark for e2e testing, I think you should consider looking for integration testing framework. I only used two of them python's behave and cucumber-rs (they work the same way). I don't know if they're really a good fit here but I think it might worth looking into it.

There is definitely lots of improvements to be made to the E2E testing setup, even just in terms of performance, as it takes quite some time for them to run to completion (~95 % of it is spent just building the docker image).
What I like with the current setup though is that the scenarios are simple shell scripts in which we can run cargo commands just as a user would do them.

I had to introduce a limit to the number of result per page since you can't construct a Tantivy's TopDocs without. I set it to 15 and reflect that change in the book.

Yeah, thank you for doing that. Doing infinite results by default seems like a poor choice anyway.
One of my old comments in the code you removed seems to indicate that I was already a bit doubtful about this choice back then, not sure what made me go forward with it. :/

I don't index README as I don't think it's really useful: it will serve few use case. But if you want I can work on that.

No worries, my guess is that it can be added later if the need for it comes. You're right that the combination of the crate name, categories and keywords should be enough (making the crate more easily discoverable is their primary purpose already, so crate authors are likely to use them appropriately).

There's a thing you should know, I use a crate I made, tantivy-analysis-contrib, to make the suggester work the way I intended to (doing the "word starts with XXX") as Tantivy provide's a ngram only as tokenizer not as a filter.

That's not an issue, I am completely fine with that, but thanks for the heads up.
Nice work on that crate by the way, it seems like it adds some useful and non-trivial tokenizers.

I'll jump on to reviewing the PR, I'll try to have it completed either tonight or tomorrow.

Dalvany · 2023-06-11T21:00:42Z

A thing that I think I didn't describe: we can restrict search to a specific field by typing

field:text

For example looking for crates that contains the keyword log :

keywords:log

Hirevo

I don't have that much to comment about your implementation.
It all seems pretty good.

It is also really good functionality-wise.

The only potential remark on that front would be that an empty search in the API (when q is an empty string) leads to 0 results, instead of matching every crates in the registry, which can happen when doing cargo search with no terms.

But while crates.io behaves that way, the API docs found in the Cargo book don't mandate it to work that way (it is up to the "criteria defined on the server"), so I don't think it should be blocking anything.

It can be addressed later without much trouble, if we want to restore this behavior.

alexandrie.toml

crates/alexandrie/src/config/mod.rs

crates/alexandrie/src/frontend/search.rs

docker/sqlite/alexandrie.toml

crates/alexandrie/src/api/crates/search.rs

crates/alexandrie/src/fts/document.rs

crates/alexandrie/src/fts/index.rs

Hirevo · 2023-06-11T21:07:50Z

A thing that I think I didn't describe: we can restrict search to a specific field by typing
field:text
For example looking for crates that contains the keyword log :
keywords:log

Oh wow, I didn't notice that, that's really cool !

Dalvany · 2023-06-12T18:08:49Z

Thanks, I'll try to make the changes this week

Dalvany · 2023-06-12T18:27:48Z

I thought there was a comment on the zero result on empty search. It should be easy to solve using an AllQuery. I can make the change if you wish.

Dalvany · 2023-06-12T19:59:57Z

I applied everything, I hope I didn't missed something.
I added a commit to handle empty query using an AllQuery.
I made a few manual check to see if everything is correct :

typing in the search box to see if suggestion work properly
search for "log" in the frontend
running an empty search. Perhaps we should improve that by sorting alphabetically (but I don't know how to do that with tantivy, perhaps using IndexSortByField in index settings, but that's not done at query time).
call api/v1/search?q=log

Dalvany · 2023-06-12T20:23:21Z

Perhaps it would be a goof thing to add a few explanation on search in the book :

though it search also through keywords and category, it's mostly designed to search by crate's name
an or is made between words (it Tantivy's default see set conjonction)
explaining how to "filter" by keywords and categories (eg. actix +categories:"web")
perhaps adding a link to QueryParser documentation

(cherry picked from commit 22227a2)

(cherry picked from commit e5bb5b0)

(cherry picked from commit 217fc68)

(cherry picked from commit 6706797)

Instead, the RwLock is in the index writer to allow interior mutability for commit operation. Though IndexWriter should be thread safe, I don't know how to achieve mutability without the RwLock. (cherry picked from commit b2987f5)

In Tantivy's example, it is stated that for a search server you typically want one reader for the lifetime of the application (cherry picked from commit 6313c4e)

Indexing at startup is faster than indexing in an endpoint. 114k crates from crates.io get indexed within less than 10s. (cherry picked from commit efbcb7c)

(cherry picked from commit 2759c53)

(cherry picked from commit 3aad767)

(cherry picked from commit 3054040)

Co-authored-by: Nicolas Polomack <[email protected]>

Resolve conflict on Cargo.lock. Ran cargo update.

Dalvany · 2023-07-14T10:09:43Z

Hello @Hirevo, I made all the changes

Hirevo

Thanks for pinging me, I am sorry for being the bottleneck here.
This looks all good to me, I am OK to merge this.
I'll get to writing a dedicated section in the docs about the features enabled by this search engine.
Thank you so much for the work you've done on this and for putting up with my sluggishness.

Dalvany · 2023-07-19T16:56:21Z

Hello, don't worry :-)
Don't hesitate to ping me if there's something wrong or anything.

Dalvany force-pushed the add_tantivy_index branch from 383fdce to 5933fc1 Compare June 2, 2023 19:56

Hirevo added C-enhancement Category: Enhancement P-medium Priority: Medium M-frontend Module: Frontend labels Jun 7, 2023

Hirevo marked this pull request as draft June 7, 2023 19:57

Dalvany changed the title ~~Draft Add tantivy index~~ Add tantivy index Jun 11, 2023

Dalvany marked this pull request as ready for review June 11, 2023 16:29

Hirevo requested changes Jun 11, 2023

View reviewed changes

Dalvany added 8 commits June 27, 2023 19:54

Add tantivy index

801a0ed

(cherry picked from commit 22227a2)

Run ft

021dbc7

(cherry picked from commit e5bb5b0)

Fix clippy warning

56136dc

(cherry picked from commit 217fc68)

Reduce logs

96fc11f

(cherry picked from commit 6706797)

Remove RwLock on Tantivy

b6d6959

Instead, the RwLock is in the index writer to allow interior mutability for commit operation. Though IndexWriter should be thread safe, I don't know how to achieve mutability without the RwLock. (cherry picked from commit b2987f5)

Remove Index and add IndexReader

e761e4a

In Tantivy's example, it is stated that for a search server you typically want one reader for the lifetime of the application (cherry picked from commit 6313c4e)

Index at startup and remove endpoint

abb8bf3

Indexing at startup is faster than indexing in an endpoint. 114k crates from crates.io get indexed within less than 10s. (cherry picked from commit efbcb7c)

Run fmt

19e1be4

(cherry picked from commit 2759c53)

Dalvany and others added 14 commits June 27, 2023 19:54

Improve merging crates with keywords and categories

495ebf7

(cherry picked from commit 3aad767)

Fix and handle publish

d09b235

(cherry picked from commit 3054040)

Fix e2e test

097afff

Tantivy 0.20

494ecb6

Use tantivy on search endpoint

c6ec4f4

Improvement

58db05a

Fix search sort

c5752f2

Fix fmt

43b3c32

Apply suggestions from code review

5a998b9

Co-authored-by: Nicolas Polomack <[email protected]>

Fix following review

e3ce284

Remove readme from tantivy

48ab92a

Handle empty query

1e7f80e

Apply suggestions from code review

14ff5a9

Co-authored-by: Nicolas Polomack <[email protected]>

Update Cargo.lock

aa296f1

Resolve conflict on Cargo.lock. Ran cargo update.

Dalvany force-pushed the add_tantivy_index branch from feec52e to aa296f1 Compare June 27, 2023 17:59

Dalvany requested a review from Hirevo June 30, 2023 17:10

Hirevo approved these changes Jul 18, 2023

View reviewed changes

Hirevo merged commit 5629df1 into Hirevo:master Jul 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tantivy index #161

Add tantivy index #161

Dalvany commented Jun 2, 2023 •

edited

Loading

Hirevo commented Jun 7, 2023

Dalvany commented Jun 7, 2023

Hirevo commented Jun 7, 2023 •

edited

Loading

Dalvany commented Jun 7, 2023 •

edited

Loading

Dalvany commented Jun 7, 2023

Dalvany commented Jun 11, 2023

Dalvany commented Jun 11, 2023

Hirevo commented Jun 11, 2023

Dalvany commented Jun 11, 2023

Hirevo left a comment •

edited

Loading

Hirevo commented Jun 11, 2023

Dalvany commented Jun 12, 2023

Dalvany commented Jun 12, 2023

Dalvany commented Jun 12, 2023 •

edited

Loading

Dalvany commented Jun 12, 2023

Dalvany commented Jul 14, 2023

Hirevo left a comment

Dalvany commented Jul 19, 2023

Add tantivy index #161

Add tantivy index #161

Conversation

Dalvany commented Jun 2, 2023 • edited Loading

Configuration

Search

Implementation

fts module

Indices

How to index

Things that still need work

Hirevo commented Jun 7, 2023

Dalvany commented Jun 7, 2023

Hirevo commented Jun 7, 2023 • edited Loading

Dalvany commented Jun 7, 2023 • edited Loading

Dalvany commented Jun 7, 2023

Dalvany commented Jun 11, 2023

Dalvany commented Jun 11, 2023

Hirevo commented Jun 11, 2023

Dalvany commented Jun 11, 2023

Hirevo left a comment • edited Loading

Choose a reason for hiding this comment

Hirevo commented Jun 11, 2023

Dalvany commented Jun 12, 2023

Dalvany commented Jun 12, 2023

Dalvany commented Jun 12, 2023 • edited Loading

Dalvany commented Jun 12, 2023

Dalvany commented Jul 14, 2023

Hirevo left a comment

Choose a reason for hiding this comment

Dalvany commented Jul 19, 2023

Dalvany commented Jun 2, 2023 •

edited

Loading

`fts` module

Hirevo commented Jun 7, 2023 •

edited

Loading

Dalvany commented Jun 7, 2023 •

edited

Loading

Hirevo left a comment •

edited

Loading

Dalvany commented Jun 12, 2023 •

edited

Loading