Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tantivy index #161

Merged
merged 22 commits into from
Jul 18, 2023
Merged

Add tantivy index #161

merged 22 commits into from
Jul 18, 2023

Conversation

Dalvany
Copy link
Contributor

@Dalvany Dalvany commented Jun 2, 2023

This pull request aims to address #19 issue by adding a Tantivy index.
While there is still room for improvement, it might be a first step.

Configuration

Add a section [search] with a field directory that contains where Tantivy should store its files.

[search]
directory = "/tmp/alexandrie/tantivy

Search

As it uses QueryParser you can use full Tantivy query language.

By default search use all fields except name.prefix (see below) and suggester search amongst name, name.full and name.prefix (see below).

Implementation

fts module

This module contains all full text search related structures.

Tantivy structure handles all boiler plate to setup an index, search and suggest. It also delegate method to index document, commit documents.

TantivyDocument is a structure that represents a crate and can be converted into a Tantivy's Document

Indices

Crate's name are index multiple times to improve both result relevance of suggester and search.

  • name : a simple tokenized version of crate's name :
    • tokenize on non alphanumeric character using SimpleTokenizer
    • apply English stop words
    • apply lower-casing to make search case insensitive
  • name.full : not tokenized, only lower-cased. It's main purpose is to increase relevance when the searched text match exactly a crate name
  • name.prefix : index word prefix to handle suggester.
    • tokenize on non alphanumeric characters
    • lower case
    • apply a custom filter, edge ngram to index word prefixes.

Other fields that are indexed :

  • categories are index using the same pipeline as name.full as they should be amongst a precise list
  • keywords are index using the same pipeline as ̀name` as they are free text
  • description and readme use the same pipeline as name.

Note that at search time, we should not apply apply the edge ngram filter to reduce noise.

How to index

When Alexandrie starts, it index everything.

Things that still need work

  • Actually running indexer endpoint causes 500 HTTP error when trying to access UI. It comes from a lock on the database since I browse all crates for indexing in a single transaction. Use run method and index at startup instead in an endpoint.
  • Need to change API search endpoint as I only change frontend search
  • New crates aren't yet indexed

@Dalvany Dalvany force-pushed the add_tantivy_index branch from 383fdce to 5933fc1 Compare June 2, 2023 19:56
@Hirevo Hirevo added C-enhancement Category: Enhancement P-medium Priority: Medium M-frontend Module: Frontend labels Jun 7, 2023
@Hirevo
Copy link
Owner

Hirevo commented Jun 7, 2023

Hi,

I've seen that you closed the old PR and named this one as draft.
Do you wish for me to convert this PR to an actual draft PR ?

I'm sorry for not having commented on your old PR, I wasn't sure if you were expecting/ready for a review or just general feedback.

As someone who is new to working with search engines, It is very fascinating for me to learn from your code how full text indexing works and how to optimize the relevancy of the results.

I have tried your branch on my local testing setup and the automatic and fast indexing of the crates (~34000 crates) and the very notable improvement of the search results, all without needing much configuration at all blew me away !

Feel free to ping me anytime if you want me to start reviewing some code or other more general feedback !

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 7, 2023

Hi @Hirevo I closed the old PR as there was conflicts and find easier to redo thing on a proper branch instead of the forked master.

If you could, yes it would be nice to make it a proper draft as I still have to work on the API search endpoint but I must confess I'm doing a pause right now.
If you wish, make any comments, feedback or ask for any explanation, I'll take care of things when I'll get back working on it :-)

For the performance part, I tested it with crates.io index (only indexing crate's name, no keywords or categories) and it got indexed in 15s (if I remember) when I do it on startup, while when I made a custom endpoint it took something like 10 minutes (I don't know how to explain such a difference, perhaps I didn't ran Alexandrie with release flag and/or it comes from the RWLock on the indexer).

@Hirevo Hirevo marked this pull request as draft June 7, 2023 19:57
@Hirevo
Copy link
Owner

Hirevo commented Jun 7, 2023

I must confess I'm doing a pause right now

Understandable, no need to pressure yourself, there is no rush at all. :)

For the performance part, I tested it with crates.io index (only indexing crate's name, no keywords or categories) and it got indexed in 15s (if I remember) when I do it on startup

That's amazing !

My testing setup with the 34 000 crates is from the DB dump from crates.io and I have the rendered READMEs for all of them, but it is quite old, which is why it isn't up to the 116 000+ crates that crates.io has today.

I should get around to bring my setup more up-to-date to potentially also be able to contribute more substantial performance testing (even though I am not even sure if this kind of scale will ever be needed, and I suspect that people won't be constantly restarting their registry with 100 000+ crates).

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 7, 2023

I suspect that people won't be constantly restarting their registry with 100 000+ crates)

Yeah, you're right, especially since Alexandrie works great ! Good job by the way !

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 7, 2023

I have a remark for e2e testing, I think you should consider looking for integration testing framework. I only used two of them python's behave and cucumber-rs (they work the same way).
I don't know if they're really a good fit here but I think it might worth looking into it.

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 11, 2023

Hi @Hirevo
I added search to the api search endpoint. I had to introduce a limit to the number of result per page since you can't construct a Tantivy's TopDocs without. I set it to 15 and reflect that change in the book.

I don't index README as I don't think it's really useful: it will serve few use case. But if you want I can work on that.

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 11, 2023

There's a thing you should know, I use a crate I made, tantivy-analysis-contrib, to make the suggester work the way I intended to (doing the "word starts with XXX") as Tantivy provide's a ngram only as tokenizer not as a filter.
A tokenizer is responsible of breaking the whole text into tokens (basically "words"), when a filter is responsible to "transform" each token (lowercasing, generating synonyms, ...etc).

@Dalvany Dalvany changed the title Draft Add tantivy index Add tantivy index Jun 11, 2023
@Dalvany Dalvany marked this pull request as ready for review June 11, 2023 16:29
@Hirevo
Copy link
Owner

Hirevo commented Jun 11, 2023

I have a remark for e2e testing, I think you should consider looking for integration testing framework. I only used two of them python's behave and cucumber-rs (they work the same way). I don't know if they're really a good fit here but I think it might worth looking into it.

There is definitely lots of improvements to be made to the E2E testing setup, even just in terms of performance, as it takes quite some time for them to run to completion (~95 % of it is spent just building the docker image).
What I like with the current setup though is that the scenarios are simple shell scripts in which we can run cargo commands just as a user would do them.

I had to introduce a limit to the number of result per page since you can't construct a Tantivy's TopDocs without. I set it to 15 and reflect that change in the book.

Yeah, thank you for doing that. Doing infinite results by default seems like a poor choice anyway.
One of my old comments in the code you removed seems to indicate that I was already a bit doubtful about this choice back then, not sure what made me go forward with it. :/

I don't index README as I don't think it's really useful: it will serve few use case. But if you want I can work on that.

No worries, my guess is that it can be added later if the need for it comes. You're right that the combination of the crate name, categories and keywords should be enough (making the crate more easily discoverable is their primary purpose already, so crate authors are likely to use them appropriately).

There's a thing you should know, I use a crate I made, tantivy-analysis-contrib, to make the suggester work the way I intended to (doing the "word starts with XXX") as Tantivy provide's a ngram only as tokenizer not as a filter.

That's not an issue, I am completely fine with that, but thanks for the heads up.
Nice work on that crate by the way, it seems like it adds some useful and non-trivial tokenizers.

I'll jump on to reviewing the PR, I'll try to have it completed either tonight or tomorrow.

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 11, 2023

A thing that I think I didn't describe: we can restrict search to a specific field by typing

field:text

For example looking for crates that contains the keyword log :

keywords:log

Copy link
Owner

@Hirevo Hirevo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have that much to comment about your implementation.
It all seems pretty good.

It is also really good functionality-wise.

The only potential remark on that front would be that an empty search in the API (when q is an empty string) leads to 0 results, instead of matching every crates in the registry, which can happen when doing cargo search with no terms.

But while crates.io behaves that way, the API docs found in the Cargo book don't mandate it to work that way (it is up to the "criteria defined on the server"), so I don't think it should be blocking anything.

It can be addressed later without much trouble, if we want to restore this behavior.

alexandrie.toml Outdated Show resolved Hide resolved
crates/alexandrie/src/config/mod.rs Outdated Show resolved Hide resolved
crates/alexandrie/src/frontend/search.rs Outdated Show resolved Hide resolved
crates/alexandrie/src/frontend/search.rs Outdated Show resolved Hide resolved
crates/alexandrie/src/frontend/search.rs Outdated Show resolved Hide resolved
docker/sqlite/alexandrie.toml Outdated Show resolved Hide resolved
crates/alexandrie/src/api/crates/search.rs Outdated Show resolved Hide resolved
crates/alexandrie/src/api/crates/search.rs Outdated Show resolved Hide resolved
crates/alexandrie/src/fts/document.rs Outdated Show resolved Hide resolved
crates/alexandrie/src/fts/index.rs Show resolved Hide resolved
@Hirevo
Copy link
Owner

Hirevo commented Jun 11, 2023

A thing that I think I didn't describe: we can restrict search to a specific field by typing

field:text

For example looking for crates that contains the keyword log :

keywords:log

Oh wow, I didn't notice that, that's really cool !

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 12, 2023

Thanks, I'll try to make the changes this week

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 12, 2023

I thought there was a comment on the zero result on empty search. It should be easy to solve using an AllQuery. I can make the change if you wish.

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 12, 2023

I applied everything, I hope I didn't missed something.
I added a commit to handle empty query using an AllQuery.
I made a few manual check to see if everything is correct :

  • typing in the search box to see if suggestion work properly
  • search for "log" in the frontend
  • running an empty search. Perhaps we should improve that by sorting alphabetically (but I don't know how to do that with tantivy, perhaps using IndexSortByField in index settings, but that's not done at query time).
  • call api/v1/search?q=log

@Dalvany
Copy link
Contributor Author

Dalvany commented Jun 12, 2023

Perhaps it would be a goof thing to add a few explanation on search in the book :

  • though it search also through keywords and category, it's mostly designed to search by crate's name
  • an or is made between words (it Tantivy's default see set conjonction)
  • explaining how to "filter" by keywords and categories (eg. actix +categories:"web")
  • perhaps adding a link to QueryParser documentation

Dalvany added 8 commits June 27, 2023 19:54
(cherry picked from commit 22227a2)
(cherry picked from commit e5bb5b0)
(cherry picked from commit 217fc68)
(cherry picked from commit 6706797)
Instead, the RwLock is in the index writer to allow interior mutability
for commit operation. Though IndexWriter should be thread safe, I don't
know how to achieve mutability without the RwLock.

(cherry picked from commit b2987f5)
In Tantivy's example, it is stated that for a search server you typically
want one reader for the lifetime of the application

(cherry picked from commit 6313c4e)
Indexing at startup is faster than indexing in an endpoint. 114k crates
from crates.io get indexed within less than 10s.

(cherry picked from commit efbcb7c)
(cherry picked from commit 2759c53)
@Dalvany Dalvany force-pushed the add_tantivy_index branch from feec52e to aa296f1 Compare June 27, 2023 17:59
@Dalvany Dalvany requested a review from Hirevo June 30, 2023 17:10
@Dalvany
Copy link
Contributor Author

Dalvany commented Jul 14, 2023

Hello @Hirevo, I made all the changes

Copy link
Owner

@Hirevo Hirevo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pinging me, I am sorry for being the bottleneck here.
This looks all good to me, I am OK to merge this.
I'll get to writing a dedicated section in the docs about the features enabled by this search engine.
Thank you so much for the work you've done on this and for putting up with my sluggishness.

@Hirevo Hirevo merged commit 5629df1 into Hirevo:master Jul 18, 2023
@Dalvany
Copy link
Contributor Author

Dalvany commented Jul 19, 2023

Hello, don't worry :-)
Don't hesitate to ping me if there's something wrong or anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: Enhancement M-frontend Module: Frontend P-medium Priority: Medium
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants