-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tantivy index #161
Add tantivy index #161
Conversation
383fdce
to
5933fc1
Compare
Hi, I've seen that you closed the old PR and named this one as draft. I'm sorry for not having commented on your old PR, I wasn't sure if you were expecting/ready for a review or just general feedback. As someone who is new to working with search engines, It is very fascinating for me to learn from your code how full text indexing works and how to optimize the relevancy of the results. I have tried your branch on my local testing setup and the automatic and fast indexing of the crates (~34000 crates) and the very notable improvement of the search results, all without needing much configuration at all blew me away ! Feel free to ping me anytime if you want me to start reviewing some code or other more general feedback ! |
Hi @Hirevo I closed the old PR as there was conflicts and find easier to redo thing on a proper branch instead of the forked master. If you could, yes it would be nice to make it a proper draft as I still have to work on the API search endpoint but I must confess I'm doing a pause right now. For the performance part, I tested it with crates.io index (only indexing crate's name, no keywords or categories) and it got indexed in 15s (if I remember) when I do it on startup, while when I made a custom endpoint it took something like 10 minutes (I don't know how to explain such a difference, perhaps I didn't ran Alexandrie with release flag and/or it comes from the RWLock on the indexer). |
Understandable, no need to pressure yourself, there is no rush at all. :)
That's amazing ! My testing setup with the 34 000 crates is from the DB dump from crates.io and I have the rendered READMEs for all of them, but it is quite old, which is why it isn't up to the 116 000+ crates that crates.io has today. I should get around to bring my setup more up-to-date to potentially also be able to contribute more substantial performance testing (even though I am not even sure if this kind of scale will ever be needed, and I suspect that people won't be constantly restarting their registry with 100 000+ crates). |
Yeah, you're right, especially since Alexandrie works great ! Good job by the way ! |
I have a remark for e2e testing, I think you should consider looking for integration testing framework. I only used two of them python's behave and cucumber-rs (they work the same way). |
Hi @Hirevo I don't index README as I don't think it's really useful: it will serve few use case. But if you want I can work on that. |
There's a thing you should know, I use a crate I made, |
There is definitely lots of improvements to be made to the E2E testing setup, even just in terms of performance, as it takes quite some time for them to run to completion (~95 % of it is spent just building the docker image).
Yeah, thank you for doing that. Doing infinite results by default seems like a poor choice anyway.
No worries, my guess is that it can be added later if the need for it comes. You're right that the combination of the crate name, categories and keywords should be enough (making the crate more easily discoverable is their primary purpose already, so crate authors are likely to use them appropriately).
That's not an issue, I am completely fine with that, but thanks for the heads up. I'll jump on to reviewing the PR, I'll try to have it completed either tonight or tomorrow. |
A thing that I think I didn't describe: we can restrict search to a specific field by typing
For example looking for crates that contains the keyword
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have that much to comment about your implementation.
It all seems pretty good.
It is also really good functionality-wise.
The only potential remark on that front would be that an empty search in the API (when q
is an empty string) leads to 0 results, instead of matching every crates in the registry, which can happen when doing cargo search
with no terms.
But while crates.io behaves that way, the API docs found in the Cargo book don't mandate it to work that way (it is up to the "criteria defined on the server"), so I don't think it should be blocking anything.
It can be addressed later without much trouble, if we want to restore this behavior.
Oh wow, I didn't notice that, that's really cool ! |
Thanks, I'll try to make the changes this week |
I thought there was a comment on the zero result on empty search. It should be easy to solve using an AllQuery. I can make the change if you wish. |
I applied everything, I hope I didn't missed something.
|
Perhaps it would be a goof thing to add a few explanation on search in the book :
|
(cherry picked from commit 22227a2)
(cherry picked from commit 217fc68)
(cherry picked from commit 6706797)
Instead, the RwLock is in the index writer to allow interior mutability for commit operation. Though IndexWriter should be thread safe, I don't know how to achieve mutability without the RwLock. (cherry picked from commit b2987f5)
In Tantivy's example, it is stated that for a search server you typically want one reader for the lifetime of the application (cherry picked from commit 6313c4e)
Indexing at startup is faster than indexing in an endpoint. 114k crates from crates.io get indexed within less than 10s. (cherry picked from commit efbcb7c)
(cherry picked from commit 3aad767)
(cherry picked from commit 3054040)
Co-authored-by: Nicolas Polomack <[email protected]>
Co-authored-by: Nicolas Polomack <[email protected]>
Resolve conflict on Cargo.lock. Ran cargo update.
feec52e
to
aa296f1
Compare
Hello @Hirevo, I made all the changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pinging me, I am sorry for being the bottleneck here.
This looks all good to me, I am OK to merge this.
I'll get to writing a dedicated section in the docs about the features enabled by this search engine.
Thank you so much for the work you've done on this and for putting up with my sluggishness.
Hello, don't worry :-) |
This pull request aims to address #19 issue by adding a Tantivy index.
While there is still room for improvement, it might be a first step.
Configuration
Add a section [search] with a field
directory
that contains where Tantivy should store its files.Search
As it uses QueryParser you can use full Tantivy query language.
By default search use all fields except
name.prefix
(see below) and suggester search amongstname
,name.full
andname.prefix
(see below).Implementation
fts
moduleThis module contains all full text search related structures.
Tantivy
structure handles all boiler plate to setup an index, search and suggest. It also delegate method to index document, commit documents.TantivyDocument
is a structure that represents a crate and can be converted into a Tantivy's DocumentIndices
Crate's name are index multiple times to improve both result relevance of suggester and search.
name
: a simple tokenized version of crate's name :name.full
: not tokenized, only lower-cased. It's main purpose is to increase relevance when the searched text match exactly a crate namename.prefix
: index word prefix to handle suggester.Other fields that are indexed :
categories
are index using the same pipeline asname.full
as they should be amongst a precise listkeywords
are index using the same pipeline as ̀name` as they are free textdescription
andreadme
use the same pipeline asname
.Note that at search time, we should not apply apply the edge ngram filter to reduce noise.
How to index
When Alexandrie starts, it index everything.
Things that still need work
Actually running indexer endpoint causes 500 HTTP error when trying to access UI. It comes from a lock on the database since I browse all crates for indexing in a single transaction.Use run method and index at startup instead in an endpoint.Need to change API search endpoint as I only change frontend searchNew crates aren't yet indexed