Separate document and query token pipelines? #105
Comments
That is an interesting concept. I don't think I've come across it applied to full-text search before. The issue I see is that, while the index size would be roughly the same (a stemmer does not create additional terms, though some terms may end up merged by stemming), the query cost could skyrocket. Let's use the At query time, this would be expanded into every possible inflection. This means that rather than stemming and getting a one-term query, you're looking at a boolean should query with 20+ clauses. It is definitely an interesting case, although one I think most people would not use, particularly for non-English languages. German in particular would get mighty spicy, seeing as words can be part of other words. I've done this in the PR (#106) I have waiting, which substantially changes a whole bunch of things under the hood. When it is merged, you'll be able to use it; feel free to test out the branch if you're curious.
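The cost trade-off described in this comment can be sketched roughly as follows. This is a library-agnostic toy, not the project's API: `VARIATIONS`, `toy_stem`, `build_index`, and `search_any` are hypothetical names, the "stemmer" is just a lookup table, and the word list is made up for illustration.

```python
from typing import Callable, Dict, List, Set

# Hypothetical variations table in the spirit of Whoosh's "variations".
# A real table could hold 20+ forms per word; six are enough for a sketch.
VARIATIONS: Dict[str, List[str]] = {
    "render": ["render", "renders", "rendered", "rendering", "renderer", "renderers"],
}

# Toy "stemmer": a reverse lookup from each variant to its root (assumption,
# not a real stemming algorithm).
STEM_OF: Dict[str, str] = {
    variant: root for root, variants in VARIATIONS.items() for variant in variants
}


def toy_stem(token: str) -> str:
    return STEM_OF.get(token, token)


def build_index(docs: Dict[int, str], normalize: Callable[[str], str]) -> Dict[str, Set[int]]:
    """Build a minimal inverted index: normalized token -> set of document ids."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


def search_any(index: Dict[str, Set[int]], terms: List[str]) -> Set[int]:
    """Boolean 'should' query: one index lookup per term, results unioned."""
    hits: Set[int] = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


DOCS = {
    1: "the renderer crashed",
    2: "rendering took a while",
    3: "nothing relevant here",
}

# Approach A: stem at index time and at query time -> a one-term query.
stemmed_index = build_index(DOCS, toy_stem)
stemmed_query = [toy_stem("rendering")]

# Approach B: index raw tokens, expand at query time -> one clause per variation.
raw_index = build_index(DOCS, lambda token: token)
expanded_query = VARIATIONS.get("render", ["render"])

print(len(stemmed_query), search_any(stemmed_index, stemmed_query))
print(len(expanded_query), search_any(raw_index, expanded_query))
```

Both approaches find the same documents here, but the expanded query performs one lookup per variation, which is the cost the comment is concerned about.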
My use case was for synonyms, rather than the stemming-style variations shown in the Whoosh example, so I have a more limited set of variations expanded into the query (which are then matched against stemmed words in the index as normal). Perhaps there are other use cases where you would want to change the available variations dynamically at runtime without reindexing. I will watch your PR 👍
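A minimal sketch of the synonym use case, assuming two separate pipelines: documents are only stemmed, while queries are first expanded with synonyms and then stemmed. Every name here (`SYNONYMS`, `TOY_STEMS`, `run_pipeline`, and so on) is hypothetical and not taken from the project; the stemmer is again a toy lookup table.

```python
from typing import Callable, Dict, List

Step = Callable[[List[str]], List[str]]

# Hypothetical query-time synonym table; it can be edited at runtime
# without touching the index.
SYNONYMS: Dict[str, List[str]] = {"fast": ["fast", "quickly"]}

# Toy stemmer data (assumption, not a real stemming algorithm).
TOY_STEMS: Dict[str, str] = {"quickly": "quick", "cars": "car"}


def stem(tokens: List[str]) -> List[str]:
    return [TOY_STEMS.get(token, token) for token in tokens]


def expand_synonyms(tokens: List[str]) -> List[str]:
    expanded: List[str] = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def run_pipeline(text: str, steps: List[Step]) -> List[str]:
    """Tokenize on whitespace, then apply each pipeline step in order."""
    tokens = text.lower().split()
    for step in steps:
        tokens = step(tokens)
    return tokens


# Documents are stemmed only; queries are expanded first, then stemmed.
DOCUMENT_PIPELINE: List[Step] = [stem]
QUERY_PIPELINE: List[Step] = [expand_synonyms, stem]

doc_tokens = run_pipeline("Quick cars", DOCUMENT_PIPELINE)
query_tokens = run_pipeline("fast", QUERY_PIPELINE)

# The document matches if any expanded query token appears among its tokens.
matches = any(token in doc_tokens for token in query_tokens)
print(doc_tokens, query_tokens, matches)
```

Because the synonym table is consulted only at query time, adding or removing an entry changes search results immediately, with no reindexing, and the expansion stays small compared with enumerating every stemming variation.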
@anentropic there has been no sign of updates on the PR. Would you like me to publish it as its own package on npm so you can get a feel for the changes and see if it fits your goals?
Does this mean there is one pipeline that is applied to both document tokens and query tokens?
For some use cases it would be useful to have separate pipelines for documents and queries, for example implementing something like "variations" in Whoosh, where a single token is expanded to a list of tokens at query time:
https://whoosh.readthedocs.io/en/latest/stemming.html#variations
Is that possible?