Created an initial pluggable tokenizer with ngram support, in order to allow using lunr to drive autocomplete-style search boxes. #63
base: master
Conversation
/**
 * A tokenizer tha indexes on character bigrams.
s/tha/that/
Thanks -- I can see how I totally copy-pasted that same doc error.
Many thanks for taking the time to look into this. I think that an ngram tokeniser would make a great plugin for lunr. As part of the changes I am making for better i18n support, I am adding a very simple plugin system that I think you could take advantage of. It's great to have another potential use case for a plugin, so that I can make sure the API is flexible enough. Let me take a closer look through your changes and see if I can make some suggestions on how to extract this as a plugin. Thanks again!
Any update on this?
Hey, is there an ETA for merging this or the plugin system mentioned? Would love to use it!
@olivernn can this be merged in, or is the plugin system ready yet?
I would also like to contribute ngram analyzers for autocomplete. What is the status of this? It's been open for a year now, so I'm hesitant to do any more work on it.
The means to add plugins to lunr already exists. The main extension point is modifying an index's text-processing pipeline. Each index has its own pipeline, so a plugin can safely modify the pipeline of the index it is being applied to.

In these cases, though, I think the tokenizer needs to be modified. This is possible, but the tokenizer is global, not individual per index. All indexes will then be forced to use the replacement tokenizer, which may or may not be a problem. An example:

```javascript
var myNgramTokenizer = function () {
  lunr.tokenizer = function (obj) {
    // ngram implementation
  }
}

idx.use(myNgramTokenizer)
```

I'm not sure why the tokenizer is not a property of the lunr index instance; I will take a look at this.
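For illustration, the `// ngram implementation` above could look something like the following character-bigram tokenizer. This is only a sketch in plain JavaScript; the name `bigramTokenize` and the assumption that the tokenizer receives a string and returns an array of lowercased tokens are illustrative, not lunr's actual contract or part of this PR:

```javascript
// Hypothetical character-bigram tokenizer (illustrative only).
// Assumes the input is a string and the output is an array of
// lowercased two-character tokens.
function bigramTokenize(str) {
  var normalized = String(str).toLowerCase().trim();
  var tokens = [];
  for (var i = 0; i < normalized.length - 1; i++) {
    tokens.push(normalized.slice(i, i + 2));
  }
  return tokens;
}

console.log(bigramTokenize("Lunr")); // ["lu", "un", "nr"]
```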
@olivernn Great work! Any chance this could be merged? ngram and edge-ngram are must-haves nowadays... I'd love to see it built-in or as a plugin.
Is there anything we can do? |
I use this library all the time, thanks for making it available. One use case we keep running into is client-side autocomplete, and we have found that ngram indexing on the server -- usually Elasticsearch -- gives us the best results. I just need that functionality client-side, and in node.js, and don't care to fuss with going out of process to Elasticsearch if I can avoid it.
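Edge n-grams (word prefixes) are what Elasticsearch typically uses for autocomplete, and the same idea is easy to reproduce client-side. A minimal sketch; `edgeNgrams` is a hypothetical helper, not part of lunr or this PR:

```javascript
// Hypothetical edge n-gram generator: emits every prefix of a word
// between minLen and maxLen characters, the token shape commonly
// used for autocomplete indexes.
function edgeNgrams(word, minLen, maxLen) {
  var w = String(word).toLowerCase();
  var grams = [];
  var upper = Math.min(maxLen, w.length);
  for (var len = minLen; len <= upper; len++) {
    grams.push(w.slice(0, len));
  }
  return grams;
}

console.log(edgeNgrams("search", 2, 4)); // ["se", "sea", "sear"]
```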
I tried to follow along with your style and formatting, and hopefully did so to your satisfaction.
This sets up an index-level tokenizer. I didn't dive as far in as #21, since that implies field-level pipelines and tokenizers -- which would really require some extension to pipeline so that it 'starts' with a tokenizer and then streams through multiple filters, or some other field object that combines a tokenizer and a pipeline.
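The "start with a tokenizer, then stream through filters" idea described above can be sketched as a simple composition. All names here are illustrative, not lunr's actual API:

```javascript
// Sketch of a field-level processing chain: a tokenizer produces
// tokens, then each pipeline filter maps over them; filters that
// return a falsy value drop the token.
function runPipeline(tokenizer, filters, text) {
  return filters.reduce(function (tokens, filter) {
    return tokens.map(filter).filter(Boolean);
  }, tokenizer(text));
}

var whitespaceTokenizer = function (s) { return s.trim().split(/\s+/); };
var lowerCase = function (t) { return t.toLowerCase(); };

console.log(runPipeline(whitespaceTokenizer, [lowerCase], "Hello World"));
// ["hello", "world"]
```

A field object could then pair one such tokenizer with its own filter list, which is roughly what #21 implies.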