Indexing and search pipelines are mismatched with language support #149

dhdaines · 2024-07-04T16:36:57Z

I notice that when using language support, some words cannot be searched:

from lunr import lunr

index = lunr(
    ref="id",
    fields=["texte"],
    documents=[{"id": "1", "texte": "Allô tout le monde!"}],
    languages="fr",
)
print(index.search("allo"))  # prints [], should print something!

This would seem to be due to the missing trimmer in the search pipeline:

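from lunr import get_default_builder

# Pipeline used at indexing time
print(get_default_builder("fr").pipeline)
# <Pipeline stack="lunr-multi-trimmer-fr,stopWordFilter-fr,stemmer-fr">
# Pipeline used at search time
print(index.pipeline)
# <Pipeline stack="stemmer-fr">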

I'm not sure why, but the trimmer seems to treat ô as a character to trim (note 'all' in the index):

print(index.serialize()["invertedIndex"])
# [['all', {'texte': {'1': defaultdict(<class 'list'>, {})}, '_index': 0}], ['mond', {'texte': {'1': defaultdict(<class 'list'>, {})}, '_index': 2}], ['tout', {'texte': {'1': defaultdict(<class 'list'>, {})}, '_index': 1}]]
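A minimal sketch of how a trimmer can end up stripping ô, assuming it is built from a regex character class of "word characters" for the language. The helper `make_trimmer` and both character classes are hypothetical stand-ins for illustration, not the actual lunr.py code:

```python
import re


def make_trimmer(word_characters: str):
    """Build a trimmer that strips leading/trailing characters outside the class."""
    start_re = re.compile(f"^[^{word_characters}]+")
    end_re = re.compile(f"[^{word_characters}]+$")

    def trim(token: str) -> str:
        # Strip non-word characters from the start, then from the end.
        return end_re.sub("", start_re.sub("", token))

    return trim


# Hypothetical incomplete class: accented letters are missing.
incomplete = make_trimmer("A-Za-z")
# Hypothetical fuller class that also covers Latin-1 accented letters.
fuller = make_trimmer("A-Za-zÀ-ÖØ-öø-ÿ")

print(incomplete("allô"))  # all
print(fuller("allô"))      # allô
print(fuller("monde!"))    # monde
```

With the incomplete class, the trailing ô is treated like punctuation, which matches the 'all' entry in the serialized index.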

So, there are really two problems:

1. The trimmer has odd ideas about which characters belong to the language (known problem, see https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/languages/trimmer.py#L7).
2. The trimmer and stopword filter are not in the search pipeline.
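To illustrate why a mismatch between the two pipelines breaks search, here is a toy sketch in plain Python. The functions `toy_trimmer` and `toy_stemmer` are hypothetical stand-ins (the stemmer just drops a trailing "e"); they are not the real French trimmer or stemmer:

```python
import re
from typing import Callable, List

Step = Callable[[str], str]


def run(pipeline: List[Step], token: str) -> str:
    """Pass a token through each pipeline step in order."""
    for step in pipeline:
        token = step(token)
    return token


def toy_trimmer(token: str) -> str:
    # Stand-in trimmer: strips anything outside a-z at either end.
    return re.sub(r"^[^a-z]+|[^a-z]+$", "", token)


def toy_stemmer(token: str) -> str:
    # Stand-in stemmer: drops a single trailing "e".
    return token[:-1] if token.endswith("e") else token


index_pipeline = [toy_trimmer, toy_stemmer]  # used when building the index
search_pipeline = [toy_stemmer]              # used when querying

for raw in ("allô", "monde!"):
    indexed = run(index_pipeline, raw)
    queried = run(search_pipeline, raw)
    print(raw, "->", indexed, "vs", queried, "match:", indexed == queried)
```

The same raw token yields different terms at index time and at search time, so a lookup in the inverted index finds nothing.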

dhdaines · 2024-07-04T16:40:47Z

For (1), I can just extract the word characters from the Node code; it's quite easy to do...

dhdaines · 2024-07-04T18:14:58Z

For (2), it seems like this might be on purpose:

https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/lunr.py#L66

Can you explain why? Bug-compatibility with lunr.js? (EDIT: yes, bug-compatibility, it appears)

dhdaines · 2024-07-04T19:43:51Z

After digging a bit more, it appears this is due to the difficulty of registering the necessary trimmers and stopword filters when a serialized index is reloaded. Only the stemmers are registered: https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/languages/__init__.py#L99
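A rough sketch of the underlying constraint, assuming a serialized pipeline stores only function labels that must be resolved against a registry at load time. `REGISTRY`, `register`, and `load_pipeline` are hypothetical names for illustration, not the lunr.py API:

```python
from typing import Callable, Dict, List

# Hypothetical label -> function registry, filled at import/setup time.
REGISTRY: Dict[str, Callable[[str], str]] = {}


def register(label: str, fn: Callable[[str], str]) -> None:
    REGISTRY[label] = fn


def load_pipeline(labels: List[str]) -> List[Callable[[str], str]]:
    """Resolve serialized labels to functions; fail if any label is unknown."""
    missing = [label for label in labels if label not in REGISTRY]
    if missing:
        raise LookupError(f"unregistered pipeline functions: {missing}")
    return [REGISTRY[label] for label in labels]


serialized = ["lunr-multi-trimmer-fr", "stopWordFilter-fr", "stemmer-fr"]

# Only the stemmer is registered, so loading the full pipeline fails.
register("stemmer-fr", lambda token: token)
try:
    load_pipeline(serialized)
except LookupError as exc:
    error = str(exc)
    print(error)

# Once the other functions are registered, the same labels resolve.
register("lunr-multi-trimmer-fr", lambda token: token)
register("stopWordFilter-fr", lambda token: token)
pipeline = load_pipeline(serialized)
print(len(pipeline))  # 3
```

Under this assumption, any function added to the search pipeline has to be registered before the serialized index is loaded.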

The workaround I found is to explicitly add them to search_pipeline in the builder, then explicitly call get_nltk_builder for the language(s) in question before loading the serialized index, e.g.:

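from lunr import get_default_builder
from lunr.index import Index
from lunr.languages import get_nltk_builder

builder = get_default_builder("fr")  # assumed: the builder used to create the index
# Insert the trimmer and stopword filter ahead of the stemmer in the search pipeline.
for funcname in ("lunr-multi-trimmer-fr", "stopWordFilter-fr"):
    builder.search_pipeline.before(
        builder.search_pipeline.registered_functions["stemmer-fr"],
        builder.search_pipeline.registered_functions[funcname],
    )

...

# Register the French pipeline functions before loading the serialized index.
get_nltk_builder(["fr"])
index = Index.load(...)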

dhdaines · 2024-07-05T16:29:37Z

(2) is addressed in #151 now

dhdaines · 2024-07-10T21:42:48Z

I've submitted a PR to lunr-languages to fix the problem with the trimmer missing important characters (it wasn't passing its own test suite): MihaiValentin/lunr-languages#115

I think that we can re-use the same JS code that generates the lunr-languages trimmers, stemmers, and stopword filters to generate Python code for lunr.py. I hope to open a new PR soon that does this and addresses this issue!

dhdaines mentioned this issue Jul 4, 2024

feat: extract wordchars from lunr-languages #150

Closed

dhdaines mentioned this issue Jul 5, 2024

Pipeline should have an insert and/or a replace method #155

Open
