Problem in Spanish: doesn't work if word isn't using accent mark. #59

Open
jigarzon opened this issue Oct 7, 2019 · 2 comments

@jigarzon

jigarzon commented Oct 7, 2019

Let's say I have an index created. The Spanish word "Respiración" is stemmed as:
"respir"

That's correct.

Now I run a search, but the user leaves out the accent mark and types "respiracion" (no accent on the last "o"). lunr won't stem that word and leaves it as "respiracion", so no matches are found.

I know stemming assumes the word is spelled correctly, but almost no user types accents correctly when searching, so this makes lunr useless for many words.

@jigarzon
Author

jigarzon commented Oct 7, 2019

I made a workaround: removing accents before the stemmer in the pipeline (I remove accents using normalize-strings).

But this also throws away a lot of the benefit of stemming, because those words will never be stemmed.

var lunr = require('lunr');
var normalize = require('normalize-strings');

// lunr plugin: strips accents from every token before the stemmer runs
var normalizeLunrPlugin = function(builder, stemmer) {
  var pipelineFunction = function(token) {
    return token.update(function(word) {
      return normalize(word);
    });
  };

  // Register the pipeline function so the index can be serialised
  lunr.Pipeline.registerFunction(pipelineFunction, 'normalizeLunrPlugin');

  // Add the pipeline function to both the indexing pipeline and the
  // searching pipeline
  builder.pipeline.before(stemmer, pipelineFunction);
  builder.searchPipeline.before(stemmer, pipelineFunction);
};
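
For readers who want to see what the accent-removal step does without the normalize-strings dependency, here is a minimal sketch using only built-in string methods. The function name `stripSpanishAccents` is hypothetical, and the choice to remove only the acute accent and the diaeresis (so that "ñ" survives) is an assumption for illustration; it is not necessarily what normalize-strings does.

```javascript
// Hypothetical helper, not part of lunr or normalize-strings.
// Decomposes each character (NFD), drops the combining acute accent
// (U+0301) and diaeresis (U+0308), then recomposes (NFC) so that
// letters such as "ñ" are kept as they were.
function stripSpanishAccents(word) {
  return word
    .normalize("NFD")
    .replace(/[\u0301\u0308]/g, "")
    .normalize("NFC");
}

console.log(stripSpanishAccents("Respiración")); // Respiracion
console.log(stripSpanishAccents("pingüino"));    // pinguino
console.log(stripSpanishAccents("niño"));        // niño
```

Inside the plugin above, a function like this could stand in for `normalize(word)` in the `token.update` callback.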

@jigarzon
Author

jigarzon commented Oct 7, 2019

My suggestion is to run two stemmers in the pipeline, one for accented and one for unaccented words, so that "respiracion" without the accent, which the first stemmer leaves intact, is picked up by the second one and stemmed correctly...
