
Add flag to permit synset lookup without stemming #17

Open
stevenbird opened this issue Oct 10, 2019 · 7 comments
Labels
enhancement New feature or request

Comments

@stevenbird
Member

Cf nltk/nltk#2421

I propose that we add a stem=False flag to wn.synsets().

It means that default behaviour for English will change, but I see no other option, given that stemming only happens for English wordnet. This would make behaviour consistent across languages.

@alvations
Collaborator

alvations commented Oct 28, 2019

This is actually a little complicated. WordNet access has been heavily dependent on the morphy algorithm to fetch synsets, and setting stem=False would end up skipping all the exception lists that exist for English, e.g.

>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synsets('geese')
[Synset('goose.n.01'), Synset('fathead.n.01'), Synset('goose.n.03')]
>>> wn.synsets('mice')
[Synset('mouse.n.01'), Synset('shiner.n.01'), Synset('mouse.n.03'), Synset('mouse.n.04')]

wn.synsets() is not exactly doing stemming but lemmatization, through morphy(). I would suggest exposing a use_morphy=True default argument instead. If a lemma is found directly from the user's input to wn.synsets(), then skip morphy. Otherwise, check whether the use_morphy argument is on and lemmatize with morphy when necessary.
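The lookup order proposed above could be sketched roughly like this (a toy index and exception list stand in for the real wn data; use_morphy and the toy_* names are illustrative, not an existing API):

```python
# Toy sketch of the proposed lookup order; INDEX and EXCEPTIONS stand in
# for the real WordNet index and morphy's exception lists.
EXCEPTIONS = {'geese': 'goose', 'mice': 'mouse'}   # irregular forms
INDEX = {'goose': ['goose.n.01'], 'mouse': ['mouse.n.01']}

def toy_morphy(word):
    # Check the exception list first; rule-based suffix stripping elided.
    return EXCEPTIONS.get(word, word)

def toy_synsets(word, use_morphy=True):
    if word in INDEX:          # direct hit on the user's input: skip morphy
        return INDEX[word]
    if use_morphy:             # only lemmatize when the flag is on
        return INDEX.get(toy_morphy(word), [])
    return []
```

With this ordering, toy_synsets('mice') still resolves the irregular plural via the exception list, while use_morphy=False makes the same call return nothing.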

@alvations
Collaborator

Also, I think the name of the first argument to wn.synsets() is a misnomer: it shouldn't be lemma but word, since most people have been passing words, not lemmas, into the function =)

@alvations
Collaborator

alvations commented Oct 28, 2019

Okay, now this is awkward.

Actually, without any modification, the "eyeglasses" example has been "resolved". Unlike the cyclic old WordNet API, the new wordnet interface doesn't go through the lemma_names() function (which relies on the morphy lemmatizer) to fetch synsets. So by default it follows only the lemma names that are linked directly from the WordNet data.* files.

Existing behavior of wn without disabling morphy:

>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synsets('eyeglasses')
[Synset('spectacles.n.01')]

Still, I think exposing an argument for users to disable morphy when needed is helpful. Thus #18

@alvations alvations added the enhancement New feature or request label Oct 28, 2019
@goodmami
Collaborator

This is nice, but use_morphy as a parameter name is unfortunately English-specific, since the Morphy tool is (implicitly) English-only. Not only is the parameter completely irrelevant if lang is something other than eng, but what if someone later adds a lemmatizer for another language? I agree that stem as a parameter name is inaccurate, since this is lemmatization and not stemming, so why not lemmatize?

Also, I'm with @stevenbird that it's best to make behavior consistent for all languages instead of special-casing English. Since you're replacing the default WordNet module in the NLTK, this seems like a good time to introduce such a change. However, if NLTK follows semantic versioning and you're not ready to make a 4.0 release (because of the backward-compatibility breakage), you could make the default True and issue a warning (WordNetWarning("lemmatization is not provided for this language") or DeprecationWarning("lemmatization will be turned off by default in the next major version")), then make the change for a later release.

Finally, it would be even better if users could supply their own lemmatizer. E.g., wn.synsets(word, lang='xyz', lemmatize=lemmatize_xyz) where lemmatize_xyz is a compatible function for lemmatizing words in language xyz. This way users could even use other lemmatizers for English, too. For convenience, if lemmatize=True then it uses the default function depending on the value of lang, and if none exists for the language, an error is raised.

@stevenbird
Member Author

stevenbird commented Oct 28, 2019

Thanks for these suggestions @goodmami.

So the default behaviour would be to use a lemmatizer if one is available, and proceed without one otherwise (issuing the warning). The only change required is for users of wordnets other than English, who would have to tweak their code to avoid the warning.

And if a function is passed, we use it.

@alvations
Collaborator

@goodmami @stevenbird got some free time to look at this again.

Let me try to confirm the requirements before I reimplement stuff =)

  • The desired interface would be wn.synsets(word, lang='xyz', lemmatize_func=xyz_lemmatize), where xyz_lemmatize() is a compatible function that takes a token and returns the lemmatized form.

  • By default, the language would be set to English lang='en' and the default lemmatize_func=None.

  • And for backwards compatibility, we can, for now, enforce the morphy lemmatizer if lang=='en', and raise a warning that this will be removed in future versions.

Do the requirements sound about right?

@goodmami
Collaborator

goodmami commented Jan 6, 2020

That's close to what I was thinking. But more specifically:

DEFAULT_LEMMATIZERS = {
    'eng': morphy,
    ...
}

def synsets(word, pos=None, lang='eng', check_exceptions=True, lemmatize=True):
    if lemmatize is True:
        if lang not in DEFAULT_LEMMATIZERS:
            warnings.warn(
                "No default lemmatizer for language '{}'".format(lang),
                WordNetWarning)
            lemmatize = False
        else:
            lemmatize = DEFAULT_LEMMATIZERS[lang]
    if lemmatize:
        word = lemmatize(word, pos=pos, check_exceptions=check_exceptions)
    ...

This way we keep the default behavior, but users can easily disable English lemmatization with lemmatize=False. For other values of lang, only a warning (not an error) will appear if there is no lemmatizer defined and they don't change the default value of lemmatize. And other lemmatizers can be used by passing a compatible function in directly. That function would have the signature lemmatize(word, pos=None, check_exceptions=True) for compatibility, but the latter two may not be relevant for other lemmatizers. Actually I'd rather get rid of check_exceptions and instead let users pass in things like lemmatize=morphy_no_exceptions or something, but I kept it in for backward compatibility.

Finally, I now wonder if "lemmatize" is even the right word, because I can imagine users only wanting simple normalization, like downcasing. Maybe normalize?
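For instance, a parameter that accepts any callable would let users plug in something as simple as downcasing (a toy index; the parameter name normalize and the toy_synsets function are illustrative, since the naming is still under discussion):

```python
# Illustrative only: accepting any callable as the "normalize" step lets
# users pass str.lower instead of a full lemmatizer.
INDEX = {'spectacles': ['spectacles.n.01']}

def toy_synsets(word, normalize=None):
    if normalize is not None:
        word = normalize(word)          # e.g. str.lower, or a lemmatizer
    return INDEX.get(word, [])
```

Under this design, toy_synsets('Spectacles', normalize=str.lower) finds the entry, while the bare call misses it because no normalization is applied.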


3 participants