
Add flag to permit synset lookup without stemming #17

Open
stevenbird opened this issue Oct 10, 2019 · 7 comments
Labels
enhancement New feature or request

Comments

@stevenbird
Member

Cf nltk/nltk#2421

I propose that we add a stem=False flag to wn.synsets().

It means that default behaviour for English will change, but I see no other option, given that stemming only happens for English wordnet. This would make behaviour consistent across languages.

@alvations
Collaborator

alvations commented Oct 28, 2019

This is actually a little complicated. WordNet access has been heavily dependent on the morphy algorithm to fetch synsets, and setting stem=False would end up skipping all the exception lists that exist for English, e.g.

>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synsets('geese')
[Synset('goose.n.01'), Synset('fathead.n.01'), Synset('goose.n.03')]
>>> wn.synsets('mice')
[Synset('mouse.n.01'), Synset('shiner.n.01'), Synset('mouse.n.03'), Synset('mouse.n.04')]

wn.synsets() is not exactly doing stemming but lemmatization, through morphy(). I would suggest exposing a use_morphy=True default argument instead. If a lemma is found directly from the user's input to wn.synsets(), then skip morphy. Otherwise, check whether the use_morphy argument is on and lemmatize with morphy when necessary.
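The lookup order proposed above could be sketched roughly like this (a toy index and exception list stand in for the real wn data; use_morphy and the toy_* names are illustrative, not an existing API):

```python
# Toy sketch of the proposed lookup order; INDEX and EXCEPTIONS stand in
# for the real WordNet index and morphy's exception lists.
EXCEPTIONS = {'geese': 'goose', 'mice': 'mouse'}   # irregular forms
INDEX = {'goose': ['goose.n.01'], 'mouse': ['mouse.n.01']}

def toy_morphy(word):
    # Check the exception list first; rule-based suffix stripping elided.
    return EXCEPTIONS.get(word, word)

def toy_synsets(word, use_morphy=True):
    if word in INDEX:          # direct hit on the user's input: skip morphy
        return INDEX[word]
    if use_morphy:             # only lemmatize when the flag is on
        return INDEX.get(toy_morphy(word), [])
    return []
```

With this ordering, toy_synsets('mice') still resolves the irregular plural via the exception list, while use_morphy=False makes the same call return nothing.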

@alvations
Collaborator

Also, I think the name of the first argument to wn.synsets() is a misnomer: it shouldn't be lemma but word, since most people have been passing words, not lemmas, into the function =)

@alvations
Collaborator

alvations commented Oct 28, 2019

Okay, now this is awkward.

Actually, without any modification, the "eyeglasses" example has been "resolved". Unlike the cyclic old WordNet API, the new wordnet interface doesn't go through the lemma_names() function (which relies on the morphy lemmatizer) to fetch synsets. So by default it follows only the lemma names that are linked directly from the WordNet data.* files.

Existing behavior of wn without disabling morphy:

>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synsets('eyeglasses')
[Synset('spectacles.n.01')]

Still, I think exposing an argument for users to disable morphy when needed is helpful. Thus #18

@alvations alvations added the enhancement New feature or request label Oct 28, 2019
@goodmami
Collaborator

This is nice, but use_morphy as a parameter name is unfortunately English-specific, since the Morphy tool is (implicitly) English-only. Not only is the parameter completely irrelevant if lang is something other than eng, but what if someone later adds a lemmatizer for another language? I agree that stem as a parameter name is inaccurate, since this is lemmatization and not stemming, so why not lemmatize?

Also, I'm with @stevenbird that it's best to make behavior consistent for all languages instead of special-casing English. Since you're replacing the default WordNet module in the NLTK, this seems like a good time to introduce such a change. However, if NLTK follows semantic versioning and you're not ready to make a 4.0 release (because of the backward-compatibility breakage), you could make the default True and issue a warning (WordNetWarning("lemmatization is not provided for this language") or DeprecationWarning("lemmatization will be turned off by default in the next major version")), then make the change for a later release.

Finally, it would be even better if users could supply their own lemmatizer. E.g., wn.synsets(word, lang='xyz', lemmatize=lemmatize_xyz) where lemmatize_xyz is a compatible function for lemmatizing words in language xyz. This way users could even use other lemmatizers for English, too. For convenience, if lemmatize=True then it uses the default function depending on the value of lang, and if none exists for the language, an error is raised.

@stevenbird
Member Author

stevenbird commented Oct 28, 2019

Thanks for these suggestions @goodmami.

So the default behaviour would be to use a lemmatizer if one is available, and proceed without one otherwise (issuing the warning). The only change required is for users of wordnets other than English, who would have to tweak their code to avoid the warning.

And if a function is passed, we use it.

@alvations
Collaborator

@goodmami @stevenbird got some free time to look at this again.

Let me try to confirm the requirements before I reimplement stuff =)

  • The desired interface would be wn.synsets(word, lang='xyz', lemmatize_func=xyz_lemmatize), where xyz_lemmatize() is a compatible function that takes a token and returns the lemmatized form.

  • By default, the language would be set to English lang='en' and the default lemmatize_func=None.

  • And for backwards compatibility, we can, for now, enforce the morphy lemmatizer if lang=='en', and raise a warning that this will be removed in future versions.

Do the requirements sound about right?

@goodmami
Collaborator

goodmami commented Jan 6, 2020

That's close to what I was thinking. But more specifically:

DEFAULT_LEMMATIZERS = {
    'eng': morphy,
    ...
}

def synsets(word, pos=None, lang='eng', check_exceptions=True, lemmatize=True):
    if lemmatize is True:
        if lang not in DEFAULT_LEMMATIZERS:
            warnings.warn(
                "No default lemmatizer for language '{}'".format(lang),
                WordNetWarning)
            lemmatize = False
        else:
            lemmatize = DEFAULT_LEMMATIZERS[lang]
    if lemmatize:
        word = lemmatize(word, pos=pos, check_exceptions=check_exceptions)
    ...

This way we keep the default behavior, but users can easily disable English lemmatization with lemmatize=False. For other values of lang, only a warning (not an error) will appear if there is no lemmatizer defined and they don't change the default value of lemmatize. And other lemmatizers can be used by passing a compatible function in directly. That function would have the signature lemmatize(word, pos=None, check_exceptions=True) for compatibility, but the latter two may not be relevant for other lemmatizers. Actually I'd rather get rid of check_exceptions and instead let users pass in things like lemmatize=morphy_no_exceptions or something, but I kept it in for backward compatibility.

Finally, I now wonder if "lemmatize" is even the right word, because I can imagine users only wanting simple normalization, like downcasing. Maybe normalize?
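For instance, a parameter that accepts any callable would let users plug in something as simple as downcasing (a toy index; the parameter name normalize and the toy_synsets function are illustrative, since the naming is still under discussion):

```python
# Illustrative only: accepting any callable as the "normalize" step lets
# users pass str.lower instead of a full lemmatizer.
INDEX = {'spectacles': ['spectacles.n.01']}

def toy_synsets(word, normalize=None):
    if normalize is not None:
        word = normalize(word)          # e.g. str.lower, or a lemmatizer
    return INDEX.get(word, [])
```

Under this design, toy_synsets('Spectacles', normalize=str.lower) finds the entry, while the bare call misses it because no normalization is applied.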


3 participants