Common names interpreted as verbs #24

Open
jokull opened this issue May 26, 2020 · 5 comments

@jokull
Contributor

jokull commented May 26, 2020

I ran the 100 most common first names in Iceland through greynir.parse. No female names are interpreted as verbs, but a few male ones are. Each entry below shows the name, the lemma Greynir assigned, and the category (so, i.e. verb). See this gist for the code.

https://gist.github.com/jokull/2c1048bbc845feb46c717ac7c77e0cc5

  • Einar → eina so
  • Árni → árna so
  • Helgi → helga so
  • Ragnar → ragna so
  • Óskar → óska so
  • Birgir → birgja so
  • Brynjar → brynja so
  • Rúnar → rúna so
  • Ómar → óma so
  • Reynir → reyna so
  • Garðar → garða so
  • Steinar → steina so
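
A check like the one described above can be sketched as follows. This is not the code from the gist: it is a minimal illustration that assumes the parser returns an object with a `terminals` list whose items carry `lemma` and `category`, as in the `parse_single` output quoted later in this thread. The helper names (`names_read_as_verbs`, `load_greynir_parser`, `fake_parse_single`) and the canned values for "Anna" are hypothetical.

```python
from types import SimpleNamespace

VERB = "so"  # Greynir's category tag for verbs


def names_read_as_verbs(names, parse_single):
    """Return (name, lemma) pairs for names whose parse contains a verb terminal."""
    hits = []
    for name in names:
        sentence = parse_single(name)
        # Assumption: a failed parse yields None or a missing terminals list.
        terminals = getattr(sentence, "terminals", None) or []
        for terminal in terminals:
            if terminal.category == VERB:
                hits.append((name, terminal.lemma))
                break
    return hits


def load_greynir_parser():
    """Return the real parser; needs the reynir package (not called here)."""
    from reynir import Greynir
    return Greynir().parse_single


def fake_parse_single(text):
    """Stand-in parser with canned, illustrative values (not real Greynir output)."""
    canned = {"Einar": ("eina", "so"), "Anna": ("Anna", "person")}
    lemma, category = canned[text]
    terminal = SimpleNamespace(text=text, lemma=lemma, category=category)
    return SimpleNamespace(terminals=[terminal])


hits = names_read_as_verbs(["Einar", "Anna"], fake_parse_single)
print(hits)
```

To run it against the real parser, pass `load_greynir_parser()` in place of `fake_parse_single`.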
@jokull
Contributor Author

jokull commented May 26, 2020

If I call greynir.parse_single(f'Forstjórinn heitir {name}') ("The CEO is named {name}"), that gives Greynir enough context to realize that a person is meant. The one exception is the name Örn, where "Örn." including the period becomes a single terminal:

>>> sentence = greynir.parse_single('Forstjórinn heitir Örn.')
>>> sentence.terminals
[Terminal(text='Forstjórinn', lemma='forstjóri', category='no', variants=['et', 'gr', 'kk', 'nf'], index=0),
 Terminal(text='heitir', lemma='heita', category='so', variants=['1', 'nf', 'et', 'fh', 'gm', 'nt', 'p3'], index=1),
 Terminal(text='Örn.', lemma='Örn.', category='no', variants=['et', 'hk', 'nf'], index=2)]

@vthorsteinsson
Member

This is not a surprise, really, as Greynir has a preference for recognizing sentences (with verbs) rather than noun phrases, if both are possible. But for this use case, I would recommend using parse_noun_phrase() instead of parse_single() - that would always give preference to names instead of verbs. Would this solve your problem?
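
One way to act on this suggestion is to try the noun-phrase parse first and fall back to the sentence parse. The sketch below only assumes that the parser object exposes `parse_noun_phrase()` and `parse_single()` as named in this thread; the exact signature and return type of `parse_noun_phrase()` are assumptions, and `parse_query` and the stub objects are hypothetical.

```python
from types import SimpleNamespace


def parse_query(greynir, query):
    """Prefer a noun-phrase parse; fall back to a full sentence parse.

    Returns a (mode, result) pair so the caller knows which path was taken.
    """
    # Assumption: parse_noun_phrase may be absent or may return None on failure.
    parse_np = getattr(greynir, "parse_noun_phrase", None)
    if callable(parse_np):
        result = parse_np(query)
        if result is not None:
            return "noun_phrase", result
    return "sentence", greynir.parse_single(query)


# Stub parsers so the sketch runs without the library installed.
stub_with_np = SimpleNamespace(
    parse_noun_phrase=lambda q: SimpleNamespace(text=q),
    parse_single=lambda q: SimpleNamespace(text=q),
)
stub_without_np = SimpleNamespace(parse_single=lambda q: SimpleNamespace(text=q))

mode_a, _ = parse_query(stub_with_np, "Einar")
mode_b, _ = parse_query(stub_without_np, "Einar")
print(mode_a, mode_b)
```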

@vthorsteinsson
Member

Having said that, the Örn. case is clearly a bug ;-)

@sveinbjornt
Member

Could this be related to "örn." being an abbreviation recognised by the tokenizer?

@jokull
Contributor Author

jokull commented May 26, 2020

I'm parsing search queries. I've resorted to just searching both the lemma and the original query.
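
The fallback described here, searching on both the lemma and the original query, might look roughly like this. It is a sketch, not the author's code: `search_terms` and the `lemmatize` callable are hypothetical, and the canned lemmas are taken from the list at the top of this issue.

```python
def search_terms(query, lemmatize):
    """Return the original query plus its lemmatized form, original first."""
    terms = [query]
    lemma_form = lemmatize(query)
    # Skip the lemma when it is empty or adds nothing new.
    if lemma_form and lemma_form != query:
        terms.append(lemma_form)
    return terms


# Canned lemmas from this issue stand in for a real lemmatizer.
CANNED_LEMMAS = {"Einar": "eina", "Óskar": "óska"}


def canned_lemmatize(query):
    """Return the canned lemma, or the query itself when none is known."""
    return CANNED_LEMMAS.get(query, query)


terms_einar = search_terms("Einar", canned_lemmatize)
terms_other = search_terms("Anna", canned_lemmatize)
print(terms_einar, terms_other)
```

Keeping the original query first means an exact match on the name is still tried even when the lemma is a verb form such as "eina".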

3 participants