Support word decompounding for inflecting words #37

Open
grhoten opened this issue Dec 11, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@grhoten
Member

grhoten commented Dec 11, 2024

In many Germanic languages, it's common to compound arbitrary words. This makes inflection harder: either the lexicon stays small and misses compounds, or it has to scale up to cover many compound words. The code contribution in pull request #35 excluded word decompounding, primarily because of potential licensing issues with the data and the desire to expedite the review of the submission. However, at the Unicode Technical Workshop this year (2024), I was made aware of Unicode's Unilex repository, which has the majority of the data needed to make word decompounding work. Some additional option values can be added to note when a word should or should not be decompounded in certain ways.

Because word decompounding was missing and the lexicon did not contain the compounds themselves, some tests were removed from the code contribution. The affected languages were da, de, fi, nb, nl, and sv.

I see that Unilex has data for da, de, fi, nl, and sv. I didn't see a file for any of the Norwegian variants.

We should consider adding word decompounding with data derived from Unilex to improve the inflection capabilities.
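
To make the idea concrete, here is a rough sketch of lexicon-based decompounding followed by inflection of the final component. Everything in it is an assumption for illustration: the toy `LEXICON`, the `LINKERS` list, and the function names `decompound` and `inflect_compound` are hypothetical, and it does not reflect the Unilex file format or any existing API in this project. It also ignores casing and the option values mentioned above.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: singular noun -> plural form (German examples).
LEXICON: Dict[str, str] = {
    "haus": "häuser",
    "tür": "türen",
    "liebe": "lieben",
    "brief": "briefe",
}

# Assumed linking elements that may sit between components ("" = none).
LINKERS: Tuple[str, ...] = ("", "s", "es", "n", "en")


def decompound(word: str, lexicon: Dict[str, str] = LEXICON) -> Optional[List[str]]:
    """Split a word into known lexicon components, or return None."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(1, len(word)):
        prefix, rest = word[:i], word[i:]
        for linker in LINKERS:
            if not prefix.endswith(linker):
                continue
            # Strip the linking element to recover the lexicon entry.
            stem = prefix[: len(prefix) - len(linker)]
            if stem and stem in lexicon:
                tail = decompound(rest, lexicon)
                if tail is not None:
                    return [stem] + tail
    return None


def inflect_compound(word: str, lexicon: Dict[str, str] = LEXICON) -> Optional[str]:
    """Pluralize a compound by inflecting only its final component."""
    parts = decompound(word, lexicon)
    if parts is None:
        return None
    word = word.lower()
    head = parts[-1]
    # Keep the modifiers and linking elements, swap in the inflected head.
    return word[: len(word) - len(head)] + lexicon[head]
```

With this toy data, `inflect_compound("Haustür")` yields `"haustüren"` and `inflect_compound("Liebesbrief")` yields `"liebesbriefe"`, even though neither compound is in the lexicon. A real implementation would need to handle ambiguous splits and words that must not be decompounded.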

@grhoten grhoten added the enhancement New feature or request label Dec 11, 2024
@nciric
Contributor

nciric commented Dec 18, 2024

I agree. I am surprised that Unilex has enough data (last time I checked it was fairly sparse across languages), but I am also glad that it covers enough for German language needs.
