Support word decompounding for inflecting words #37

grhoten · 2024-12-11T07:19:36Z

In many Germanic languages, it's common to compound arbitrary words. When inflecting words, this makes it hard to keep the lexicon small, or to scale it up to cover many compound words. The code contribution in pull request #35 excluded this functionality, primarily because of potential licensing issues with the data and the desire to expedite the review of the submission. However, at this year's Unicode Technical Workshop (2024), I was made aware of Unicode's Unilex repository, which has the majority of the data needed to make word decompounding work. Some additional option values can be added to note when a word should or should not be decompounded in certain ways.

Because word decompounding was missing, and because of what was in the lexicon, some tests were removed from the code contribution. The affected languages were da, de, fi, nb, nl, and sv.

I see that Unilex has data for da, de, fi, nl, and sv. I didn't see a file for any of the Norwegian variants.

We should consider adding word decompounding with data derived from Unilex to improve the inflection capabilities.
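To illustrate the idea, here is a minimal Python sketch of lexicon-based decompounding used for inflection. It is not the project's implementation: the `LEXICON` entries, the `LINKING_ELEMENTS` list, and the `decompound` and `pluralize` functions are all hypothetical names made up for this example, and the data is a tiny hand-written German sample rather than anything taken from Unilex. The point is only that a compound can be inflected by splitting it into known words and inflecting the final component.

```python
from typing import List, Optional

# Hypothetical toy lexicon mapping a singular noun to its plural.
# Real data would come from a proper lexicon; these entries are illustrative.
LEXICON = {
    "haus": "häuser",
    "tür": "türen",
    "arbeit": "arbeiten",
    "zimmer": "zimmer",
}

# Illustrative subset of German linking elements that can join components.
# The empty string means the components are joined directly.
LINKING_ELEMENTS = ("", "s", "es", "n", "en")


def decompound(word: str) -> Optional[List[str]]:
    """Split a word into pieces whose stems are all in LEXICON.

    Non-final pieces keep their linking element (e.g. "arbeits").
    Returns None when no complete split is found.
    """
    word = word.lower()
    if word in LEXICON:
        return [word]
    # Try the longest candidate modifier first.
    for i in range(len(word) - 1, 0, -1):
        piece = word[:i]
        for link in LINKING_ELEMENTS:
            if link and not piece.endswith(link):
                continue
            stem = piece[: len(piece) - len(link)] if link else piece
            if stem in LEXICON:
                rest = decompound(word[i:])
                if rest is not None:
                    return [piece] + rest
    return None


def pluralize(word: str) -> Optional[str]:
    """Pluralize a compound by inflecting only its final component."""
    parts = decompound(word)
    if parts is None:
        return None
    result = "".join(parts[:-1]) + LEXICON[parts[-1]]
    # Restore the initial capital if the input had one.
    if word[:1].isupper():
        result = result[0].upper() + result[1:]
    return result


print(decompound("Arbeitszimmer"))
print(pluralize("Haustür"))
```

With this approach the lexicon only needs entries for the simple words, and a word that cannot be fully split (`pluralize` returns `None`) can fall back to whatever the inflector does today. A real implementation would also need the per-word options mentioned above to block splits that are valid by the lexicon but wrong for the word.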

nciric · 2024-12-18T18:49:34Z

I agree. I am surprised that Unilex has enough data (last time I checked it was fairly sparse across languages), but I am also glad that it covers enough for German language needs.

grhoten added the enhancement (New feature or request) label on Dec 11, 2024
