In many Germanic languages, it is common to compound arbitrary words. For inflection, this makes it hard to keep the lexicon small, or to scale it up to cover the many possible compound words. The code contribution in pull request #35 excluded such functionality, primarily because of potential licensing issues with the data and the desire to expedite the review of the submission. However, at this year's Unicode Technical Workshop (2024), I was made aware of Unicode's Unilex repository, which contains the majority of the data needed to make word decompounding work. Some additional option values can be added to note when a word should or should not be decompounded in certain ways.
Because this functionality is missing, some tests that depended on word decompounding and on what was in the lexicon were removed from the code contribution. The affected languages were da, de, fi, nb, nl, and sv.
I see that Unilex has data for da, de, fi, nl, and sv. I didn't see a file for any of the Norwegian variants.
We should consider adding word decompounding with data derived from Unilex to improve the inflection capabilities.
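To make the idea concrete, here is a minimal sketch of lexicon-based decompounding. It is only an illustration: the tiny `LEXICON`, the `LINKERS` tuple, and the `decompound` function are all hypothetical names, not part of this project or of Unilex, and real data would have to be derived from the Unilex files. The sketch assumes German-style linking elements between components and simply returns the first split it finds, preferring longer leading components.

```python
from functools import lru_cache

# Hypothetical toy lexicon; real entries would be derived from Unilex data.
LEXICON = {"haus", "tür", "schlüssel", "arbeit", "zeit"}
# Assumed linking elements between components (illustrative, not exhaustive).
LINKERS = ("", "s", "es", "n", "en")


def decompound(word, lexicon=LEXICON, linkers=LINKERS):
    """Return one split of `word` into lexicon entries, or None if none exists."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def split(start):
        # Try longer candidate components first.
        for end in range(len(word), start, -1):
            part = word[start:end]
            if part not in lexicon:
                continue
            if end == len(word):
                # The candidate is the final component.
                return (part,)
            for linker in linkers:
                if not word.startswith(linker, end):
                    continue
                nxt = end + len(linker)
                if nxt >= len(word):
                    # A linking element must be followed by another component.
                    continue
                rest = split(nxt)
                if rest:
                    return (part,) + rest
        return None

    result = split(0)
    return list(result) if result is not None else None


print(decompound("Haustürschlüssel"))
print(decompound("Arbeitszeit"))
```

With a split like this, only the last component would need to be inflected from the lexicon, which is what keeps the lexicon small. The option values mentioned above could then mark entries that must not be split or joined in certain ways.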
I agree. I am surprised that Unilex has enough data (the last time I checked, it was fairly sparse across languages), but I am glad that it covers enough for German.