The best lexicon type/format to use #1
Comments
Another approach is to use UniMorph package. |
Before answering @nciric's great questions, we also need to decide what our end goal is: (1) do we want to build/store a lexicon? (e.g., store "house, n: house,sing/houses,plur") If we can clarify these points first, then we can choose the lexicon format. |
Hello everyone, Jelena Mitrovic here, very excited to have been invited by @nciric to join this effort.
@BrunoCartoni I worked with UNITEX and DELA dictionaries for Serbian during my PhD (a while back). If we choose to go this route, the answers to your questions would be:
(1) the lexica DELAC (for simple words) DELAF (for compounds) already have all the forms available alongside the lemma.
(2) the dictionary format contains this already. The problem is that dictionaries are limited, domain-specific, and project-specific, so people build their own and do not share them. We would thus have to find people who are willing to supplement the existing resources, or share the ones they have.
(3) this would be ideal, and again, possible with DELA dictionaries.
Regarding the UniMorph package, I do not have experience with it, but it seems to go well with Universal Dependencies. Inflection is dependent on syntax, so it might make sense to include at least the simplest UDs for each language. The issues we are dealing with here are quite complex, and I hope to have a better understanding of the overall requirements for Unicode after our meeting. |
Thanks Bruno and Jelena, see my answers below:
My use case for the library is:
We can think of other scenarios, like building an index for search and asking for lemmatization, where your point 3) holds. |
It would be helpful to see a summary of what formats are available out there. Using a format that is compatible with the Unicode license is preferable.
Mihai did point out DMLex <https://docs.oasis-open.org/lexidma/dmlex/v1.0/csd02/dmlex-v1.0-csd02.html>, which seems promising.
I like the idea of interoperability and leveraging existing repositories. On the other hand, I don't like restrictions on adding grammemes and other morphological information. For example, it's important to know the phonetic information, or whether a word starts or ends with a vowel or consonant. That's important for grammatical agreement in several languages, including English, French and Korean. So if adding such information is difficult, I'd like to steer clear of such restrictions.
If a lexicon format can help sort a list of English adjectives correctly, that would be a strong format to consider, but it's not a requirement. Adjective order <https://en.wikipedia.org/wiki/Adjective#Order> in English is a helpful problem to solve.
Conceptually, I'd like the data structured so that a lemma is associated with all its surface forms, and each surface form is annotated with grammemes that differentiate it from the other surface forms under that lemma. As an example, Wiktionary has katt <https://en.wiktionary.org/wiki/katt#Swedish> in Swedish, and it has a well-annotated declension table and pronunciation.
I'm sure there will be discussions on what should be implicit and explicit in such structured lexicons, and I'd prefer to have that as a separate topic. I'd also like to defer API and code discussion to a separate topic. |
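To make the "lemma with annotated surface forms" idea concrete, here is a minimal sketch in Python. The class names (`SurfaceForm`, `LexiconEntry`), the `find` helper, and the grammeme labels are assumptions for illustration only, not a proposed format; the four nominative forms of Swedish katt are the ones listed in its declension table.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SurfaceForm:
    """One inflected form plus the grammemes that distinguish it."""
    text: str
    grammemes: frozenset


@dataclass
class LexiconEntry:
    """A lemma, its lemma-level features, and all of its surface forms."""
    lemma: str
    pos: str
    features: frozenset = frozenset()
    forms: list = field(default_factory=list)

    def find(self, *wanted):
        """Return the texts of all forms carrying every requested grammeme."""
        wanted = frozenset(wanted)
        return [f.text for f in self.forms if wanted <= f.grammemes]


# Hypothetical entry; grammeme names are illustrative, not a fixed tag set.
katt = LexiconEntry(
    lemma="katt",
    pos="noun",
    features=frozenset({"common"}),  # grammatical gender lives on the lemma
    forms=[
        SurfaceForm("katt", frozenset({"singular", "indefinite", "nominative"})),
        SurfaceForm("katten", frozenset({"singular", "definite", "nominative"})),
        SurfaceForm("katter", frozenset({"plural", "indefinite", "nominative"})),
        SurfaceForm("katterna", frozenset({"plural", "definite", "nominative"})),
    ],
)

plural_definite = katt.find("plural", "definite", "nominative")
singular_forms = katt.find("singular")
```

Because grammemes are an open set of strings here, extra information such as "starts with a vowel sound" could be added per form without changing the structure, which is the flexibility asked for above.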
Thanks Nebojša for sketching out the main use case, and thank you all for
the interesting conversation!
Based on Nebojša's use case, here is a first draft of the requirements:
1. lexicon:
a. store the main grammatical features of each entry
b. (eventually) store all the forms of an entry with their grammatical
features, especially for exceptions that are hard to handle with (2)
2. A morphological generator that can generate the correct form in a
placeholder, based on the grammatical features stored in the lexicon entry,
and the specifications in the message.
So in Nebojša's examples:
(1) will contain:
- Beograd: masculine, inanimate
- Rim: masculine, inanimate
(2) will generate the correct form in the message according to the
specification in the template (e.g. VOC, SINGULAR.)
and all the morphological or phonetic information mentioned by George
Rhoten will be stored in (1).
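The split between (1) and (2) could be sketched roughly as follows, in Python. Everything named here is an assumption for illustration: the `LEXICON` layout, the `inflect` and `fill` functions, and the `{name:CASE,NUMBER}` placeholder syntax are hypothetical, and the single vocative rule (append "-e" to a masculine noun) is a deliberately toy rule that happens to cover Beograd and Rim but not Serbian nouns in general.

```python
import re

# (1) Lexicon: grammatical features per entry, plus optional explicit forms
# for exceptions that the generator's rules cannot handle.
LEXICON = {
    "Beograd": {"features": {"masculine", "inanimate"}, "forms": {}},
    "Rim": {"features": {"masculine", "inanimate"}, "forms": {}},
}


def inflect(lemma, case, number):
    """(2) Generator: explicit forms win; otherwise fall back to toy rules."""
    entry = LEXICON[lemma]
    explicit = entry["forms"].get((case, number))
    if explicit is not None:
        return explicit
    if case == "NOM" and number == "SINGULAR":
        return lemma
    if case == "VOC" and number == "SINGULAR" and "masculine" in entry["features"]:
        return lemma + "e"  # toy rule, assumed for this example only
    raise NotImplementedError(f"no rule for {lemma} {case} {number}")


# Hypothetical placeholder syntax: {argument_name:CASE,NUMBER}
PLACEHOLDER = re.compile(r"\{(\w+):(\w+),(\w+)\}")


def fill(template, **args):
    """Replace each placeholder with the requested inflected form."""
    def replace(match):
        name, case, number = match.groups()
        return inflect(args[name], case, number)
    return PLACEHOLDER.sub(replace, template)


greeting = fill("Zdravo, {city:VOC,SINGULAR}!", city="Beograd")
```

Looking up explicit forms before applying rules is what lets requirement 1.b cover irregular entries without touching the generator.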
Please let me know if we all agree on these first principles.
As for the internal structure of the lexicon, we can leverage the "lexical
masks" we developed (introduced in
https://aclanthology.org/2020.lrec-1.372.pdf), which are already used by
Wikidata.
Bruno
|
DMLex also sounds promising, thanks for linking.
I opened #3 for discussing APIs (and use cases). |
A different format is being used in pull request #35, which was merged today. Changing to a different format is possible, but the tools should create the same compiled binary format. A lot of effort has gone into making the binary format as small and as fast as possible for each language. The binary format compresses better on very regular inflection patterns of lemmas with different suffixes. If a different source format is used, techniques should be implemented for scalable filtering of the data. For example, you may want to combine two data sources, with one taking precedence over the other. You may want to add, remove, or reorder some inflection tables for ambiguous words. Uncommon, non-standard or rare forms may need to be omitted to save space, like adjective forms of city names or plural forms of people's names. Forms that are algorithmically derived may want to be omitted, like deciding when to add |
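As a rough illustration of the filtering steps mentioned above (combining two sources with precedence, and omitting rare forms), here is a minimal Python sketch. The function names, the in-memory layout (lemma mapped to a list of form/grammeme pairs), and the "rare" tag are all assumptions for this example; they do not describe the format or tools used in the pull request.

```python
def merge_lexicons(primary, secondary):
    """Combine two sources; entries in `primary` replace those in `secondary`."""
    merged = {lemma: list(forms) for lemma, forms in secondary.items()}
    for lemma, forms in primary.items():
        merged[lemma] = list(forms)  # primary takes precedence per lemma
    return merged


def drop_tagged(lexicon, unwanted):
    """Omit every form that carries any of the unwanted grammemes."""
    return {
        lemma: [(text, tags) for text, tags in forms if not (tags & unwanted)]
        for lemma, forms in lexicon.items()
    }


# Hypothetical sample data; the "rare" tag is an assumed annotation.
primary = {
    "cactus": [("cactus", frozenset({"singular"})),
               ("cacti", frozenset({"plural"}))],
}
secondary = {
    "cactus": [("cactus", frozenset({"singular"})),
               ("cactuses", frozenset({"plural"}))],
    "Smith": [("Smith", frozenset({"singular"})),
              ("Smiths", frozenset({"plural", "rare"}))],
}

merged = merge_lexicons(primary, secondary)
compact = drop_tagged(merged, frozenset({"rare"}))
```

Precedence is applied per lemma here; a real pipeline might instead need it per inflection table or per form, which is the kind of choice the source format would have to support.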
Lexicons are a critical part of the inflection project. They need to be used at runtime, and will also be used by our tools for potential ML training.
We need to decide on the format we collect the data in. This decision needs to be based on multiple criteria:
An example tool and format, used in some universities (see languages):
They use the DELA class of dictionaries (I couldn't find a better link to describe the DELA format).
What are other options we can use? Other criteria for selecting a lexicon/dictionary?