The best lexicon type/format to use #1

nciric · 2024-02-21T23:23:52Z

Lexicons are a critical part of the inflection project. They need to be used at runtime, and will also be used by our tools for potential ML training.

We need to decide on the format we collect the data in. This decision needs to be based on multiple criteria:

Is the format open, or under a friendly license?
Can other lexicons be converted into that format, so we have consistent data?
Is the format efficient, to reduce size & allow quick lookup?
How good are the existing tools for operating on the lexicon?
Can the lexicon data be easily pruned to what the user needs, to reduce deployment size?

An example tool and format, used in some universities (see languages):

Unitex/GramLab from a university in France, https://unitexgramlab.org/ (LGPL)
Unitex lexicons (22 languages, with varied coverage), https://unitexgramlab.org/language-resources (LGPLLR)

They use the DELA class of dictionaries (I couldn't find a better link that describes the DELA format).
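For readers unfamiliar with DELA: to my understanding, an inflected-form (DELAF-style) entry is a single text line shaped like `form,lemma.CODE+trait:inflection`, for example `houses,house.N:p`. The Python sketch below parses that simplified shape. The names `DelafEntry` and `parse_delaf_line` are made up for illustration, the `Conc` trait is only a sample value, and escaped characters and trailing comments that real dictionaries may contain are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelafEntry:
    form: str           # inflected surface form
    lemma: str          # canonical form
    pos: str            # grammatical code, e.g. "N"
    traits: tuple       # optional "+"-separated codes
    inflections: tuple  # ":"-separated inflectional codes


def parse_delaf_line(line: str) -> DelafEntry:
    """Parse one simplified DELAF-style line (no escapes, no comments)."""
    form, rest = line.strip().split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    pos, *traits = codes.split("+")
    # Assumption: an empty lemma field means the lemma equals the form.
    return DelafEntry(form, lemma or form, pos, tuple(traits), tuple(inflections))


plural = parse_delaf_line("houses,house.N+Conc:p")
same_as_lemma = parse_delaf_line("sheep,.N:s:p")
```

Here `plural.lemma` is `house` and `same_as_lemma.inflections` holds both `s` and `p`, so one line can describe a form that is ambiguous between singular and plural.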

What are other options we can use? Other criteria for selecting a lexicon/dictionary?

nciric · 2024-02-22T18:35:56Z

Another approach is to use the UniMorph package.
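For comparison, UniMorph data is distributed as tab-separated lines of lemma, inflected form, and a `;`-separated feature bundle. A minimal Python sketch of reading such lines follows; `parse_unimorph` and the sample rows are illustrative assumptions, not code or data taken from the UniMorph project.

```python
def parse_unimorph(text):
    """Return (lemma, form, features) triples from UniMorph-style TSV text."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        lemma, form, features = line.split("\t")
        entries.append((lemma, form, frozenset(features.split(";"))))
    return entries


SAMPLE = "house\thouse\tN;SG\nhouse\thouses\tN;PL\n"
triples = parse_unimorph(SAMPLE)
```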

BrunoCartoni · 2024-02-26T07:07:04Z

Before answering nciric@'s great questions, we also need to decide what our end goal is:

(1) Do we want to build/store a lexicon? (e.g. store "house, n: house,sing/houses,plur")
(2) Do we want to be able to generate inflected forms based on a "lemma" and grammatical info? (e.g. input: house + plural, output: houses)
(3) Do we want to be able to analyse an inflected form and provide its lemma and grammatical info? (e.g. input: houses, output: house, plur)
(4) ... other?

If we can clarify these points first, then we can choose the lexicon format.
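To make the three goals concrete, here is a minimal Python sketch in which one stored table (goal 1) serves both generation (goal 2) and analysis (goal 3). `TinyLexicon` and its methods are hypothetical names used only for illustration; a real implementation would likely not keep every form in memory.

```python
from collections import defaultdict


class TinyLexicon:
    def __init__(self, triples):
        self._by_lemma = defaultdict(dict)  # lemma -> {features: form}
        self._by_form = defaultdict(list)   # form -> [(lemma, features)]
        for lemma, form, features in triples:
            key = frozenset(features)
            self._by_lemma[lemma][key] = form
            self._by_form[form].append((lemma, key))

    def generate(self, lemma, features):
        """Goal 2: lemma + grammatical info -> inflected form (or None)."""
        return self._by_lemma.get(lemma, {}).get(frozenset(features))

    def analyze(self, form):
        """Goal 3: inflected form -> list of (lemma, features)."""
        return list(self._by_form.get(form, []))


lexicon = TinyLexicon([
    ("house", "house", {"sing"}),
    ("house", "houses", {"plur"}),
])
generated = lexicon.generate("house", {"plur"})
analysis = lexicon.analyze("houses")
```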

JelenaMitrovic · 2024-02-26T11:59:58Z

Hello everyone, Jelena Mitrovic here, very excited to have been invited by @nciric to join this effort.

@BrunoCartoni I have worked with UNITEX and DELA dictionaries for Serbian during my PhD (a while back). If we choose to go this route, the answers to your questions would be:

(1) the DELAF (for simple words) and DELACF (for compounds) lexica already have all the forms available alongside the lemma.

(2) the dictionary format contains this already - the problem is that dictionaries are limited, domain specific, and project-specific - so people build their own and do not share them. We would thus have to find people who are willing to supplement the existing resources, or share the ones they have.

(3) this would be ideal, and again, possible with DELA dictionaries.

Regarding the UniMorph package, I do not have experience with it, but it seems to go well with Universal Dependencies. Inflection is dependent on syntax, so it might make sense to include at least the simplest UDs for each language.

The issues we are dealing with here are quite complex, and I hope to have a better understanding of the overall requirements for Unicode after our meeting.

nciric · 2024-02-27T01:23:28Z

Thanks Bruno and Jelena, see my answers below:

(1) We will need to build/store some words in the lexicon, e.g. exceptions to the rules.
(2) I would prefer to generate inflected forms where possible, to reduce the size of the lexicon (and lookup time). For example, for Serbian we would need 14 forms, including the lemma. Generating them would reduce the number of stored forms to only 1.
(3) This is a secondary goal at the moment, but it's something we may get for free using Finite State Transducers, as they often work both ways. I am sure there are a number of exceptions that would have to be stored in the lexicon. We need to see what trade-off we need to make, if any, when deciding on this point.
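The "store exceptions, generate the rest" idea can be sketched in a few lines of Python. English plurals are used here only because the rule is short; the rule is deliberately incomplete (it does not handle "city" -> "cities", for instance), and `EXCEPTIONS` and `pluralize` are illustrative names, not part of any agreed design.

```python
# Only irregular words are stored; regular forms are generated on demand.
EXCEPTIONS = {"child": "children", "mouse": "mice", "sheep": "sheep"}
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def pluralize(lemma):
    if lemma in EXCEPTIONS:                 # the lexicon lookup wins
        return EXCEPTIONS[lemma]
    if lemma.endswith(SIBILANT_ENDINGS):    # rule: sibilant ending -> "es"
        return lemma + "es"
    return lemma + "s"                      # default rule


plurals = [pluralize(word) for word in ("house", "box", "child")]
```

The lexicon holds three entries regardless of how many regular nouns are inflected, which is the size reduction described in point (2).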

My use case for the library is:

Take a message format message with placeholders. Placeholders have annotated case, e.g. VOC, SINGULAR.
Take in the parameters from the user, e.g. Beograd (Belgrade), Rim (Rome)
Look up grammatical info from the lexicon, e.g. Beograd -> masculine, inanimate, Rim -> masculine, inanimate
Pass the parameters to our new API, inflect("Beograd", grammatical_info), same for Rim, or directly to message format (which would then automatically do the necessary call)
Get the formatted message
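The steps above can be sketched in Python as follows. Everything here is an assumption made for illustration: the `{name:CASE:NUMBER}` placeholder syntax is invented (it is not real message format syntax), `LEXICON` and `FORMS` are toy tables, and this `inflect` only looks up stored forms, whereas a real implementation would use the grammatical info to generate them.

```python
import re

# Step 3 data: grammatical info per word (toy lexicon).
LEXICON = {
    "Beograd": {"gender": "masculine", "animacy": "inanimate"},
    "Rim": {"gender": "masculine", "animacy": "inanimate"},
}

# Toy table of stored forms standing in for a real generator.
FORMS = {
    ("Beograd", "VOC", "SINGULAR"): "Beograde",
    ("Rim", "VOC", "SINGULAR"): "Rime",
}

PLACEHOLDER = re.compile(r"\{(\w+):(\w+):(\w+)\}")


def inflect(word, grammatical_info, case, number):
    # grammatical_info is accepted but unused by this toy lookup;
    # unknown words are returned unchanged.
    return FORMS.get((word, case, number), word)


def format_message(template, params):
    def replace(match):
        name, case, number = match.groups()
        word = params[name]
        info = LEXICON.get(word, {})             # step 3: lexicon lookup
        return inflect(word, info, case, number)  # step 4: inflect
    return PLACEHOLDER.sub(replace, template)     # step 5: formatted message


message = format_message("Zdravo, {city:VOC:SINGULAR}!", {"city": "Beograd"})
```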

We can think of other scenarios, like building an index for search and asking for lemmatization, where your point 3) holds.

grhoten · 2024-02-27T08:15:40Z

It would be helpful to see a summary of what formats are available out there. Using a format that is compatible with the Unicode license is preferable.

Mihai did point out DMLex (https://docs.oasis-open.org/lexidma/dmlex/v1.0/csd02/dmlex-v1.0-csd02.html), which seems promising.

I like the idea of interoperability and leveraging existing repositories. On the other hand, I don't like restrictions on adding grammemes and other morphological information. For example, it's important to know the phonetic information, or whether a word starts or ends with a vowel or consonant. That's important for grammatical agreement in several languages, including English, French and Korean. So if adding such information is difficult, I'd like to steer clear of such restrictions.
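As a small illustration of why phonetic information matters, here is a Python sketch that picks the English indefinite article from a per-word annotation and falls back to spelling only when no annotation exists. The `starts_with_vowel_sound` field and the function name are hypothetical.

```python
# Phonetic annotations for words whose spelling is misleading.
PHONETIC = {
    "hour": {"starts_with_vowel_sound": True},         # silent "h"
    "university": {"starts_with_vowel_sound": False},  # starts with a "y" sound
}

VOWEL_LETTERS = {"a", "e", "i", "o", "u"}


def indefinite_article(word):
    entry = PHONETIC.get(word)
    if entry is not None:
        vowel_sound = entry["starts_with_vowel_sound"]
    else:
        # Spelling-based guess, used only when the lexicon has no annotation.
        vowel_sound = word[:1].lower() in VOWEL_LETTERS
    return "an" if vowel_sound else "a"


articles = {w: indefinite_article(w) for w in ("hour", "university", "apple", "house")}
```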

If a lexicon format can help sort a list of English adjectives correctly, that would be a strong format to consider, but it's not a requirement. Adjective order (https://en.wikipedia.org/wiki/Adjective#Order) in English is a helpful problem to solve.

Conceptually, I'd like the data structured so that a lemma is associated with all its surface forms, and each surface form is annotated with grammemes to differentiate it from the other surface forms under that lemma. As an example, Wiktionary has katt (https://en.wiktionary.org/wiki/katt#Swedish) in Swedish, and it has a well annotated declension table and pronunciation.
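One possible shape for that structure, sketched in Python with the Swedish noun katt. The class names, the `find` helper and the grammeme spellings are assumptions made for illustration, and pronunciation is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SurfaceForm:
    text: str
    grammemes: frozenset


@dataclass
class LexicalEntry:
    lemma: str
    pos: str
    forms: list = field(default_factory=list)

    def add(self, text, *grammemes):
        self.forms.append(SurfaceForm(text, frozenset(grammemes)))

    def find(self, *grammemes):
        """Return all surface forms carrying every requested grammeme."""
        wanted = frozenset(grammemes)
        return [f.text for f in self.forms if wanted <= f.grammemes]


katt = LexicalEntry("katt", "noun")
katt.add("katt", "nominative", "indefinite", "singular")
katt.add("katten", "nominative", "definite", "singular")
katt.add("katter", "nominative", "indefinite", "plural")
katt.add("katterna", "nominative", "definite", "plural")
katt.add("katts", "genitive", "indefinite", "singular")
katt.add("kattens", "genitive", "definite", "singular")
katt.add("katters", "genitive", "indefinite", "plural")
katt.add("katternas", "genitive", "definite", "plural")

definite_plural = katt.find("nominative", "definite", "plural")
genitive_singular = katt.find("genitive", "singular")
```

Because grammemes are an open set of labels, adding phonetic or other information would not require a schema change in this sketch.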

I'm sure there will be discussions on what should be implicit and explicit in such structured lexicons, and I'd prefer to have that as a separate topic. I'd also like to defer API and code discussion to a separate topic.

BrunoCartoni · 2024-02-27T12:38:29Z

Thanks Nebojša for sketching out the main use case, and thank you all for the interesting conversation!

Based on Nebojša's use case, here is a first draft of the requirements:

1. Lexicon:
a. store the main grammatical features of each entry
b. (eventually) store all the forms of an entry with their grammatical features, especially for exceptions that are hard to handle with (2)
2. A morphological generator that can generate the correct form in a placeholder, based on the grammatical features stored in the lexicon entry and the specifications in the message.

So in Nebojša's examples, (1) will contain:
- Beograd: masculine, inanimate
- Rim: masculine, inanimate

(2) will generate the correct form in the message according to the specification in the template (e.g. VOC, SINGULAR), and all the morphological or phonetic information mentioned by George Rhoten will be stored in (1).

Please let me know if we all agree on these first principles.

As for the internal structure of the lexicon, we can leverage the "lexical masks" we developed (introduced in https://aclanthology.org/2020.lrec-1.372.pdf), which are already used by Wikidata.

Bruno


nciric · 2024-02-27T19:41:19Z

> Mihai did point out DMLex, which seems promising.

DMLex also sounds promising, thanks for linking.

> I'm sure there will be discussions on what should be implicit and explicit in such structured lexicons, and I'd prefer to have that as a separate topic. I'd also like to defer API and code discussion to a separate topic.

I opened #3 for discussing APIs (and use cases).

grhoten · 2024-12-10T18:18:05Z

A different format is being used in pull request #35, which was merged today. Changing to a different format is possible, but the tools should create the same compiled binary format. A lot of effort has gone into making the binary format as small and as fast as possible for each language. The binary format compresses better on very regular inflection patterns of lemmas with different suffixes.
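To illustrate why regular patterns compress well, here is a Python sketch that factors each inflection table into a stem plus a shared suffix pattern, so that lemmas inflecting the same way store only a small pattern index. This is not the project's binary format, which is not described in this thread; the function names and the English toy data are assumptions.

```python
import os


def factor(forms):
    """Split a list of forms into a common stem and a tuple of suffixes."""
    stem = os.path.commonprefix(forms)
    return stem, tuple(form[len(stem):] for form in forms)


def compress(tables):
    patterns = []   # unique suffix tuples, stored once
    index = {}      # suffix tuple -> position in patterns
    entries = {}    # lemma -> (stem, pattern id)
    for lemma, forms in tables.items():
        stem, suffixes = factor(forms)
        if suffixes not in index:
            index[suffixes] = len(patterns)
            patterns.append(suffixes)
        entries[lemma] = (stem, index[suffixes])
    return patterns, entries


def expand(patterns, entry):
    stem, pattern_id = entry
    return [stem + suffix for suffix in patterns[pattern_id]]


TABLES = {
    "cat": ["cat", "cats"],
    "dog": ["dog", "dogs"],
    "box": ["box", "boxes"],
}
patterns, entries = compress(TABLES)
```

Three tables collapse into two patterns here; the more lemmas share a pattern, the less each additional lemma costs.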

If a different source format is used, techniques should be implemented for scalable filtering of the data. For example, you may want to combine 2 data sources, with one taking precedence over the other. You may want to add, remove, or reorder some inflection tables for ambiguous words. Uncommon, non-standard or rare forms may need to be omitted to save space, like adjective forms of city names or plural forms of people's names. Forms that are algorithmically derived may also be omitted, like deciding when to add 's or s' in English. Those are some things to consider when switching to a different lexicon format.
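Two of the operations mentioned above, merging with precedence and dropping rare forms, might look like this in Python. This is an in-memory toy with made-up data, function names and tag names ("rare", "singular", "plural"); it says nothing about how the project's tools actually do it or how to make it scale.

```python
def merge_sources(primary, secondary):
    """Combine two lexicons; entries in `primary` replace those in `secondary`."""
    merged = dict(secondary)
    merged.update(primary)
    return merged


def drop_tagged_forms(lexicon, unwanted_tags):
    """Remove every form that carries at least one unwanted tag."""
    unwanted = frozenset(unwanted_tags)
    return {
        lemma: [(form, tags) for form, tags in forms if not unwanted & frozenset(tags)]
        for lemma, forms in lexicon.items()
    }


# Hypothetical sources: a small curated one and a larger bulk one.
CURATED = {"Paris": [("Paris", {"singular"})]}
BULK = {
    "Paris": [("Paris", {"singular"}), ("Parises", {"plural", "rare"})],
    "cat": [("cat", {"singular"}), ("cats", {"plural"})],
}

merged = merge_sources(CURATED, BULK)
filtered = drop_tagged_forms(BULK, {"rare"})
```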

nciric added the "question" (Further information is requested) label on Feb 21, 2024

nciric mentioned this issue on Feb 27, 2024 in "What is our API surface?" #3 (Open)

nciric added the "discuss" (Discussion item) label and removed the "question" (Further information is requested) label on Feb 28, 2024

macchiati mentioned this issue on Mar 15, 2024 in "Support making words definite, indefinite or construct" #6 (Open)
