The best lexicon type/format to use #1

Open
nciric opened this issue Feb 21, 2024 · 8 comments
Labels
discuss Discussion item

Comments

@nciric
Contributor

nciric commented Feb 21, 2024

Lexicons are a critical part of the inflection project. They need to be usable at runtime, and our tools will also use them for potential ML training.

We need to decide on the format we collect the data in. This decision needs to be based on multiple criteria:

  1. Is the format open, or under a friendly license?
  2. Can other lexicons be converted into that format, so we have consistent data?
  3. Is the format efficient, to reduce size and allow quick lookup?
  4. How good are the existing tools for operating on the lexicon?
  5. Can the lexicon data be easily pruned to what the user needs, to reduce deployment size?

An example tool and format, used in some universities (see languages):

  1. Unitex/GramLab, from a university in France, https://unitexgramlab.org/ (LGPL)
  2. Unitex lexicons (22 languages, with varied coverage), https://unitexgramlab.org/language-resources (LGPLLR)

They use the DELA class of dictionaries (I couldn't find a better link that describes the DELA format).

What are other options we can use? Other criteria for selecting a lexicon/dictionary?

@nciric nciric added the question Further information is requested label Feb 21, 2024
@nciric
Contributor Author

nciric commented Feb 22, 2024

Another approach is to use the UniMorph package.

@BrunoCartoni

Before answering @nciric's great questions, we also need to decide what our end goal is:

(1) Do we want to build/store a lexicon? (e.g. store "house, n: house,sing/houses,plur")
(2) Do we want to be able to generate inflected forms from a "lemma" and grammatical info? (e.g. input: house + plural, output: houses)
(3) Do we want to be able to analyse an inflected form and provide its lemma and grammatical info? (e.g. input: houses, output: house, plur)
(4) ... other?

If we can clarify these points first, then we can choose the lexicon format.
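
To make the three goals concrete, here is a minimal Python sketch over a toy in-memory lexicon. The names `LEXICON`, `generate`, and `analyse`, as well as the tag names, are hypothetical illustrations and do not come from any existing library or from this project.

```python
# Goal (1): a stored lexicon, here lemma -> {grammatical tag: surface form}.
# The single entry and the tag names are illustrative assumptions.
LEXICON = {
    "house": {"sing": "house", "plur": "houses"},
}


def generate(lemma, tag):
    """Goal (2): lemma + grammatical info -> inflected form (None if unknown)."""
    return LEXICON.get(lemma, {}).get(tag)


def analyse(form):
    """Goal (3): inflected form -> every matching (lemma, tag) pair."""
    return [
        (lemma, tag)
        for lemma, forms in LEXICON.items()
        for tag, surface in forms.items()
        if surface == form
    ]


print(generate("house", "plur"))  # houses
print(analyse("houses"))          # [('house', 'plur')]
```

In this sketch both directions are served by the same stored table, which is the simplest possible design; a rule-based or transducer-based approach would trade stored data for computation.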

@JelenaMitrovic

JelenaMitrovic commented Feb 26, 2024

Hello everyone, Jelena Mitrovic here, very excited to have been invited by @nciric to join this effort.

@BrunoCartoni I have worked with UNITEX and DELA dictionaries for Serbian during my PhD (a while back). If we choose to go this route, the answers to your questions would be:

(1) the DELAF (inflected simple words) and DELACF (inflected compounds) lexica already list all the forms alongside the lemma.

(2) the dictionary format contains this already - the problem is that dictionaries are limited, domain specific, and project-specific - so people build their own and do not share them. We would thus have to find people who are willing to supplement the existing resources, or share the ones they have.

(3) this would be ideal, and again, possible with DELA dictionaries.
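
For readers unfamiliar with DELA, an inflected-form entry is commonly written in the shape `form,lemma.CODE:features`. The sketch below parses only that simplified shape. It assumes there are no escaped commas or dots and no comments, and it treats an empty lemma as "same as the form"; real Unitex dictionaries may be richer than this. `parse_delaf_line` is a hypothetical helper, not a Unitex API, and the sample lines are made up for illustration.

```python
def parse_delaf_line(line):
    """Parse 'form,lemma.CODE[+trait...][:features...]' into a dict."""
    form, rest = line.strip().split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *features = info.split(":")
    pos, *traits = codes.split("+")
    return {
        "form": form,
        "lemma": lemma or form,  # assumption: empty lemma means "same as form"
        "pos": pos,
        "traits": traits,        # e.g. semantic codes appended with '+'
        "features": features,    # one string per inflectional reading
    }


entry = parse_delaf_line("houses,house.N:p")
print(entry["lemma"], entry["pos"], entry["features"])  # house N ['p']
```

Because each line already pairs a surface form with its lemma and features, one parsed file can serve both generation and analysis.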

Regarding the UniMorph package, I do not have experience with it, but it seems to go well with Universal Dependencies (UD). Inflection is dependent on syntax, so it might make sense to include at least the simplest UD resources for each language.

The issues we are dealing with here are quite complex, and I hope to have a better understanding of the overall requirements for Unicode after our meeting.

@nciric
Contributor Author

nciric commented Feb 27, 2024

Thanks Bruno and Jelena, see my answers below:

  1. We will need to build/store some words in the lexicon - e.g. exceptions to the rules.
  2. I would prefer to generate inflected forms where possible, to reduce the size of the lexicon (and lookup time). For example, for Serbian we would need to store 14 forms, including the lemma. Generating them would reduce the number of stored forms to just 1.
  3. This is a secondary goal at the moment, but it's something we may get for free by using finite-state transducers, as they often work in both directions. I am sure there are a number of exceptions that would have to be stored in the lexicon. We need to see what trade-off we need to make, if any, when deciding on this point.
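
The "generate by rule, store only exceptions" idea from points 1 and 2 can be sketched as follows. English plurals are used for brevity instead of Serbian, the rules are deliberately incomplete, and `EXCEPTIONS` and `pluralize` are hypothetical names rather than part of any proposed API.

```python
# Only irregular plurals are stored; regular ones are generated by rule.
EXCEPTIONS = {"child": "children", "mouse": "mice", "foot": "feet"}


def pluralize(lemma):
    """Return the plural of an English noun lemma (toy rules, not exhaustive)."""
    if lemma in EXCEPTIONS:
        return EXCEPTIONS[lemma]
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        return lemma + "es"
    if lemma.endswith("y") and len(lemma) > 1 and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ies"
    return lemma + "s"


for word in ["house", "box", "city", "day", "child"]:
    print(word, "->", pluralize(word))
```

The lexicon then grows only with the number of exceptions, not with the number of lemmas multiplied by the number of forms.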

My use case for the library is:

  1. Take a MessageFormat message with placeholders. The placeholders are annotated with grammatical info such as case and number, e.g. VOC, SINGULAR.
  2. Take in the parameters from the user, e.g. Beograd (Belgrade), Rim (Rome)
  3. Look up grammatical info from the lexicon, e.g. Beograd -> masculine, inanimate, Rim -> masculine, inanimate
  4. Pass the parameters to our new API, inflect("Beograd", grammatical_info), same for Rim, or directly to message format (which would then automatically do the necessary call)
  5. Get the formatted message
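
The five steps above can be sketched end to end as below. This is only an illustration under strong assumptions: the lexicon, the suffix table, and the signatures of `inflect` and `format_message` are hypothetical (the real API is still to be designed), the whole message uses one case for simplicity, and plain suffixing ignores the stem alternations that Serbian has in general.

```python
# Step 3: hypothetical lexicon, word -> grammatical info.
LEXICON = {
    "Beograd": {"gender": "masculine", "animacy": "inanimate"},
    "Rim": {"gender": "masculine", "animacy": "inanimate"},
}

# Toy suffix table covering only the inflection class needed here.
SUFFIXES = {
    ("masculine", "inanimate", "NOM", "SINGULAR"): "",
    ("masculine", "inanimate", "VOC", "SINGULAR"): "e",
}


def inflect(word, grammatical_info, case, number):
    """Step 4: inflect a word given its lexicon info and the requested case."""
    key = (grammatical_info["gender"], grammatical_info["animacy"], case, number)
    return word + SUFFIXES[key]


def format_message(pattern, case, number, **params):
    """Steps 1, 2 and 5: inflect every parameter, then fill the placeholders."""
    inflected = {
        name: inflect(value, LEXICON[value], case, number)
        for name, value in params.items()
    }
    return pattern.format(**inflected)


print(format_message("Zdravo, {city}!", "VOC", "SINGULAR", city="Beograd"))
# Zdravo, Beograde!
```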

We can think of other scenarios, like building an index for search and asking for lemmatization, where your point 3) holds.

@grhoten
Member

grhoten commented Feb 27, 2024

It would be helpful to see a summary of what formats are available out there. Using a format that is compatible with the Unicode license is preferable.

Mihai did point out DMLex, which seems promising.

I like the idea of interoperability and leveraging existing repositories. On the other hand, I don't like restrictions on adding grammemes and other morphological information. For example, it's important to know the phonetic information, or whether a word starts or ends with a vowel or consonant. That's important for grammatical agreement in several languages, including English, French and Korean. So if adding such information is difficult, I'd like to steer clear of such restrictions.
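
As a small illustration of why phonetic information has to live in the lexicon, consider the English indefinite article, which depends on the first sound rather than the first letter. The table and the function name below are hypothetical; a real lexicon would carry this as one property among many.

```python
# Hypothetical entries: the onset is phonetic, so spelling alone is not enough.
STARTS_WITH_VOWEL_SOUND = {
    "apple": True,
    "hour": True,         # silent "h"
    "university": False,  # begins with a /j/ sound
    "cat": False,
}


def indefinite_article(word):
    """Pick 'a' or 'an' from the stored phonetic property of the word."""
    return "an" if STARTS_WITH_VOWEL_SOUND[word] else "a"


for word in STARTS_WITH_VOWEL_SOUND:
    print(indefinite_article(word), word)
```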

If a lexicon format can help sort a list of English adjectives correctly, that would be a strong format to consider, but it's not a requirement. Adjective order in English is a helpful problem to solve.

Conceptually, I'd like the data structured so that a lemma is associated with all its surface forms, and each surface form is annotated with grammemes that differentiate it from the other surface forms under that lemma. As an example, Wiktionary has katt in Swedish, and it has a well-annotated declension table and pronunciation.
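
One possible in-memory shape for "a lemma with grammeme-annotated surface forms" is sketched below, using the nominative forms of Swedish katt. The class names, the `find` method, and the grammeme labels are assumptions made for illustration, not a proposed schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SurfaceForm:
    text: str
    grammemes: frozenset  # labels that distinguish this form from its siblings


@dataclass
class LexicalEntry:
    lemma: str
    pos: str
    forms: list = field(default_factory=list)

    def find(self, *grammemes):
        """Return the texts of all forms that carry every requested grammeme."""
        wanted = frozenset(grammemes)
        return [f.text for f in self.forms if wanted <= f.grammemes]


katt = LexicalEntry("katt", "noun", [
    SurfaceForm("katt", frozenset({"nominative", "indefinite", "singular"})),
    SurfaceForm("katten", frozenset({"nominative", "definite", "singular"})),
    SurfaceForm("katter", frozenset({"nominative", "indefinite", "plural"})),
    SurfaceForm("katterna", frozenset({"nominative", "definite", "plural"})),
])

print(katt.find("definite", "plural"))  # ['katterna']
```

Keeping grammemes as an open set of labels, rather than fixed fields, is one way to avoid the restrictions on adding new morphological information mentioned above.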

I'm sure there will be discussions on what should be implicit and what should be explicit in such structured lexicons, and I'd prefer to have that as a separate topic. I'd also like to defer the API and code discussion to a separate topic.

@BrunoCartoni

BrunoCartoni commented Feb 27, 2024 via email

@nciric
Contributor Author

nciric commented Feb 27, 2024

> Mihai did point out DMLex, which seems promising.

DMLex also sounds promising, thanks for linking.

> I'm sure there will be discussions on what should be implicit and what should be explicit in such structured lexicons, and I'd prefer to have that as a separate topic. I'd also like to defer the API and code discussion to a separate topic.

I opened #3 for discussing APIs (and use cases).

@nciric nciric added discuss Discussion item and removed question Further information is requested labels Feb 28, 2024
@grhoten
Member

grhoten commented Dec 10, 2024

A different format is being used in pull request #35, which was merged today. Changing to a different format is possible, but the tools should create the same compiled binary format. A lot of effort has gone into making the binary format as small and as fast as possible for each language. The binary format compresses better on very regular inflection patterns of lemmas with different suffixes.

If a different source format is used, techniques should be implemented for scalable filtering of the data. For example, you may want to combine 2 data sources, with one taking precedence over the other. You may want to add, remove, or reorder some inflection tables for ambiguous words. Uncommon, non-standard or rare forms may need to be omitted to save space, like adjective forms of city names or plural forms of people's names. Forms that can be derived algorithmically may also be omitted, like deciding when to add 's or s' in English. Those are some things to consider when switching to a different lexicon format.
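
Two of the filtering steps described above, precedence between sources and omission of rare forms, could look roughly like this. The data layout (`{lemma: [(form, tags), ...]}`), the `rare` tag, and `merge_lexicons` are hypothetical; this sketch does not reflect the format merged in pull request #35 or the project's actual tooling.

```python
def merge_lexicons(primary, secondary, omit_tags=frozenset()):
    """Merge two {lemma: [(form, tags), ...]} sources.

    Entries from `primary` replace same-lemma entries from `secondary`;
    forms carrying any tag in `omit_tags` are dropped.
    """
    merged = {**secondary, **primary}  # primary wins on lemma conflicts
    result = {}
    for lemma, forms in merged.items():
        kept = [(form, tags) for form, tags in forms if not (set(tags) & omit_tags)]
        if kept:
            result[lemma] = kept
    return result


curated = {"Anna": [("Anna", ("singular",))]}
bulk = {
    "Anna": [("Anna", ("singular",)), ("Annas", ("plural", "rare"))],
    "cat": [("cat", ("singular",)), ("cats", ("plural",))],
    "Paris": [("Parisian", ("adjective", "rare"))],
}

merged = merge_lexicons(curated, bulk, omit_tags=frozenset({"rare"}))
print(merged)
```

Here the curated source overrides the bulk source for "Anna", and "Paris" disappears entirely because its only form is tagged as rare.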

4 participants