Deinflection data format update #581

Closed
toasted-nutbread opened this issue Jan 27, 2024 · 1 comment

@toasted-nutbread

I'm messing around with the idea of updating the structure of the deinflection file to support a few things:

  1. Better clarity - things like "fix deinflection bug #547" would be defined more explicitly, rather than having to be defined manually using bitflag magic.
  2. More generalized - should be more generalized for other languages and use less Japanese-specific naming.
  3. Internationalization - names/descriptions can have different variations provided for other languages.
  4. Extensibility - Internationalization features and new rules can eventually be imported into a single deinflector. This will obviously require changes to the deinflector code, but the intent is to make the source data format more conducive to this.
  5. Cleaner code - less manual definition of bitflags will be needed; the bitflags can be automatically generated from the input file(s).

So here's somewhat of a preview of what might work:

{
    "rules": {
        "v1": {
            "name": "Ichidan verb",
            "partsOfSpeech": ["v1"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "一段動詞"
                }
            ],
            "subRules": ["v1d", "v1p"]
        },
        "v1d": {
            "name": "Ichidan verb, dictionary form",
            "partsOfSpeech": ["v1"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "一段動詞、辞書形"
                }
            ]
        },
        "v1p": {
            "name": "Ichidan verb, progressive or perfect form",
            "partsOfSpeech": ["v1"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "一段動詞、進行形または完了形"
                }
            ]
        },
        "v5": {
            "name": "Godan verb",
            "partsOfSpeech": ["v5"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "五段動詞"
                }
            ]
        },
        "vk": {
            "name": "Kuru verb",
            "partsOfSpeech": ["vk"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "来る動詞"
                }
            ]
        },
        "vs": {
            "name": "Suru verb",
            "partsOfSpeech": ["vs"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "する動詞"
                }
            ]
        },
        "vz": {
            "name": "Zuru verb",
            "partsOfSpeech": ["vz"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "ずる動詞"
                }
            ]
        },
        "adj-i": {
            "name": "Adjective with i ending",
            "partsOfSpeech": ["adj-i"],
            "i18n": [
                {
                    "language": "ja",
                    "name": "形容詞"
                }
            ]
        },
        "iru": {
            "name": "Intermediate -iru endings for progressive or perfect tense",
            "partsOfSpeech": []
        }
    },
    "transforms": [
        {
            "name": "-ba",
            "description": "Conditional",
            "i18n": [
                {
                    "language": "ja",
                    "name": "ば",
                    "description": "仮定形"
                }
            ],
            "variants": [
                {"suffixIn": "ければ", "suffixOut": "い", "rulesIn": [], "rulesOut": ["adj-i"]},
                {"suffixIn": "えば", "suffixOut": "う", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "けば", "suffixOut": "く", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "げば", "suffixOut": "ぐ", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "せば", "suffixOut": "す", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "てば", "suffixOut": "つ", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "ねば", "suffixOut": "ぬ", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "べば", "suffixOut": "ぶ", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "めば", "suffixOut": "む", "rulesIn": [], "rulesOut": ["v5"]},
                {"suffixIn": "れば", "suffixOut": "る", "rulesIn": [], "rulesOut": ["v1", "v5", "vk", "vs", "vz"]}
            ]
        }
    ]
}
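
To illustrate the "bitflags can be automatically generated" goal, here is a minimal sketch of how flags might be derived from the rules object above. The names RuleSketch and generateRuleFlags are hypothetical, and treating a rule with subRules as the bitwise OR of its sub-rules is only one possible interpretation of that field, not a confirmed design.

```typescript
// Hypothetical shape of one entry in "rules"; only the fields used here.
interface RuleSketch {
    name: string;
    subRules?: string[];
}

// Gives every rule without subRules its own bit, then gives each parent
// rule the bitwise OR of its sub-rules' flags.
// Assumption: only one level of nesting; deeper nesting would need a recursive pass.
function generateRuleFlags(rules: Record<string, RuleSketch>): Map<string, number> {
    const flags = new Map<string, number>();
    let nextBit = 0;
    for (const [id, rule] of Object.entries(rules)) {
        if (rule.subRules !== undefined && rule.subRules.length > 0) { continue; }
        if (nextBit >= 32) { throw new Error("Too many rules for 32-bit flags"); }
        flags.set(id, 1 << nextBit);
        nextBit += 1;
    }
    for (const [id, rule] of Object.entries(rules)) {
        if (rule.subRules === undefined || rule.subRules.length === 0) { continue; }
        let combined = 0;
        for (const subId of rule.subRules) {
            const subFlag = flags.get(subId);
            if (subFlag === undefined) {
                throw new Error(`Rule "${id}" has an unknown or nested sub-rule "${subId}"`);
            }
            combined |= subFlag;
        }
        flags.set(id, combined);
    }
    return flags;
}

// Trimmed-down version of the rules in the preview.
const exampleRuleFlags = generateRuleFlags({
    "v1": {name: "Ichidan verb", subRules: ["v1d", "v1p"]},
    "v1d": {name: "Ichidan verb, dictionary form"},
    "v1p": {name: "Ichidan verb, progressive or perfect form"},
    "v5": {name: "Godan verb"},
});
// v1d -> 1, v1p -> 2, v5 -> 4, v1 -> 3 (v1d | v1p)
```

With something like this, the data file would only ever mention rule names, and the numeric flags would stay an implementation detail of the deinflector.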

A few notes:

  • The rules.*.partsOfSpeech field corresponds to the [3] field of each source definition in dictionaries. In general, these values correlate with the name of the rule, but they do not need to match.
  • In the rules.v1 entry, note the "subRules": ["v1d", "v1p"] declaration. This is intended to map to the changes in "fix deinflection bug #547".
  • The i18n will be used to provide language translations, which are optional. Dictionary or localization providers should also be able to eventually provide these separately if desired.
  • As mentioned in this comment, I renamed the deinflection list to transforms for generalization.
  • The individual declarations themselves are just listed as variants.
  • kana(In|Out) has been renamed to suffix(In|Out).
  • (I'm not that good at technical Japanese to say that all of my i18n's are correct, so feel free to correct anything that's wrong/missing if we proceed with this.)
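
As a rough illustration of how the variants fields in the notes above might be consumed, here is a minimal sketch of a single deinflection step. The names VariantSketch, CandidateSketch and applyVariantsSketch are hypothetical, rules are kept as plain string arrays rather than bitflags for readability, and the rule-matching condition is an assumption rather than a description of the actual deinflector.

```typescript
// Hypothetical type for one entry of a transform's "variants" array.
interface VariantSketch {
    suffixIn: string;
    suffixOut: string;
    rulesIn: string[];
    rulesOut: string[];
}

// A deinflection candidate: the rewritten term plus the rules it may belong to.
interface CandidateSketch {
    term: string;
    rules: string[];
}

function applyVariantsSketch(term: string, currentRules: string[], variants: VariantSketch[]): CandidateSketch[] {
    const candidates: CandidateSketch[] = [];
    for (const variant of variants) {
        // Assumption: a term with no rules yet (raw input text) accepts any variant;
        // otherwise it must share at least one rule with the variant's rulesIn.
        const rulesMatch = currentRules.length === 0 ||
            variant.rulesIn.some((rule) => currentRules.includes(rule));
        if (!rulesMatch || !term.endsWith(variant.suffixIn)) { continue; }
        // Swap the inflected suffix for the uninflected one.
        const stem = term.slice(0, term.length - variant.suffixIn.length);
        candidates.push({term: stem + variant.suffixOut, rules: variant.rulesOut});
    }
    return candidates;
}

// A few of the "-ba" variants from the preview.
const baVariantsSketch: VariantSketch[] = [
    {suffixIn: "ければ", suffixOut: "い", rulesIn: [], rulesOut: ["adj-i"]},
    {suffixIn: "えば", suffixOut: "う", rulesIn: [], rulesOut: ["v5"]},
    {suffixIn: "れば", suffixOut: "る", rulesIn: [], rulesOut: ["v1", "v5", "vk", "vs", "vz"]},
];

const baCandidates = applyVariantsSketch("高ければ", [], baVariantsSketch);
// Two candidates: "高い" (adj-i) and "高ける" (v1, v5, vk, vs, vz).
```

Both candidates are produced on purpose: the rulesOut values are what would let a later dictionary lookup, filtered by partsOfSpeech, keep only the candidates that match a real entry.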

Thoughts on naming:

Overall, I'm not sure what the best naming strategy is for everything in here, so I'm open to suggestions. Primarily, I'm not sure if "rule" is a good name for how it's being used here. The same goes for "variants", but I couldn't immediately think of anything clearer. I tried to avoid having both "rule" and "reason", since I think the two can be easily confused. So some of the current types I'm looking at for the raw JSON file would be something like:

  • Transformation
  • TransformationVariant
  • TransformationRule

Again, please provide any thoughts on alternate ways to name these.
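
To make the naming question more concrete, here is one way the three type names could map onto the JSON preview. This is only a sketch: I18nEntrySketch and TransformationFile are hypothetical names that do not appear in the proposal, and which fields are optional is a guess based on the sample data.

```typescript
// Hypothetical helper type for entries of the "i18n" arrays.
interface I18nEntrySketch {
    language: string;
    name: string;
    description?: string;
}

// One value of the top-level "rules" object.
interface TransformationRule {
    name: string;
    partsOfSpeech: string[];
    i18n?: I18nEntrySketch[];
    subRules?: string[];
}

// One entry of a transform's "variants" array.
interface TransformationVariant {
    suffixIn: string;
    suffixOut: string;
    rulesIn: string[];
    rulesOut: string[];
}

// One entry of the top-level "transforms" array.
interface Transformation {
    name: string;
    description?: string;
    i18n?: I18nEntrySketch[];
    variants: TransformationVariant[];
}

// Hypothetical name for the whole file.
interface TransformationFile {
    rules: Record<string, TransformationRule>;
    transforms: Transformation[];
}

// A cut-down instance of the preview, checked against the types above.
const sampleFile: TransformationFile = {
    rules: {
        "adj-i": {
            name: "Adjective with i ending",
            partsOfSpeech: ["adj-i"],
            i18n: [{language: "ja", name: "形容詞"}],
        },
    },
    transforms: [
        {
            name: "-ba",
            description: "Conditional",
            i18n: [{language: "ja", name: "ば", description: "仮定形"}],
            variants: [
                {suffixIn: "ければ", suffixOut: "い", rulesIn: [], rulesOut: ["adj-i"]},
            ],
        },
    ],
};
```

Written out this way, a Transformation is visibly the container and a TransformationVariant is one suffix rewrite inside it, which is the relationship the names need to convey.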

@Casheeew
Member

Casheeew commented Jan 28, 2024

I think a good name should express that Transformation is bigger than TransformationVariant. I was pretty confused as to what TransformationVariant was supposed to be when I first read the deinflector code (I didn't think of it as a subclass of transformations).

Some names I can think of:

  • Transformation and AtomicTransformation (inspired by Rust and cpp; this also has the benefit that "atomic" means this is the smallest case possible, so the relationship can be easily inferred from Transformation or TransformationChain etc.)
  • TransformationGroup and Transformation
  • Transformation and TransformationCase