Update language features docs (#1148)
* update docs

* delete unwanted code

* update link

* add fullstop

* add may

* change rule to condition

* change confer to refer to
Casheeew authored Jun 29, 2024
1 parent b4e885d commit 4a0689a
Showing 2 changed files with 87 additions and 14 deletions.
93 changes: 79 additions & 14 deletions docs/development/language-features.md
@@ -41,7 +41,7 @@ That's it! Your language should now be selectable from the dropdown, and may wor

You should first have the repo set up locally according to the instructions in the [contributing guidelines](../../CONTRIBUTING.md).

A language descriptor in `language-descriptors.js` has several optional fields for more advanced features. We've already mentioned `textPreprocessors`, but there are also `languageTransforms`, `textPostprocessors`, and `isTextLookupWorthy`. Let's go through them (see also the full type definition in `language-descriptors.d.ts`).
A language descriptor in `language-descriptors.js` has several optional fields for more advanced features. We've already mentioned `textPreprocessors`, but there are also `languageTransforms`, `textPostprocessors`, `isTextLookupWorthy`, and `readingNormalizer`. Let's go through them (see also the full type definition in `language-descriptors.d.ts`).

### Text Preprocessors

@@ -133,7 +133,7 @@ Deinflection is the process of converting a word to its base or dictionary form.

These grammatical rules are located in files such as `english-transforms.js`.

> Not all the grammatical rules of a language can or need to be implemented in the transforms file. Even a little bit goes a long way, and you can always add more rules later. For every couple rules you add, write some tests in the respective file (see `english-transforms.test.js`), and run `npm run test:unit`. This will help you verify that your rules are correct, and make sure nothing is accidentally broken along the way.
> Not all the grammatical rules of a language can or need to be implemented in the transforms file. Even a little bit goes a long way, and you can always add more rules later. For every couple of rules you add, write some tests in the respective file ([see the writing tests section below](#writing-deinflection-tests)). This will help you verify that your rules are correct, and make sure nothing is accidentally broken along the way.

Transforms files should export a `LanguageTransformDescriptor`, which is then imported in `language-descriptors.js`.

@@ -142,12 +142,14 @@ Transforms files should export a `LanguageTransformDescriptor`, which is then im
export type LanguageTransformDescriptor = {
language: string;
conditions: ConditionMapObject;
transforms: Transform[];
transforms: {
[name: string]: Transform;
};
};
```

- `language` is the ISO code of the language
- `conditions` are an array of parts of speech, and grammatical forms that are used to check which deinflections make sense. They are referenced by the deinflection rules.
- `conditions` is an object containing parts of speech and grammatical forms that are used to check which deinflections make sense. They are referenced by the deinflection rules.
- `transforms` are the actual deinflection rules

Let's try and write a bit of deinflection for English, from scratch.
@@ -159,17 +161,17 @@ import { suffixInflection } from "../language-transforms.js";
export const englishTransforms = {
language: "en",
conditions: {},
transforms: [
{
transforms: {
plural: {
name: "plural",
description: "Plural form of a noun",
rules: [suffixInflection("s", "", [], [])],
},
],
},
};
```

This is a simple example for English, where the only deinflection rule is to remove the "s" from the end of a noun to get the singular form. The `suffixInflection` function is a helper that creates a deinflection rule for a suffix. It takes the suffix to remove, what to replace it with, and two more parameters for conditions, which we will look at next. The `suffixInflection` is the most common type of deinflection rule across languages.
This is a simple example for English, where the only deinflection rule is to remove the "s" from the end of a noun to get the singular form. The `suffixInflection` function is a helper that creates a deinflection rule for a suffix. It takes the suffix to remove, what to replace it with, and two more parameters for conditions, which we will look at next. Suffix inflections are the most common type of deinflection rule across languages. The inner `name` (`plural`) is the label displayed during lookup, while the outer `plural` key is only referenced internally within the file.
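
To make the idea of a suffix rule concrete, here is a minimal sketch of what such a rule could look like. The `makeSuffixRule` helper and the shape of the object it returns are assumptions made for illustration; they are not the actual `suffixInflection` implementation, which may differ.

```js
// Hypothetical stand-in for a suffix helper; not the real suffixInflection.
function makeSuffixRule(inflectedSuffix, deinflectedSuffix, conditionsIn, conditionsOut) {
  return {
    // True when the text ends with the inflected suffix.
    isInflected: (text) => text.endsWith(inflectedSuffix),
    // Swap the inflected suffix for the deinflected one.
    deinflect: (text) =>
      text.slice(0, text.length - inflectedSuffix.length) + deinflectedSuffix,
    conditionsIn,
    conditionsOut,
  };
}

const pluralRule = makeSuffixRule("s", "", [], []);

// With no conditions, every word ending in "s" yields a candidate.
const candidates = ["cats", "read", "boss"]
  .filter((word) => pluralRule.isInflected(word))
  .map((word) => pluralRule.deinflect(word));

console.log(candidates); // [ 'cat', 'bos' ]
```

Note that "boss" produces the candidate "bos" here. Candidates that do not exist in the dictionary simply find no entry, but spurious matches like this are one reason conditions are worth adding.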

For the input string "cats", the following strings will be looked up:

@@ -187,15 +189,15 @@ conditions: {
isDictionaryForm: true,
},
},
transforms: [
{
transforms: {
"plural": {
name: "plural",
description: "Plural form of a noun",
rules: [
suffixInflection("s", "", [], ["n"])
],
},
],
},
```

Now, only dictionary entries marked with the same "n" condition will be eligible for matching the `plural` rule. The verb "read" should be marked as "v" in the dictionary, and will no longer be matched by the `plural` rule. The entries in the dictionary need to be marked with the exact same conditions defined in the `conditions` object. The `isDictionaryForm` field can be set to `false` to allow some conditions to be used only in between rules, and not in the dictionary. In most cases, however, it will be set to `true`.
@@ -244,21 +246,80 @@ conditions: {
isDictionaryForm: true,
},
},
transforms: [
{
transforms: {
"plural": {
name: "plural",
description: "Plural form of a noun",
rules: [
suffixInflection("s", "", ["np"], ["ns"])
],
},
],
},
```

Since `ns` and `np` are subconditions of `n` they will both match with `n`, but not with each other. This covers all of the requirements we have considered.
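
The matching behavior described above can be sketched as follows. This is only an illustration: the `subConditions` field, the `expandCondition` and `conditionsMatch` helpers, and the set-based comparison are assumptions made for this example, and the real transformer may represent and compare conditions differently.

```js
// Hypothetical condition table; field names are assumed for illustration.
const conditions = {
  n: { name: "Noun", isDictionaryForm: true, subConditions: ["np", "ns"] },
  np: { name: "Noun plural", isDictionaryForm: true },
  ns: { name: "Noun singular", isDictionaryForm: true },
  v: { name: "Verb", isDictionaryForm: true },
};

// Collect a condition together with all of its (nested) subconditions.
function expandCondition(tag) {
  const expanded = new Set([tag]);
  for (const sub of conditions[tag]?.subConditions ?? []) {
    for (const nested of expandCondition(sub)) expanded.add(nested);
  }
  return expanded;
}

// Two conditions match when their expanded sets overlap.
function conditionsMatch(a, b) {
  const expandedA = expandCondition(a);
  return [...expandCondition(b)].some((tag) => expandedA.has(tag));
}

console.log(conditionsMatch("n", "ns")); // true
console.log(conditionsMatch("n", "np")); // true
console.log(conditionsMatch("ns", "np")); // false
```

Under this sketch, a dictionary entry tagged `n` is accepted by a rule that outputs `ns`, while a chain that would feed an `np` result into a rule expecting `ns` is rejected.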

`suffixInflection` is one of a few helper functions. You can also write more complex rules using a regex and a function for deinflecting. There are examples of this across the language transforms files.

#### Writing Deinflection Tests

Now that you have added a couple of deinflection rules, you might want to write some tests to check that the deinflections behave correctly. Let's say we want to test our `plural` and `possessive` rules, both individually and combined. Our test file should look like this:

```js
// english-transforms.test.js
import { englishTransforms } from "../../ext/js/language/en/english-transforms.js";
import { LanguageTransformer } from "../../ext/js/language/language-transformer.js";
import { testLanguageTransformer } from "../fixtures/language-transformer-test.js";

const tests = [
  {
    category: "plurals and possessive",
    valid: true,
    tests: [
      { term: "cat", source: "cats", rule: "ns", reasons: ["plural"] },
      { term: "cat", source: "cat's", rule: "ns", reasons: ["possessive"] },
      {
        term: "cat",
        source: "cats'",
        rule: "ns",
        reasons: ["plural", "possessive"],
      },
    ],
  },
];

const languageTransformer = new LanguageTransformer();
languageTransformer.addDescriptor(englishTransforms);
testLanguageTransformer(languageTransformer, tests);
```

The part we want to examine is the `tests` array. The other things are common across all test files.

- `term` is the final form of the deinflected word.
- `source` is the source word to be deinflected to `term`.
- `rule` is the final condition of `term`. Here, we used `ns` because `cat` is a singular noun.
- `reasons` represents the chain of deinflection rules needed to get from `source` to `term`.
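
To see how a `reasons` chain can arise, here is a toy deinflector that repeatedly applies suffix rules and records which transform produced each candidate. The rule table and the `deinflectAll` function are hypothetical. Prepending each newly applied transform is an assumption chosen because it reproduces the ordering seen in the `cats'` test case, not a statement about how `LanguageTransformer` works internally.

```js
// Hypothetical rule table: transform name -> [inflected suffix, replacement] pairs.
const toyTransforms = {
  plural: [["s", ""]],
  possessive: [["'s", ""], ["s'", "s"]],
};

// Breadth-first expansion of every candidate reachable from the source text.
function deinflectAll(source) {
  const results = [{ term: source, reasons: [] }];
  for (let i = 0; i < results.length; i++) {
    const { term, reasons } = results[i];
    for (const [name, rules] of Object.entries(toyTransforms)) {
      for (const [inflected, deinflected] of rules) {
        if (!term.endsWith(inflected)) continue;
        const next = term.slice(0, term.length - inflected.length) + deinflected;
        // Assumption: the newest transform goes at the front of the chain.
        results.push({ term: next, reasons: [name, ...reasons] });
      }
    }
  }
  return results;
}

const chain = deinflectAll("cats'");
console.log(chain.map(({ term, reasons }) => `${term} <- [${reasons.join(", ")}]`));
```

Every rule here shortens the text, so the loop always terminates. For `cats'`, the sketch first strips the possessive to get `cats`, then the plural to get `cat`.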

You can check that all the tests pass by running `npm run test:unit`.

> This command runs all Yomitan unit test files. To only run a single test file, you can instead opt for `npx vitest <file-name>`.

Now, we may want to verify that `boss` really does not deinflect to `bo`. You can add to the `tests` array:

```js
{
  category: 'invalid deinflections',
  valid: false,
  tests: [
    {term: 'boss', source: 'bo', rule: 'ns', reasons: ['plural', 'plural']},
  ],
},
```

Here, setting `valid` to `false` tells the test function to fail this test case if `boss` does deinflect to `bo` with the `ns` condition through two applications of the `plural` rule.

You can also optionally pass a `preprocess` helper function to `testLanguageTransformer`. Refer to the language transforms test files for its specific use case.

### Text Postprocessors

In special cases, text may need to be modified after deinflection. These work exactly like text preprocessors, but are applied after deinflection. Currently, this is only used for Korean, where the Hangul text is disassembled into jamo during preprocessing, and so must be reassembled after deinflection.
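
As a rough illustration of the disassemble-then-reassemble idea, the sketch below uses Unicode normalization, which decomposes Hangul syllables into conjoining jamo (NFD) and recomposes them (NFC). This is an assumption made for the example; the actual Korean processors may use a different jamo representation or library.

```js
// Illustration only: Unicode normalization as a stand-in for jamo handling.
const disassemble = (text) => text.normalize("NFD"); // syllables -> jamo
const reassemble = (text) => text.normalize("NFC"); // jamo -> syllables

const word = "한국어";
const jamo = disassemble(word);

console.log(word.length); // 3
console.log(jamo.length); // 8
console.log(reassemble(jamo) === word); // true
```

Working on the jamo form lets suffix rules match parts of a syllable block, and the final reassembly restores text that can be found in the dictionary.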
@@ -267,6 +328,10 @@ In special cases, text may need to be modified after deinflection. These work ex

Some features rely on checking whether a string could possibly be a word in the language. For example, trying to look up, in an English dictionary, a word written with non-Latin characters (e.g. "日本語") will never yield any results. To prevent unnecessary lookups, an `isTextLookupWorthy` function can be provided; otherwise, all text will be looked up.
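
A check of this kind could look like the sketch below for a Latin-script language. The function name `isLatinTextLookupWorthy` and the string-in, boolean-out signature are assumptions for illustration; refer to `language-descriptors.d.ts` for the actual type.

```js
// Hypothetical check: worth looking up only if at least one Latin letter is present.
function isLatinTextLookupWorthy(text) {
  return /\p{Script=Latin}/u.test(text);
}

console.log(isLatinTextLookupWorthy("cats")); // true
console.log(isLatinTextLookupWorthy("日本語")); // false
```

Testing for "at least one" Latin letter rather than "only" Latin letters keeps text with punctuation or digits eligible for lookup.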

### Reading Normalizers

In certain languages, dictionary entries may contain readings that indicate how a word is pronounced, e.g. Kana for Japanese and Pinyin for Chinese. Dictionaries may be inconsistent in how they store these readings, so entries for the same word are often split when looked up even though they share the same reading. In these cases, you can use a `readingNormalizer` function to normalize the readings to a common format.
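
As a sketch of what normalizing might involve, the function below maps several spellings of the same Pinyin reading to one form. The name `normalizePinyinReading` and the specific choices (Unicode NFC, lowercasing, removing whitespace) are assumptions for illustration, not the normalizer any language in the repository actually uses.

```js
// Hypothetical normalizer: compose diacritics, lowercase, drop whitespace.
function normalizePinyinReading(reading) {
  return reading.normalize("NFC").toLowerCase().replace(/\s+/g, "");
}

// Three ways a dictionary might store the same reading.
const variants = ["nǐ hǎo", "Nǐhǎo", "ni\u030C ha\u030Co"];
const normalized = variants.map(normalizePinyinReading);

console.log(new Set(normalized).size); // 1
```

Once all three variants collapse to the same string, entries that differ only in how the reading was written can be grouped together.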

## Stuck?

If you have any questions, please feel free to open a Discussion on GitHub, or find us on [TheMoeWay Discord](https://discord.gg/UGNPMDE7zC) at [#yomitan-development](https://discord.com/channels/617136488840429598/1081538711742844980).
8 changes: 8 additions & 0 deletions test/language/english-transforms.test.js
@@ -28,6 +28,7 @@ const tests = [
tests: [
{term: 'cat', source: 'cats', rule: 'ns', reasons: ['plural']},
{term: 'cat', source: 'cat\'s', rule: 'ns', reasons: ['possessive']},
{term: 'cat', source: 'cats\'', rule: 'ns', reasons: ['plural', 'possessive']},
{term: 'cats', source: 'cats\'', rule: 'ns', reasons: ['possessive']},
{term: 'dirt', source: 'dirty', rule: 'ns', reasons: ['-y']},
{term: 'haze', source: 'hazy', rule: 'ns', reasons: ['-y']},
@@ -144,6 +145,13 @@ const tests = [
{term: 'humble', source: 'humbly', rule: 'adj', reasons: ['adverb']},
],
},
{
category: 'invalid deinflections',
valid: false,
tests: [
{term: 'boss', source: 'bo', rule: 'ns', reasons: ['plural', 'plural']},
],
},
];
/* eslint-enable @stylistic/no-multi-spaces */
