Trying to add Ukrainian and failing miserably #73

Closed
dgisser opened this issue Jun 18, 2024 · 4 comments · Fixed by #74

Comments

@dgisser
Contributor

dgisser commented Jun 18, 2024

Thanks for creating this project! I'm trying to add Ukrainian, here's what I got so far:

  • .env file
DEBUG_WORD=критика
MAX_MEMORY_MB=16384000
DICT_NAME=test
  • added {"iso": "uk", "language": "Ukrainian", "flag": "🇺🇦"}, to languages.json

ran ./auto.sh Ukrainian English

This creates 2 zips, which if I put into Yomitan, suck. If you go to a random Ukrainian wiki page, very few of the words highlight, including words that are for sure in kaikki like критика.

We are skipping a ton of term tags, e.g.

{
  "alt-of": 439,
  "alternative": 282,
  "morpheme": 219,
  "broadly": 115,
  "collective": 115,
  "by extension": 114,
  "predicative": 110,
  "third-person": 105,
  "with-genitive": 94,
  "in-plural": 88,
  "no-comparative": 84,
  "with-dative": 79,
  "letter": 72,
  "third person only": 71,
  "with-instrumental": 69,
  "it is": 55,
  "noun-from-verb": 37,
  "plural-normally": 36,
  "uppercase": 33,
  "lowercase": 33,
  "Western Ukraine": 33,
  "proscribed": 30,
  "genitive": 26,

etc.
as well as skipped parts of speech

"name": 2290,
"adv": 854,
"num": 129,
"intj": 125,
"prep": 105,

so maybe this is part of the problem. Look forward to any advice on how to resolve!

@StefanVukovic99
Collaborator

You did everything right (except setting max memory to 16000 GB 😅). Words are likely not getting matched because Wiktionary has diacritics on the headwords, and they aren't getting handled:
image
We'll need to add a case to the normalizeOrthography function (like #67).
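As a rough illustration of what such a normalization step does, here is a minimal Python sketch. It is not the project's `normalizeOrthography` code; the function name `strip_stress_marks` is hypothetical, and it assumes the diacritics in question are stress marks encoded as the combining acute accent (U+0301). Only that one code point is removed, so letters such as ї and й, which also decompose into a base letter plus a combining mark, survive unchanged.

```python
import unicodedata

# Assumption: the headword diacritics are stress marks (combining acute accent).
COMBINING_ACUTE = "\u0301"


def strip_stress_marks(text: str) -> str:
    """Remove combining acute accents, keeping letters like ї and й intact."""
    # Decompose so any precomposed accented letter exposes its combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop only the acute accent; the diaeresis of ї and breve of й are kept.
    without_stress = decomposed.replace(COMBINING_ACUTE, "")
    # Recompose so the result matches ordinary precomposed text.
    return unicodedata.normalize("NFC", without_stress)


print(strip_stress_marks("кри\u0301тика"))  # prints "критика"
```

Stripping every combining mark instead would turn ї into і and й into и, which is why the sketch targets a single code point.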

As for the skipped term tags/parts of speech, that's normal. The parts of speech don't matter unless/until there are deinflection rules written for that language. Adding tags to a tag_bank_term controls whether they will remain in parentheses or be moved to a yomitan tag:
image
image
Here, anatomy gets recognized and parsed out; the rest are left as-is. I'm not too happy with how the tags look in Yomitan; it may have been better to leave them all in parentheses. There are also some tags that are invisible on Wiktionary but that kaikki deduces somehow; these won't be shown in the Yomitan dict unless they are added to a tag_bank_term.
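The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag bank is modeled as a plain set, and `split_tags` and `count_skipped` are hypothetical names, not functions from this project. Recognized tags would become Yomitan tags, leftover tags would stay in parentheses, and the leftover counts correspond to the "skipped" tallies in the original report.

```python
from collections import Counter


def split_tags(raw_tags, known_tags):
    """Separate tags found in the (hypothetical) tag bank from the rest."""
    recognized = [tag for tag in raw_tags if tag in known_tags]
    leftover = [tag for tag in raw_tags if tag not in known_tags]
    return recognized, leftover


def count_skipped(entries, known_tags):
    """Tally how often each unrecognized tag appears across all entries."""
    skipped = Counter()
    for raw_tags in entries:
        _, leftover = split_tags(raw_tags, known_tags)
        skipped.update(leftover)
    return skipped


# Illustrative data only: "anatomy" is known, the other tags are not.
known = {"anatomy"}
entries = [["anatomy", "broadly"], ["broadly", "collective"]]

print(split_tags(entries[0], known))
print(count_skipped(entries, known))
```

A large skipped count under this model only means those tags stay in parentheses; it does not cause lookups to fail.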

P.S. I remember reading this issue of yours back when the official policy in the yomitan readme was 'no other languages'. I might not have even tried to merge my fork with yomitan and do all this if it wasn't for that hint that there would be support for it, so thanks 🙏

@dgisser
Contributor Author

dgisser commented Jun 18, 2024

Thanks!! Just copying the Russian normalizeOrthography rule greatly improves the performance. Let me know if you would like me to submit a PR with these very minor changes. Also, I'm amazed that you remember that issue, in Korean no less! I'm so happy that Korean is available in Yomitan and it is so powerful; much better than any other Chrome extension out there!

@StefanVukovic99
Collaborator

StefanVukovic99 commented Jun 18, 2024

Feel free to PR, then Ukrainian dicts will be included automatically from the next release!

Also check out the language docs to properly add Ukrainian to Yomitan. Texts with no diacritics or full diacritics should work with these dicts, but you'll want to add the same diacritics processing to Yomitan (like yomidevs/yomitan#1057) so that texts with partial diacritics, and other dicts, will work.

@dgisser dgisser mentioned this issue Jun 18, 2024
@dgisser
Contributor Author

dgisser commented Jun 19, 2024

Yeah, normally I would be really into doing something like that but I'm just doing this for a friend who is learning Ukrainian. I don't have any knowledge of Ukrainian (the most I can do is read the Russian alphabet and read a few basic words) so just getting a dictionary set up is sufficient for my needs.

@StefanVukovic99 StefanVukovic99 linked a pull request Jun 19, 2024 that will close this issue