Skip to content

Latest commit

 

History

History
370 lines (279 loc) · 15 KB

readme.org

File metadata and controls

370 lines (279 loc) · 15 KB

Table of contents

mode for reading japanese

An attempt to bring yomichan/migaku features to emacs.

See a demo here.

Features:

  • tokenize Japanese text
    • support for compound words
  • keep track of reading status of words (similar to Anki’s morphan and Migaku)
  • overlay information about the current token
    • token info
    • fontify accordingly
  • dictionary support using myougiden. This means Eng-Jp , Jp-Eng.
    • overlaid (similar to yomichan)
    • keep a buffer with lookups
  • support for jumping to reference pages (e.g. jisho, kanjiDamage)

./screenshot.png

This is work in progress.

  • [X] tokenize and colour tokens by type
  • [X] typeset the text according to learning status (still to be completed)
    • [X] known status saved in a SQLite database
    • [X] ability to change status interactively
      • mark known, unknown, ignored and learning words
  • [X] support to identify and lookup compound terms (terms created by combining two or more morphs)
  • [X] display dictionary (similar to yomichan)
  • [ ] send sentence to deepl
  • [ ] support pitch accent
  • [-] add info for N-levels (N5, N4 and N3 already done)
  • [ ] make it async
  • [X] local mode
  • [X] operate in regions
  • [x] automatically display “quick” definition when cursor-sensor-mode is enabled
  • [ ] create reports of “readability”
  • [ ] support for morphman and anki
  • [X] rename everything to something more meaningful
  • [ ] remove unnecessary code
  • [ ] reorganize code
  • [ ] keep track of known kanji?

Nomenclature

  • a Japanese text is composed of a sequence of tokens, some of them non-japanese.
  • mecab converts the input text into a sequence of tokens
  • a morph (see morphman) is a Japanese word as it is learned
    • it is composed of 3 parts:
      1. surface. The version of the root known/learned
      2. root. the actual word
      3. type. the grammatical type of word
  • this module keeps track of the learned status of a word:
    • ignore: we don’t care for it, i.e. names, or interjections)
    • learning: we are trying to remember it
    • known: the surface of the word (i.e. seen version of the root, eg. たベル vs 食べる) is known
    • unknown: added to the db, but unknown
    • if the morph is not in the db, it means it is unknown

how to use

requirements

mecab

This module relies primarily on mecab to parse japanese. In particular it uses mecab unidict.

I think the easiest way is to download it from here: https://github.com/CrescentKohana/MecabUnidic/releases/tag/v1.2.1

Download the corresponding binary and install. You should have a file support/matrix.bin that is 459Mbytes.

Then create a script to run it from the command line using the corresponding files. I call this script m-mecab.sh

Pay attention to the directory where the support files are installed. Make sure that the file mecab is executable.

#!/usr/bin/env bash
MDIR="/Users/dmg/bin/osx/MecabUnidic/support"
export LD_LIBRARY_PATH="$MDIR":
export DYLD_LIBRARY_PATH="$MDIR":
#"$MDIR/mecab" -d $MDIR -h
if [ "$1" == "" ] ; then
    "$MDIR/mecab" -d $MDIR -r "$MDIR/mecabrc"
else
    "$MDIR/mecab" -d $MDIR -r "$MDIR/mecabrc" $*
fi
#unidic22

Make sure you get the following output (replace /Users/dmg/bin/osx/m-mecab.sh with the path to mecab in your installation). Note the number of columns in the output:

echo "猫が大好きです。" | /Users/dmg/bin/osx/m-mecab.sh 
猫	名詞,普通名詞,一般,*,*,*,ネコ,猫,猫,ネコ,猫,ネコ,和,*,*,*,*,*,*,体,ネコ,ネコ,ネコ,ネコ,1,C4,*,7918141678166528,28806
が	助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*,*,*,格助,ガ,ガ,ガ,ガ,*,"動詞%F2@0,名詞%F1",*,2168520431510016,7889
大好き	形状詞,一般,*,*,*,*,ダイスキ,大好き,大好き,ダイスキ,大好き,ダイスキ,混,*,*,*,*,*,*,相,ダイスキ,ダイスキ,ダイスキ,ダイスキ,1,C1,*,6326873407758848,23017
です	助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*,*,*,助動,デス,デス,デス,デス,*,"形容詞%F2@-1,動詞%F2@0,名詞%F2@1",*,7051468750332587,25653
。	補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

myougiden (Japanese dictionary)

  • Install myougiden. this is done using pip. See its website. After running pip you need to download/install the dictionary.
  • make sure you can run it from the command line:
myougiden お願い

emacs libraries

These libraries can be installed using melpa

emacsql

https://github.com/magit/emacsql.

Note. This might no longer be required in new versions of emacs.

pos-tip

installation

  • update the variable yk-command with the location of the mecab executable. In my case it is a script that setup the proper resources used by mecab. You can see its contents here.
  • copy one of the status databases from ./dbs/ into your preferred location. The default location is ~/jp-status.db
  • copy and decompress the dictionary database ~/db/dictionary.db.bz2 to your preferred location. This database contains the quick definitions used in the cursor-sensor-mode. It is much faster than looking up words in the dictionary.
(require 'yomikun)
(require 'yomikun-dict)

;; replace with your path to mecab
(setq yk-mecab-command  "/Users/dmg/bin/osx/m-mecab.sh")

;; replace with your preferred name and location. If the database does not exist, it will be created.
(setq yk-db-status-file "~/jp-status.db")
(setq yk-db-dict-file "~/dictionary.db")

you will now have two commands available:

yk-do-buffer

this function will process the entire buffer.

and

yk-do-region

which will do only the current region.

Both commands can be run on text that has been already processed.

At this point you can then enter the yk-minor-mode. This mode has the following commands:

imark morph as ignored
kmark morph as known
lmark morph as learning
umark morph as unknown
jshow morph in jisho.org
kshow kanji in kanjidamage.com
pdisplay properties of morph at point
=mark current sentence
xexit minor mode
RETdefine term at point

Limitations

  • work in progress.
  • Tested only in macos but it should work without problems in linux
  • Processing of large text can take few seconds. For example Alice in Wonderland takes 8 seconds to process on an M1 mini.

status database

The status database is a sqlite database created and managed by emacsql. This means that all attributes are surrounded by double quotes.

The schema is fairly simple:

attributedescription
morphroot of the morph
mtypetype
surfacethe root as processed
statusone of several: known, unknown, learning
datedate the tuple was added to the relation

The primary key is (morph, mtype, surface)

there are databases with different JLPT levels at ./dbs/

mecab parsing

From each sentence we obtain the root, the type of word, and the surface (kanji/hiragana version seen). For example:

美味しい寿司を食べた。おいしくないすしはたべられない
echo "美味しい寿司を食べた。おいしくないすしはたべられない" | m-mecab.sh
美味しい	形容詞,一般,*,*,形容詞,連体形-一般,オイシイ,美味しい,美味しい,オイシー,美味しい,オイシー,和,*,*,*,*,*,*,相,オイシイ,オイシイ,オイシイ,オイシイ,"0,3",C2,*,1201225110528705,4370
寿司	名詞,普通名詞,一般,*,*,*,スシ,寿司,寿司,スシ,寿司,スシ,和,ス濁,基本形,*,*,*,*,体,スシ,スシ,スシ,スシ,"1,2",C3,*,5269967956222464,19172
を	助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*,*,*,格助,ヲ,ヲ,ヲ,ヲ,*,"動詞%F2@0,名詞%F1,形容詞%F2@-1",*,11381878116459008,41407
食べ	動詞,一般,*,*,下一段-バ行,連用形-一般,タベル,食べる,食べ,タベ,食べる,タベル,和,*,*,*,*,*,*,用,タベ,タベル,タベ,タベル,2,C1,M4@1,6220495691326081,22630
た	助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,た,タ,た,タ,和,*,*,*,*,*,*,助動,タ,タ,タ,タ,*,"動詞%F2@1,形容詞%F4@-2",*,5948916285711019,21642
。	補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
おいしく	形容詞,一般,*,*,形容詞,連用形-一般,オイシイ,美味しい,おいしく,オイシク,おいしい,オイシー,和,*,*,*,*,*,*,相,オイシク,オイシイ,オイシク,オイシイ,"0,3",C2,*,1201225076974209,4370
ない	形容詞,非自立可能,*,*,形容詞,連体形-一般,ナイ,無い,ない,ナイ,ない,ナイ,和,*,*,*,*,*,*,相,ナイ,ナイ,ナイ,ナイ,1,C3,*,7543208145986241,27442
すし	名詞,普通名詞,一般,*,*,*,スシ,寿司,すし,スシ,すし,スシ,和,ス濁,基本形,*,*,*,*,体,スシ,スシ,スシ,スシ,"1,2",C3,*,5269967855559168,19172
は	助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*,*,*,係助,ハ,ハ,ハ,ハ,*,"動詞%F2@0,名詞%F1,形容詞%F2@-1",*,8059703733133824,29321
たべ	動詞,一般,*,*,下一段-バ行,未然形-一般,タベル,食べる,たべ,タベ,たべる,タベル,和,*,*,*,*,*,*,用,タベ,タベル,タベ,タベル,2,C1,M4@1,6220495657771585,22630
られ	助動詞,*,*,*,助動詞-レル,未然形-一般,ラレル,られる,られ,ラレ,られる,ラレル,和,*,*,*,*,*,*,助動,ラレ,ラレル,ラレ,ラレル,*,動詞%F3@2,M4@1,10936575907209793,39787
ない	助動詞,*,*,*,助動詞-ナイ,終止形-一般,ナイ,ない,ない,ナイ,ない,ナイ,和,*,*,*,*,*,*,助動,ナイ,ナイ,ナイ,ナイ,*,動詞%F3@0,*,7542108634358443,27438
EOS

This output is reduced to the following. The first column is the word as seen, the second the type, then the morph, and finally the surface. Compare 美味しい and おいしい.

echo "美味しい寿司を食べた。おいしくないすしはたべられない" | m-mecab.sh | csvcut -c 1,8,11
美味しい	形容詞,美味しい,美味しい
寿司	名詞,寿司,寿司
を	助詞,を,を
食べ	動詞,食べる,食べる
た	助動詞,た,た
。	補助記号,。,。
おいしく	形容詞,美味しい,おいしい
ない	形容詞,無い,ない
すし	名詞,寿司,すし
は	助詞,は,は
たべ	動詞,食べる,たべる
られ	助動詞,られる,られる
ない	助動詞,ない,ない
EOS,,

This text would be stored as follows in the database. Note that 寿司 and 美味しい are stored twice. One for each version (kanji and hiragana).

wtyperootsurface
助動詞
助動詞ないない
助動詞られるられる
助詞
助詞
動詞食べるたべる
動詞食べる食べる
名詞寿司すし
名詞寿司寿司
形容詞無いない
形容詞美味しいおいしい
形容詞美味しい美味しい

compound terms

to be written…

speed

Processing large amounts of text is slow. In my tests, emacs can do Alice in Wonderland in around 8 seconds in an M1 mini.

  • 4.5k morphs (probably wrong due to breaking lines in wrong place)
  • 98k characters
  • mecab outputs 64k lines

The bottleneck is receiving and processing mecab’s output.

Finding compounds is optional. It is a CPU intensive process. Processing of Alice takes approximately 10 seconds.

pitch accent (to be done)

to be done…

https://github.com/javdejong/nhk-pronunciation/blob/master/nhk_pronunciation.py

txt = e.midashigo1
strlen = len(txt)
acclen = len(e.ac)
accent = "0"*(strlen-acclen) + e.ac

dictionary (to be done)

Support via an external dictionary. Most likely myougiden

current bugs/todos

  • [ ] ignore fontification is not working
  • [ ] compounds
    • [ ] add compounds of 3 kanji to db (eg: 形状詞)
    • [ ] fontify compounds
    • [ ] mark compounds of a region, rather than all buffer
    • [ ] keep track of status of compounds
  • [ ] add dictionary lookup for region

notes to myself

select glosses.ent_seq, kanjis.frequent, gloss_id, kanji, reading, pos, gloss
       from kanjis join entries using (ent_seq)  join glosses using (ent_seq) join senses using (sense_id) join readings using (ent_seq)
       --where kanji = '雨'
       ;
drop table if exists d.entries;
create table d.entries as select distinct glosses.ent_seq, kanjis.frequent, gloss_id, kanji, reading, pos, gloss from kanjis join entries using (ent_seq)  join glosses using (ent_seq) join senses using (sense_id) join readings using (ent_seq) ;
drop table if exists d.fentries;
create table d.fentries as select * from d.entries join (select ent_seq, min(gloss_id) as gloss_id from d.entries group by ent_seq, kanji, reading, pos)  using (ent_seq, gloss_id);