- mode for reading japanese
- This is work in progress.
- Nomenclature
- how to use
- Limitations
- status database
- mecab parsing
- compound terms
- speed
- pitch accent (to be done)
- dictionary (to be done)
- current bugs/todos
- notes to myself
An attempt to bring yomichan/migaku features to emacs.
See a demo here.
Features:
- tokenize Japanese text
- support for compound words
- keep track of reading status of words (similar to Anki’s morphan and Migaku)
- overlay information about the current token
- token info
- fontify accordingly
- dictionary support using myougiden. This means Eng-Jp , Jp-Eng.
- overlaid (similar to yomichan)
- keep a buffer with lookups
- support for jumping to reference pages (e.g. jisho, kanjiDamage)
- [X] tokenize and colour tokens by type
- [X] typeset the text according to learning status (still to be completed)
- [X] known status saved in a SQLite database
- [X] ability to change status interactively
- mark known, unknown, ignored and learning words
- [X] support to identify and lookup compound terms (terms created by combining two or more morphs)
- [X] display dictionary (similar to yomichan)
- [ ] send sentence to deepl
- [ ] support pitch accent
- [-] add info for N-levels (N5, N4 and N3 already done)
- [ ] make it async
- [X] local mode
- [X] operate in regions
- [x] automatically display “quick” definition when cursor-sensor-mode is enabled
- [ ] create reports of “readability”
- [ ] support for morphman and anki
- [X] rename everything to something more meaningful
- [ ] remove unnecessary code
- [ ] reorganize code
- [ ] keep track of known kanji?
- a Japanese text is composed of a sequence of tokens, some of them non-japanese.
- mecab converts the input text into a sequence of tokens
- a morph (see morphman) is a Japanese word as it is learned
- it is composed of 3 parts:
- surface. The version of the root known/learned
- root. the actual word
- type. the grammatical type of word
- it is composed of 3 parts:
- this module keeps track of the learned status of a word:
- ignore: we don’t care for it, i.e. names, or interjections)
- learning: we are trying to remember it
- known: the surface of the word (i.e. seen version of the root, eg. たベル vs 食べる) is known
- unknown: added to the db, but unknown
- if the morph is not in the db, it means it is unknown
This module relies primarily on mecab to parse japanese. In particular it uses mecab unidict.
I think the easiest way is to download it from here: https://github.com/CrescentKohana/MecabUnidic/releases/tag/v1.2.1
Download the corresponding binary and install. You should have a file support/matrix.bin that is 459Mbytes.
Then create a script to run it from the command line using the corresponding files. I call this script m-mecab.sh
Pay attention to the directory where the support files are installed. Make sure that the file mecab is executable.
#!/usr/bin/env bash
MDIR="/Users/dmg/bin/osx/MecabUnidic/support"
export LD_LIBRARY_PATH="$MDIR":
export DYLD_LIBRARY_PATH="$MDIR":
#"$MDIR/mecab" -d $MDIR -h
if [ "$1" == "" ] ; then
"$MDIR/mecab" -d $MDIR -r "$MDIR/mecabrc"
else
"$MDIR/mecab" -d $MDIR -r "$MDIR/mecabrc" $*
fi
#unidic22
Make sure you get the following output (replace /Users/dmg/bin/osx/m-mecab.sh with the path to mecab in your installation). Note the number of columns in the output:
echo "猫が大好きです。" | /Users/dmg/bin/osx/m-mecab.sh
猫 名詞,普通名詞,一般,*,*,*,ネコ,猫,猫,ネコ,猫,ネコ,和,*,*,*,*,*,*,体,ネコ,ネコ,ネコ,ネコ,1,C4,*,7918141678166528,28806 が 助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*,*,*,格助,ガ,ガ,ガ,ガ,*,"動詞%F2@0,名詞%F1",*,2168520431510016,7889 大好き 形状詞,一般,*,*,*,*,ダイスキ,大好き,大好き,ダイスキ,大好き,ダイスキ,混,*,*,*,*,*,*,相,ダイスキ,ダイスキ,ダイスキ,ダイスキ,1,C1,*,6326873407758848,23017 です 助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*,*,*,助動,デス,デス,デス,デス,*,"形容詞%F2@-1,動詞%F2@0,名詞%F2@1",*,7051468750332587,25653 。 補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25 EOS
- Install myougiden. this is done using pip. See its website. After running pip you need to download/install the dictionary.
- make sure you can run it from the command line:
myougiden お願い
These libraries can be installed using melpa
https://github.com/magit/emacsql.
Note. This might no longer be required in new versions of emacs.
- update the variable yk-command with the location of the mecab executable. In my case it is a script that setup the proper resources used by mecab. You can see its contents here.
- copy one of the status databases from ./dbs/ into your preferred location. The default location is ~/jp-status.db
- copy and decompress the dictionary database ~/db/dictionary.db.bz2 to your preferred location. This database contains the quick definitions used in the cursor-sensor-mode. It is much faster than looking up words in the dictionary.
(require 'yomikun)
(require 'yomikun-dict)
;; replace with your path to mecab
(setq yk-mecab-command "/Users/dmg/bin/osx/m-mecab.sh")
;; replace with your preferred name and location. If the database does not exist, it will be created.
(setq yk-db-status-file "~/jp-status.db")
(setq yk-db-dict-file "~/dictionary.db")
you will now have two commands available:
yk-do-buffer
this function will process the entire buffer.
and
yk-do-region
which will do only the current region.
Both commands can be run on text that has been already processed.
At this point you can then enter the yk-minor-mode. This mode has the following commands:
i | mark morph as ignored |
k | mark morph as known |
l | mark morph as learning |
u | mark morph as unknown |
j | show morph in jisho.org |
k | show kanji in kanjidamage.com |
p | display properties of morph at point |
= | mark current sentence |
x | exit minor mode |
RET | define term at point |
- work in progress.
- Tested only in macos but it should work without problems in linux
- Processing of large text can take few seconds. For example Alice in Wonderland takes 8 seconds to process on an M1 mini.
The status database is a sqlite database created and managed by emacsql. This means that all attributes are surrounded by double quotes.
The schema is fairly simple:
attribute | description |
---|---|
morph | root of the morph |
mtype | type |
surface | the root as processed |
status | one of several: known, unknown, learning |
date | date the tuple was added to the relation |
The primary key is (morph, mtype, surface)
there are databases with different JLPT levels at ./dbs/
From each sentence we obtain the root, the type of word, and the surface (kanji/hiragana version seen). For example:
美味しい寿司を食べた。おいしくないすしはたべられない
echo "美味しい寿司を食べた。おいしくないすしはたべられない" | m-mecab.sh
美味しい 形容詞,一般,*,*,形容詞,連体形-一般,オイシイ,美味しい,美味しい,オイシー,美味しい,オイシー,和,*,*,*,*,*,*,相,オイシイ,オイシイ,オイシイ,オイシイ,"0,3",C2,*,1201225110528705,4370 寿司 名詞,普通名詞,一般,*,*,*,スシ,寿司,寿司,スシ,寿司,スシ,和,ス濁,基本形,*,*,*,*,体,スシ,スシ,スシ,スシ,"1,2",C3,*,5269967956222464,19172 を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*,*,*,格助,ヲ,ヲ,ヲ,ヲ,*,"動詞%F2@0,名詞%F1,形容詞%F2@-1",*,11381878116459008,41407 食べ 動詞,一般,*,*,下一段-バ行,連用形-一般,タベル,食べる,食べ,タベ,食べる,タベル,和,*,*,*,*,*,*,用,タベ,タベル,タベ,タベル,2,C1,M4@1,6220495691326081,22630 た 助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,た,タ,た,タ,和,*,*,*,*,*,*,助動,タ,タ,タ,タ,*,"動詞%F2@1,形容詞%F4@-2",*,5948916285711019,21642 。 補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25 おいしく 形容詞,一般,*,*,形容詞,連用形-一般,オイシイ,美味しい,おいしく,オイシク,おいしい,オイシー,和,*,*,*,*,*,*,相,オイシク,オイシイ,オイシク,オイシイ,"0,3",C2,*,1201225076974209,4370 ない 形容詞,非自立可能,*,*,形容詞,連体形-一般,ナイ,無い,ない,ナイ,ない,ナイ,和,*,*,*,*,*,*,相,ナイ,ナイ,ナイ,ナイ,1,C3,*,7543208145986241,27442 すし 名詞,普通名詞,一般,*,*,*,スシ,寿司,すし,スシ,すし,スシ,和,ス濁,基本形,*,*,*,*,体,スシ,スシ,スシ,スシ,"1,2",C3,*,5269967855559168,19172 は 助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*,*,*,係助,ハ,ハ,ハ,ハ,*,"動詞%F2@0,名詞%F1,形容詞%F2@-1",*,8059703733133824,29321 たべ 動詞,一般,*,*,下一段-バ行,未然形-一般,タベル,食べる,たべ,タベ,たべる,タベル,和,*,*,*,*,*,*,用,タベ,タベル,タベ,タベル,2,C1,M4@1,6220495657771585,22630 られ 助動詞,*,*,*,助動詞-レル,未然形-一般,ラレル,られる,られ,ラレ,られる,ラレル,和,*,*,*,*,*,*,助動,ラレ,ラレル,ラレ,ラレル,*,動詞%F3@2,M4@1,10936575907209793,39787 ない 助動詞,*,*,*,助動詞-ナイ,終止形-一般,ナイ,ない,ない,ナイ,ない,ナイ,和,*,*,*,*,*,*,助動,ナイ,ナイ,ナイ,ナイ,*,動詞%F3@0,*,7542108634358443,27438 EOS
This output is reduced to the following. The first column is the word as seen, the second the type, then the morph, and finally the surface. Compare 美味しい and おいしい.
echo "美味しい寿司を食べた。おいしくないすしはたべられない" | m-mecab.sh | csvcut -c 1,8,11
美味しい 形容詞,美味しい,美味しい 寿司 名詞,寿司,寿司 を 助詞,を,を 食べ 動詞,食べる,食べる た 助動詞,た,た 。 補助記号,。,。 おいしく 形容詞,美味しい,おいしい ない 形容詞,無い,ない すし 名詞,寿司,すし は 助詞,は,は たべ 動詞,食べる,たべる られ 助動詞,られる,られる ない 助動詞,ない,ない EOS,,
This text would be stored as follows in the database. Note that 寿司 and 美味しい are stored twice. One for each version (kanji and hiragana).
wtype | root | surface |
---|---|---|
助動詞 | た | た |
助動詞 | ない | ない |
助動詞 | られる | られる |
助詞 | は | は |
助詞 | を | を |
動詞 | 食べる | たべる |
動詞 | 食べる | 食べる |
名詞 | 寿司 | すし |
名詞 | 寿司 | 寿司 |
形容詞 | 無い | ない |
形容詞 | 美味しい | おいしい |
形容詞 | 美味しい | 美味しい |
to be written…
Processing large amounts of text is slow. In my tests, emacs can do Alice in Wonderland in around 8 seconds in an M1 mini.
- 4.5k morphs (probably wrong due to breaking lines in wrong place)
- 98k characters
- mecab outputs 64k lines
The bottleneck is receiving and processing mecab’s output.
Finding compounds is optional. It is a CPU intensive process. Processing of Alice takes approximately 10 seconds.
to be done…
https://github.com/javdejong/nhk-pronunciation/blob/master/nhk_pronunciation.py
txt = e.midashigo1
strlen = len(txt)
acclen = len(e.ac)
accent = "0"*(strlen-acclen) + e.ac
Support via an external dictionary. Most likely myougiden
- [ ] ignore fontification is not working
- [ ] compounds
- [ ] add compounds of 3 kanji to db (eg: 形状詞)
- [ ] fontify compounds
- [ ] mark compounds of a region, rather than all buffer
- [ ] keep track of status of compounds
- [ ] add dictionary lookup for region
select glosses.ent_seq, kanjis.frequent, gloss_id, kanji, reading, pos, gloss
from kanjis join entries using (ent_seq) join glosses using (ent_seq) join senses using (sense_id) join readings using (ent_seq)
--where kanji = '雨'
;
drop table if exists d.entries;
create table d.entries as select distinct glosses.ent_seq, kanjis.frequent, gloss_id, kanji, reading, pos, gloss from kanjis join entries using (ent_seq) join glosses using (ent_seq) join senses using (sense_id) join readings using (ent_seq) ;
drop table if exists d.fentries;
create table d.fentries as select * from d.entries join (select ent_seq, min(gloss_id) as gloss_id from d.entries group by ent_seq, kanji, reading, pos) using (ent_seq, gloss_id);