forked from tatuylonen/wiktextract
-
Notifications
You must be signed in to change notification settings - Fork 0
/
TODO
463 lines (347 loc) · 20.3 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
# XXX cf "word" translations under "watchword" and "proverb" - what kind of
# link/ext is this? {{trans-see|watchword}}
# XXX implement parsing {{ja-see-kango|xxx}} specially, see むひ
# XXX check use of sense numbers in translations (check "eagle"/English)
# XXX in Coordinate terms, handle bullet point titles like Previous:, Next:
# (see "ten", both Coordinate terms and Synonyms)
# XXX parse "alter" tags; see where they are used (they seem to be alternate
# forms)
# XXX parse "enum" tags (args: lang, prev, next, value)
# XXX make sure parenthesized parts in the middle of form-of descriptions
# in gloss are handled properly. Check "vise" (Swedish verb).
# XXX distinguish non-gloss definition from gloss, see e.g. βούλομαι
# XXX check "unsupported tag component 'E' warning / in "word"
# XXX parse "id" field in {{trans-top|id=...}} (and overall parse wikidata ids)
# XXX implement support for image files in {{mul-symbol|...}} head, e.g.,
# Unsupported titles/Cifrão
# XXX Some declination/conjugation template arguments contain templates, e.g.
# mi/Serbo-Croatian (pronoun). Should probably strip these templates, but
# not strip as aggressively as normal clean_value() does. HTML tags and links
# should be stripped.
# XXX parse "Derived characters" section. It may contain multiple
# space-separated characters in same list item.
# XXX extract Readings section (e.g., Kanji characters)
# XXX handle qualifiers starting with "of " specially. They are quite common
# for adjectives, describing what the adjective can characterize
# XXX parse "object of a preposition", "indirect object of a verb",
# "(direct object of a verb)", "(as the object of a preposition)",
# "(as the direct object of a verbal noun)",
# in parenthesis at the end of gloss
# XXX parse [+infinitive = XXX], [+accusative = XXX], etc in gloss
# see e.g. βούλομαι. These come from {{+obj|...}}, might also just parse
# the template.
# XXX parse "construed with XXX" from sense qualifiers or capture "construed with" template
# XXX how about gloss "Compound of XXX and YYY". There are 100k of these.
# - sometimes followed by semicolon and notes or "-" and notes
# - sometimes Compound of gerund of XXX and YYY
# - sometimes Compound of imperative (noi form) of XXX and YYY
# - sometimes Compound of imperative (tu form) of XXX and YYY
# - sometimes Compound of imperative (vo form) of XXX and YYY
# - sometimes Compound of imperative (voi form) of XXX and YYY
# - sometimes Compound of imperative of XXX and YYY
# - sometimes Compound of indicative present of XXX and YYY
# - sometimes Compound of masculine plural past participle of XXX and YYY
# - sometimes Compound of past participle of XXX and YYY
# - sometimes Compound of present indicative of XXX and YYY
# - sometimes Compound of past participle of XXX and YYY
# - sometimes Compound of plural past participle of XXX and YYY
# - sometimes Compound of second-person singular imperative of XXX and YYY
# - sometimes Compound of in base a and YYY
# - sometimes Compound of in merito a and YYY
# - sometimes Compound of in mezzo a and YYY
# - sometimes Compound of in seguito a and YYY
# - sometimes Compound of nel bel mezzo di and YYY
# - sometimes Compound of per mezzo di and YYY
# - sometimes Compound of per opera di and YYY
# - sometimes Compound of XXX, YYY and ZZZ
# - sometimes Compound of XXX + YYY
# - sometimes Compound of the gerund of XXX and YYY
# - sometimes Compound of the imperfect XXX and the pronoun YYY
# - sometimes Compound of the infinitive XXX and the pronoun YYY
# XXX handle "Wiktionary appendix of terms relating to animals" ("animal")
# XXX check "Kanji characters outside the ..."
# XXX recognize "See also X" from end of gloss and move to a "See also" section
# Look at "prenderle"/Italian - two heads under same part-of-speech. Second
# head ends up inside first gloss.
# XXX handle (classifier XXX) at beginning of gloss; also (classifier: XXX)
# and (classifiers: XXX, XXX, XXX)
# E.g.: 煙囪
# Also: 筷子 (this also seems to have synonym problems!)
# XXX parse redirects, create alt_of
# XXX Handle "; also YYY" syntax in initialisms (see AA/Proper name)
# XXX parse {{zh-see|XXX}} - see 共青团
# XXX is the Usage notes section parseable in any useful way? E.g., there
# is a template {{cmn-toneless-note}}. Are there other templates that would
# be useful and common enough to parse, e.g., into tags?
# Check te/Spanish/pron, See also section (warning, has unexpected format)
# Check alt_of with "." remaining. Find them all to a separate list
# and analyze.
# XXX wikipedia, Wikispecies links. See "permit"/English/Noun
# XXX There are way too many <noinclude> not properly closed warnings
# from Chinese, e.g., 內 - find out what causes these, and either fix or
# ignore
/derived terms (seem to be referenced with {{section link|...}}
數/derived terms
仔/derived terms
人/derived terms
龍/derived terms
馬/derived terms
# In htmlgen, create links from gloss, at minimum when whole gloss matches
# a word form in English (or maybe gloss as an alternative?)
htmlgen: form-of (and alt-of) links in generated html should look at
canonical form in addition to word - now many items with accents are
not properly linked
htmlgen: strange Categories non-disambiguated in: https://kaikki.org/dictionary/English/meaning/v/vo/volva.html#volva-noun
htmlgen: compound-of field of sense not displayed in generated html
htmlgen: implement support for redirects
htmlgen: render audio-ipa for audios
htmlgen: when looking for linkage references, also check canonical
word form
Word entires may be generated from entries like "Thesaurus:ar:make happy" that
are in a wrong language. Check if they are referenced from somewhere and
maybe restrict generation to entries that are actually referenced.
Consider caching template expansions in an LRU cache. Make sure
template arguments that can be accessed from Lua are considered
properly. Cache must be reset per-page. (Does DISPLAYPAGE break this?)
Check * at the end of some derived forms in wort/English/Noun (e.g., dropwort*)
- it actually is in the original wikitext and is used as a reference symbol,
with explanation at the end of the table
- should probably just remove the *
- check if there are other similar reference symbols and how common they are
Apparently the following is not parsed correctly:
{{ws sense|any of enclosing symbols "(", ")", "[", "]", "{", "}" etc.}}
Recognize and parse {{zh-dial|...}} in linkage, e.g., 鼎/Chinese
- Variety = language (these are more like translation than synonyms)
- Location = tags, parenthesized part separate tag?
- Words may have Notes references, which may need to be parsed as senses
(not clear how common they are with other words)
{{syn|...}} not properly handled in nested glosses, see frons/Latin
sense "the forehead, brow, front", search for "vultus"
In translations, should process {{trans-see|...}} (cf. "abound with",
the only translations are with {{trans-see|...}})
ख/Translingual/Letter: DEBUG: unhandled parenthesized prefix: (Devanagari script letters) अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः, अँ, क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब, भ, म, य, र, ल, व, श, ष, स, ह, त्र, ज्ञ, क्ष, क़, ख़, ग़, ज़, झ़, ड़, ढ़, ष़ <nowiki>[</nowiki><nowiki>]</nowiki> at ['ख']
- is nowiki broken in templates? Perhaps core.py:_template_to_body should
call preprocess_text for the template body?
Check:
can/English/Verb: ERROR: no language name in translation item: : ... at ['can']
can/English/Verb: ERROR: no language name in translation item: : -bil- (tr) (verbal infix) at ['can']
Make sure rtl (right-to-left) and ltr markers are properly inserted
and matched in words (including translations and linkages), even when
splitting them.
Implement better parsing of compound_of base. There are over 135k
compound_of entries in Wiktionary, so this is common enough to care
about.
Use something similar to tr "note" detection in linkages (e.g.,
pitää/Finnish search +). Also, the parenthesized tags in Derived
terms as subheadings are currently not handled.
Check strange and incorrectly processed category link in
अधिगच्छति/Sanskrit (I suspect this may be coding error in Wiktionary,
looks like the category has a nested link)
Fix parsing linkages for letters in scripts, see P/English/Number
- longer sequences of characters
- P/Latvian "burti:" prefix handled incorrectly (should be separate linkage)
- P/Vietnamese first word incorrectly split (never split before semicolon)
- P/Vietnamese multi-char names inside parentheses handled incorrectly
(replace parentheses by comma? colon by comma?)
Handle " / " as a separator in linkages; see be/Faroese/related
(Latin-script letter names)
have/English/Translations/"to possess"/Arabic: have "- you have" at end of tr;
similarly "- I have"
language/English has weird IPAs coming from qualifier (see
/ae/ raising)
Add more sanity checks:
- number of words extracted for a few major languages (English, Korean, Zulu)
- number of extraction errors of various kinds
- number of translations extracted for major languages
- number of linkages extracted for major languages
- number of suspicious linkages
- number of suspicious alt-of/form-of
- number of alt-of
- number of form-of
- number of IPAs
- number of forms
- number of inflection templates
- number of non-disambiguated translations
- number of non-disambiguated linkages
- number of senses with tags
- number of senses with topics
お茶/Japanese/Verb:
- "ocha suru" as romanization, "suru" should not be there
- "intransitive suru" gets into end of canonical form
- stem and past are not correctly parsed
- related terms are not correctly parsed - English goes into alt
Handle and remove right-to-left mark (and left-to-right mark) from titles,
forms, translations, linkages. Add tag "right-to-left" where they were
present. (Or should we keep the marks but still add tag?)
Fix support for DISPLAYTITLE - sets page default display title, which
would be useful. Maybe extract too? Note that this may prevent
caching templates/Lua calls.
Check "class II noun" and similar "note" in translations, cf. form desc,
define required tags/mappings
Consider improving romanization recognition in translation by
capturing strings inside <span class="tr [Latn]"> and never
interpreting such strings as English. This would would probably hook
into clean_node().
Review handling of 0x200e (left-to-right mark) and 0x200f
(right-to-left mark) in linkages. Also, what if only one is present?
cf. test_script5 in test_linkages.py
B/Portugues list of characters is broken
Finish writing tests for character lists
Synonyms are not collected from nested senses. See cat/English/Noun
sense 1/1 "A domesticated species..."
Move hieroglyph conversion to postprocessing
婬/Japanese/Kanji/Definitions has a table that is not correctly handled,
and is being parsed as head form.
Handle (* Poetic.) and (** Antiquanted.) and */** at end of form, e.g.
камень/Russian
Check topics in word head, cf. अंगूठा/Hindi, which has (anatomy) in head
Fix parsing "wait on hand, foot and finger"/English head parsing
(comma splits related forms). Maybe similar magic character kludge as we do
for linkages? No, that won't work here... maybe just merge standalone
parts that are comma-separated parts of the title to the previous entry
with a comma (as a pre-processing step after splitting?). Inflection though
might be in any of the parts, but I think usually the first.
Also: "wait on someone hand, foot and finger"/English.
Ah, similar bug in alt-of/inflection-of parsing.
Translation lists missing a comma between translations are handled incorrectly.
See tree/English/Tr/Punjabi (both Gurmukhi and Shahmukhi).
Implement special case in parsing word head for paren desc:
"на (na) accusative, by"; see помножать/Russian
Something wrong in parsing آتي/Arabic "coming"
Finish datautils.py:split_slashes and use it for translations/linkages/heads
- warn when clearly ambigious
- add more tests in test_utils.py
Which language should be used in translation when langcode and language in the
entry differ? This is particularly an issue for Chinese. Or should we treat
the second-level entries (e.g., Mandarin) as tags?
Change clean_value in clean.py to receive ctx. Add "hieroglyphs" tag
if the string contains <hiero> tag. (Or maybe this needs to go in
clean_node in page.py? Or maybe these two should be merged?)
Handle Chinese usage examples better - separate Chinese (traditional,
simplified), Pinyin, translation.
Translations for Mandarin seem to have Chinese language but cmn code.
What is correct?
Parse eumhun related forms (Korean) in more detail, particularly
separate romanization and hanja forms, e.g. 度/Korean
Collect RGB values and other similar data
XXX chinese glyphs, see 內
- I'm getting dial-syn page does not exist in synonyms, but Wiktionary
shows a big list of synonyms
Some Chinese words have weird /wiki.local/ in sounds, see: 屋
- also in pronunciations, several IPAs are captured, but they are not
annotated with dialect information
- this page also has dial-syn warning in synonyms (also noted elsewhere)
- in Japanese compounds, roman transliterations incorrectly go into tags
- in Japanese, does it make sense to have most senses annotated with "kanji"?
Parse readings subsection in Japanese kanji characters, see 內
- these are apparently pronunciations, but not IPA
- perhaps these might be parsed as forms with "reading" tag and tag
for the type of reading?
Korean kanji seem to have Phonetic hangeul in pronunciation sections, see 內
Parse Phonetic hangeul in Pronunciation section (see 死/Korean)
Check extraction of pronunciations and their romanizations in 공중/Korean
# XXX add parsing chinese pronunciations, see 傻瓜 https://en.wiktionary.org/wiki/%E5%82%BB%E7%93%9C#Chinese
Check ignored_cat_re in page.py
Is Rhymes collected and displayed correctly? cf. environ/English
-abil/Romanian declension table produces wikitext that is parsed incorrectly.
Looks like table cell tags (! at least) can be preceded by whitespace, here \t.
Test Pali Alternative Scripts, see cakka/Pali
Check if form-of imports tags from head (av/Swedish) - seems to be missing
LUA ERRORS:
कट्नु/Nepali/Verb: ERROR: LUA error in #invoke ('ne-conj', 'show') parent ('Template:ne-conj', {'i': 'y', 'a': 'y'}) at ['कट्नु', 'ne-conj', '#invoke']
ogun/Yoruba: ERROR: LUA error in #invoke ('number list', 'show_box') parent ('Template:number box', {1: 'yo', 2: '20'}) at ['ogun', 'number box', '#invoke']
Heleb/Northern Kurdish: ERROR: LUA error in #invoke ('wikipedia', 'wikipedia_box') parent ('Template:wp', {'lang': 'ku'}) at ['Heleb', 'wp', '#invoke']
তাহুন/Assamese/Pronoun: ERROR: LUA error in #invoke ('links/templates
Parse declension/conjugation tables
Interpret "see also" sections. This seems to be junk in some languages/words,
but related words in others (e.g., tänään/Finnish)
Implement support for "See also translations at" (e.g., god/translations)
htmlgen: Change when htmlgen computes indexes - only do after disambiguation
htmlgen: disambiguate translation & linkage targets
htmlgen: change disambiguated links
check jakes/English - ]] at the beginning of second sense
Improve #time date format parsing. cf. hard/English.
Parse inflection-like tables from See also sections (e.g.,
sie/Karelian, "Karelian personal pronouns" table)
wikitextprocessor error: '''c''''est is parsed incorrectly (should leave quote).
This needs to be fixed.
- must change example extraction to use node_to_html and then parse from there
- this is a major change that needs proper testing! It is likely to break
a lot of existing cases.
- compare current extraction to extraction after the change, checking all
examples that change.
Working on changes to compute_coltags
- changed how column headers are merged
- still something wrong, look at aamupala/Finnish 3rd pers possessive
Fix parsing tell/English/Verb word head "(dialectal or nonstandard)"
(appears very new, not yet in my test environment)
götürmek/Turkish - inflection table "interrogative" not properly inherited
to cells below (possibly other subtitles to).
- there are also participles per person, before persons are defined???
burn/English
- pronoun before / (e.g., burn past first-person singular) is not
handled correctly - second alternative does not get the person tags
grind/English
- implement handling of lead-in text before table (here "Weak conjugation",
"Strong conjugation")
- check tables, they are unusual
- add a test for this, because of the parenthesized parts
Improve audio file URL generation in wiktextract to handle redirects
in Wikimedia Commons. Options include using Commons API to query the
actual location or parsing Commons dumps to extract redirects. (Note
that some files also seems to have missing transcoded versions -
probably about 0.1-0.5% of them)
- See download at https://kaikki.org/dictionary/rawdata.html, audio
file bulk download. We would like to make this download automatically
updated and to support redirect.
- Modify wiktextract ([email protected]:tatuylonen/wiktextract) to take
wikimedia commons redirect list as a command line option
(--commons-redirect=FILE) and change parse_pronunciation() in
wiktextract/page.py to use this information when computing mp3_url
and ogg_url, so that URLs for redirected media files point to
the correct media file in Wikimedia commons.
- Take sound-file-downloader.py (under kaikki/webgen/wiktextract) and
turn it into a proper downloader that only downloads those files that
it does not yet have. Add support for argparse, with
--datadir=DIR
--rawdata=PATH (extract.json)
- The general idea is to be able to run the downloader after extraction
(should be fast as there are usually only a small number of new files),
recreate the wiktionary-audios.tar file, and copy it to the website.
This way we can update the audio files daily. This does not handle
modified existing files, so additionally we may want to re-download
everything every few months. (Maybe we could have --check-updates option
that uses HTTP HEAD command to get file size and checks if it has changed;
it could also check if the date is newer than the existing file
modification date. Alternative is to periodically re-download everything,
which takes 1-2 weeks.)
- As an alternative, we could perhaps use the existing Wikimedia commons
download and first attempt to fetch the files from there? Perhaps
a --wikimedia-commons-archive=DIR option?
Add tests to ensure that etymology templates and audio URLs are properly extracted.
- add a new unittest file under tests
- take some sample articles, pass them to parse_page() (wiktextract/page.py),
and check that the results are as expected (i..e, etymology templates
correctly extracted, audio URLs properly extracted)
- NOTE: you cannot expect templates to be expanded, let's discuss if there
are major problems
PLAN for 7/2022:
- continue bug fixing
- converting tags to UniMorph and Universal Dependencies tags (as additional
fields, leave existing tags field as-is)
- start moving constant tables to separate files in preparation for
parsing non-English Wiktionaries
- encode links to other sections from translations (e.g.,
"intelligence -- see intelligence" in wit/English/Translations)
- make it easier to control tag bleed in inflection tables
- code refactoring to make maintenance easier (especially page.py)
Moving tags to external files:
- create directory wiktextract/wiktextract/data
- language-specific data would go under
wiktextract/wiktextract/data/<editioncode> (e.g., "en" is an <editioncode>)
- non-language-specific data (e.g., tags) would go under
wiktextract/wiktextract/data/shared
- above directories need to be included in pypi distributions and
must be installed; when used, code must properly search for them
using pkg_resources package (see luaexec.py in wiktextract for an example)
-