Guide to adding new Wiktionaries #205
Comments
Here is an example of the compounds section: https://en.wiktionary.org/wiki/polku#Compounds
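For illustration, here is a guessed sketch of what such a config entry could look like. The exact shape of `other_subtitles.json` is an assumption here, not taken from the repo; the idea is just that the "compounds" key names the heading(s) that mark a compounds section like the one linked above.

```python
# Hypothetical sketch only -- the real format of other_subtitles.json
# may differ.  Idea: map the canonical "compounds" key to the heading(s)
# a given Wiktionary uses for that section.
other_subtitles = {
    "compounds": ["Compounds"],  # heading used on en.wiktionary, e.g. polku#Compounds
    # A Russian port would list the corresponding local heading(s) here,
    # if the section exists at all on ru.wiktionary.
}
```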
When you think you've got it all ironed out, put it in the readme in a pull request and I'll merge it!
I haven't quite gotten it to work; my current version prints a huge number of errors. A small selection:
Those seem mostly harmless. DEBUG is mostly used for less-than-actual-error messages: stuff that is good to know in case something is actually wrong, messages that are used to collect data for other purposes, or cases where something has been recovered from in some way. The "no corresponding start tag" message is really common, and the unimplemented top-level templates are handled in page/parse_top_level_template(), where we parse any templates that come before the "contents" of the actual languages, if there are any. We ignore most templates like that.
Thanks for the help. The error might be in the data files; I will have to keep looking. I also can't seem to get the program to run to completion (using WSL2). It processes a large number of entries, and then I get a broken pipe error.
You shouldn't ignore these unexpected top-level node errors; they usually mean the language or POS header is not expanded. For example, …
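As an illustration (a sketch, not the original example from this comment): on fr.wiktionary the level-2 language header is a template like `== {{langue|fr}} ==` (a convention mentioned later in this thread), whereas en.wiktionary uses plain text like `== French ==`. If such a header is matched against a list of language names without being expanded first, the section surfaces as an unexpected top-level node. A toy check:

```python
import re

# Toy illustration: detect level-2 headers that still contain template
# syntax, i.e. that were not expanded before section recognition.
HEADER_RE = re.compile(r"^==\s*(.+?)\s*==\s*$", re.MULTILINE)

def unexpanded_headers(wikitext: str) -> list[str]:
    """Return level-2 header contents that still contain '{{' template syntax."""
    return [m.group(1) for m in HEADER_RE.finditer(wikitext) if "{{" in m.group(1)]

page = "== {{langue|fr}} ==\n... entry body ...\n"
print(unexpanded_headers(page))  # ['{{langue|fr}}']
```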
@Vuizur Did you make progress on your Russian Wiktionary parsing? @xxyzz, @kristian-clausal Is multilingual Wiktionary parsing an actual goal of the project? I am trying to work on the French Wiktionary, but even with the few JSON files to translate some resources, it seems that most of the logic is very tied to the English Wiktionary: expecting certain templates in certain places, hardcoding categories, etc. There is a …
Hmm, I haven't gotten it to work yet (but I also haven't had much time recently). I think wiktextract works pretty decently with the Chinese Wiktionary because it has taken a lot of templates from the English one, which makes the two comparatively similar. Other Wiktionaries, like the German or Russian ones, haven't done this as much (as far as I can tell), so there it is pretty hard to get things to work, because everything is different: the HTML layouts differ, and the data is probably also structured a bit differently. I guess it would be a real challenge to adapt the wiktextract code to work for all Wiktionaries.

So one probably has to write custom code for each Wiktionary, but of course it would be smartest to reuse the wiktextract code where it makes sense. (I don't know the code/details, but candidates would be, for example, the code that creates the forms array with grammar tags out of a Wiktionary table, or the code that creates the mappings between language name and language code.) And it would probably make a lot of sense to keep the JSON output consistent.

DBnary already parses a pretty significant number of different Wiktionary XML dumps to extract data from them. However, in comparison to wiktextract it has several disadvantages: as far as I can tell, it only parses German entries from the German Wiktionary (and Russian entries from the Russian one), for example, so it misses potentially useful data. And it doesn't expand templates, so you lose the table data/inflections. (And it uses RDF, which feels very complicated to me compared to wiktextract's JSON.)

So I think the best way to get high-quality data from each Wiktionary is either …
I haven't tried 1), but for 2), the difficulty varies with how logically the specific Wiktionary structures its HTML; I was generally really happy with the results compared to the time I invested. (But one could probably spend forever on some small fixes.)
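To make approach 2) concrete, here is a minimal sketch of HTML-side extraction with BeautifulSoup. The table class and cell layout below are made up for illustration; each Wiktionary (and each MediaWiki skin) lays its inflection tables out differently, which is exactly the fragility discussed here.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical rendered inflection table; real Wiktionary HTML differs
# per project and per skin.
html = """
<table class="inflection-table">
  <tr><th>Nominative</th><td>polku</td></tr>
  <tr><th>Genitive</th><td>polun</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
forms = []
for row in soup.select("table.inflection-table tr"):
    header, cell = row.find("th"), row.find("td")
    if header and cell:
        # Pair each form with a grammar tag, in the spirit of the
        # "forms array with grammar tags" mentioned above.
        forms.append({"form": cell.get_text(strip=True),
                      "tags": [header.get_text(strip=True).lower()]})

print(forms)
# [{'form': 'polku', 'tags': ['nominative']}, {'form': 'polun', 'tags': ['genitive']}]
```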
The current code can extract some basic data, like POS, definitions and example sentences, once those config JSON files are added for non-English dump files. The pronunciation and forms parsing code is mostly still hard-coded for the English Wiktionary. I think it's inevitable to write separate parsing code for each language; DBnary seems to use this approach. The downside of parsing HTML is that if the MediaWiki theme changes, the code also needs to change.
The long-long-long-term plan is to attempt to decouple wiktextract's core from the Wiktionary language versions. However, the more I see how different Wiktionaries can be from each other, the more it seems like a morass... The code in Wikitextprocessor should definitely be decoupled, and it mostly is. As it is, Tatu and I are trying to make at least en.wiktionary.org work, mainly because it's by far the most useful one to tackle. But even that is a moving target, because en.wiktionary is not static. In the meantime, if you want to make another Wiktionary work with wiktextract, you have to put a lot of work into it, much in the same way that we've put a lot of work into making just en.wiktionary function.
Thank you for your answers! It seems my best bet is to fork the code and try to do what I can for the French Wiktionary first, then maybe see if some parts can be merged back afterwards? At least for basic data, I think it would be good to have something working in the main wiktextract repo (currently, it fails at the very beginning, during the initial stage of recognizing languages, because the French Wiktionary uses templates inside the second-level header). As an aside, DBnary now seems to manage entries in languages other than the main one: http://kaiko.getalp.org/about-dbnary/eager-to-meet-the-exolexica/. I will also have a look at what is done there, but with Java + RDF, it is going to be more of a struggle ...
Having gained some insight from starting to parse the French Wiktionary, I want to add my thoughts to the discussion.

**General remarks**

This leads me to my next point.

**Organizing the code base**

If this repo wants to support the parsing of different Wiktionary projects, it would really benefit from clearly separating which parts deal with the general structure of a Wiktionary page and which parts rely on the project-internal conventions for each section. Otherwise, each contributor parsing a different Wiktionary project will choose a different access point (or, worse, get discouraged from trying at all), and the repo will become a huge mess of Wiktionary-project-specific code hidden behind flags. Of course, it's always an option to just let it grow organically and reorganize later.

**One format to serve them all?**

Additionally, different Wiktionary projects might be more detailed or more coarse in the information they provide in an organized manner. For example, the French Wiktionary uses in many cases the template … This is just one example. The bigger question here is to what extent divergence in output formats for parsing different Wiktionary projects is acceptable, and if so, how these differences can be made transparent.

**Final thought**

Cheers.
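One possible shape for that separation (purely a sketch of the idea, not the repo's actual architecture; all names here are hypothetical): keep the generic page walk in one place and register project-specific section handlers beside it.

```python
from typing import Callable

# Sketch: the generic core only dispatches to per-project handlers and
# knows nothing about any particular project's headings or templates.
SectionHandler = Callable[[str, str], None]  # (section title, section wikitext)
HANDLERS: dict[str, dict[str, SectionHandler]] = {}

def section_handler(wiki: str, section: str):
    """Register a handler for one section type of one Wiktionary project."""
    def register(func: SectionHandler) -> SectionHandler:
        HANDLERS.setdefault(wiki, {})[section] = func
        return func
    return register

@section_handler("fr", "etymology")
def parse_fr_etymology(title: str, body: str) -> None:
    ...  # fr.wiktionary-specific conventions isolated here

def process_section(wiki: str, section: str, title: str, body: str) -> None:
    handler = HANDLERS.get(wiki, {}).get(section)
    if handler is not None:
        handler(title, body)
```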
The only way to know what other Wiktionary projects need is to implement those Wiktionary projects and see what kind of output they 'should' generate. "english" could easily be renamed if we can figure out a good term for it (or we could use "french" when it's French), and just adding things to previously existing fields should not break things too much. I believe it is futile to try to standardize any sort of format at this point (it might even be futile at any point...), so the simplest thing is just to do whatever you need for French and then implement or unify stuff later. Similarly for the separate-code stuff: it might be simplest just to let you and xxyzz wrestle with the Chinese and French Wiktionaries, and see where you needed to put your …
Hi everyone,
I thought it would be really cool to have a guide for adding new other-language Wiktionaries. I was trying to do some work on the Russian Wiktionary and wrote down my current understanding.
Guide to adding new languages
1. Create the necessary language data in `wiktextract/data`:
   - `languages.json`, mapping language codes to language names. (For the English Wiktionary, this would map "en" to "English".) The generation of these files is handled in `get_languages.py`. Depending on the Wiktionary, the best way to do this varies: you either have to expand some templates or parse source code. (Simply look at the examples.) A sketch of how this mapping gets used follows after these steps.
   - `pos_subtitles`: the translated names of the parts of speech. (For the Russian Wiktionary this might be problematic, because the POS is at the beginning of a string containing all sorts of grammar information, so one would have to use a regex or similar.)
   - `linkage_subtitles.json`: contains the translated names of the synonym/antonym/... sections.
   - `other_subtitles.json`: has the translated names of the inflection/etymology sections (lower case).
   - `zh_pron_tags.json`: not sure what it does exactly, but the file has to exist and contain at least an empty dictionary `{}`.
   - `form_of_templates.json`: …
2. Run the program. For Russian:

```
wiktwords --all --all-languages --out data.json --dump-file-language-code <yourlangcode> ruwiktionary-20230101-pages-articles-multistream.xml
```
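As referenced in step 1, here is a minimal sketch of consuming `languages.json`. The file path and the exact value shape are assumptions for illustration; the only thing taken from the guide above is that the file maps language codes to language names.

```python
import json

# Assumed path and value shape, for illustration only.
with open("wiktextract/data/ru/languages.json", encoding="utf-8") as f:
    code_to_name = json.load(f)  # e.g. {"ru": "русский", ...}

# Build the reverse mapping (language name -> code) that section
# recognition would consult; a code may list several names.
name_to_code: dict[str, str] = {}
for code, names in code_to_name.items():
    for name in ([names] if isinstance(names, str) else names):
        name_to_code[name] = code

print(name_to_code.get("русский"))  # -> "ru"
```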
I am currently still a bit confused about the "compounds" key in `other_subtitles.json`. What section exactly does it refer to? I cannot seem to find it in the Russian Wiktionary. @xxyzz