Output from subextractors needs to be a bit more closely aligned with original output #448
Comments
I totally agree with this sentiment. However, I want to note that when I started working on other extractors, the consensus was that we shouldn't worry too much about enforcing a common output schema, since we didn't yet know the requirements of the different Wiktionary editions. So I took some liberties introducing new fields and might not have been overly attentive to keeping everything aligned (though I tried). Perhaps now, with the JSON schemas and the outputs of the different extractors available, it is a good time to revisit the issue and align the data structure as much as possible. I still want to advocate for allowing extractors to define new fields, since some editions contain unique information or allow for a cleaner and more granular extraction than others. For the extractors using pydantic, it could make sense to define base models that all extractors should/must inherit from. Good luck with your endeavors!
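To make the base-model idea concrete, here is a minimal sketch. It uses standard-library dataclasses as a stand-in for pydantic so that it runs without extra dependencies; the class names, the set of shared fields, and the extra `hyphenation` field are all assumptions made for illustration, not the project's actual models.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class BaseSense:
    # Hypothetical shared sense shape; field names are illustrative.
    glosses: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


@dataclass
class BaseWordEntry:
    # Assumed set of fields that every edition would be expected to emit.
    word: str
    lang: str
    lang_code: str
    pos: str
    senses: list[BaseSense] = field(default_factory=list)


@dataclass
class EditionWordEntry(BaseWordEntry):
    # Edition-specific extension: one hypothetical extra field with a default,
    # so the shared fields keep their names and order.
    hyphenation: str = ""


entry = EditionWordEntry(
    word="hundo",
    lang="Esperanto",
    lang_code="eo",
    pos="noun",
    senses=[BaseSense(glosses=["dog"])],
    hyphenation="hun-do",
)
record = asdict(entry)  # plain dict, ready for json.dumps
```

The shared fields live in one place, and an edition only declares what it adds. Whether unused parent fields should still show up in each edition's schema is a separate question that this sketch does not settle.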
I actually got our html generation code to generate a page from a sample of the Chinese data, and the only changes needed to make it work were:
Funnily enough, there was only one language in the Zh data sample I took (using
I have some concerns about model inheritance: if some fields in the parent are not used, they will still be included in the schema. And models that have another model in one of their fields have to be defined again, like
There are only 314 Esperanto pages in Chinese Wiktionary; it can't be a coincidence...
The number of "senses" in the output dictionary html is apparently 551, but I can't work out how this could have happened. The sample is shuffled with
I get 118675 word-entries in zh-extract.json for
The number 314 comes from the category https://zh.wiktionary.org/wiki/Category:%E4%B8%96%E7%95%8C%E8%AA%9E but our stuff includes the category https://zh.wiktionary.org/wiki/Category:%E4%B8%96%E7%95%8C%E8%AA%9E%E9%9D%9E%E8%A9%9E%E5%85%83%E5%BD%A2%E5%BC%8F which has 113k pages.
Issue with ru-wiktionary data:
ru-wiktionary extractor: fields such as "synonyms", "holonyms", "hyponyms" etc. should not be lists of plain strings: So the example should be This isn't something I can fix with a quick sed (well, I could try, but I bet it would be more trouble than it's worth), so testing ru-wiktionary will have to wait. On to others. EDIT:
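For the plain-string relation lists, a post-processing pass along these lines might serve as a stopgap. The example from the comment did not survive, so the target shape here (a list of objects with a `word` key) is an assumption, as are the function name and the exact list of relation fields.

```python
# Relation fields to normalise. The comment names "synonyms", "holonyms" and
# "hyponyms"; the remaining names are assumed for illustration.
RELATION_FIELDS = (
    "synonyms",
    "antonyms",
    "hypernyms",
    "hyponyms",
    "holonyms",
    "meronyms",
)


def normalize_relations(entry: dict) -> dict:
    """Return a shallow copy of entry where bare-string relation items
    are wrapped as {"word": ...} (assumed target shape)."""
    fixed = dict(entry)
    for name in RELATION_FIELDS:
        if name in fixed:
            fixed[name] = [
                {"word": item} if isinstance(item, str) else item
                for item in fixed[name]
            ]
    return fixed
```

Items that are already dicts pass through untouched, so running the function on output that is partly fixed should be harmless.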
de.wiktionary data: similarly to the Russian one (and probably the same in ru.wiktionary data too), EDIT:
es.wiktionary: senseids should be strings, not integers. EDIT:
es.wiktionary: "ipa" fields in "sounds" items (dicts in a list) should be a string, not a list of strings. EDIT:
I also came across translation data that didn't have a lang_name/lang field at all, but that's a bit more open to interpretation. As long as one of lang or lang_code is present, I guess it is minimally sufficient... But for readability, it would be good to have a "lang" field. I don't know what the best thing to do is in cases with "broken" or unusual/nonstandard language codes that could not be mapped properly (at the time the code ran), but this should be standardised: "lang": "UNKNOWN_LANG_CODE", or perhaps "lang": "lc", "lang": "", or leaving "lang" out are all possible.
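The two points in this comment could be sketched roughly as follows. Both helper names are hypothetical, and the splitting step assumes that each IPA string can stand as its own sound item, which may not hold for how the Spanish Wiktionary presents pronunciation data.

```python
def split_ipa_lists(sounds: list[dict]) -> list[dict]:
    """Emit one sound item per IPA string when "ipa" is a list.

    Assumption: the other keys of the item apply equally to every
    IPA string, so they are copied onto each resulting item.
    """
    result = []
    for sound in sounds:
        ipa = sound.get("ipa")
        if isinstance(ipa, list):
            if not ipa:
                # An empty list carries no information: drop only the key.
                result.append({k: v for k, v in sound.items() if k != "ipa"})
            for value in ipa:
                result.append({**sound, "ipa": value})
        else:
            result.append(sound)
    return result


def has_language(translation: dict) -> bool:
    """Minimal check: at least one of "lang" or "lang_code" is non-empty."""
    return bool(translation.get("lang") or translation.get("lang_code"))
```

For example, `split_ipa_lists([{"ipa": ["a", "b"]}])` yields two items, each with a string-valued `"ipa"`. Translations that fail `has_language` could be logged rather than silently dropped.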
I think both the French and Chinese extractors always have a "lang" field. The Chinese extractor might only have "lang" if
es.wiktionary: etymology_templates items should have an EDIT: Weirdly, this is now fixed..?
Please feel free to take a closer look at these yourselves. I won't address these right now since I suspect these are more than a quick fix. (At the very least it requires checking whether, indeed, the way the Spanish Wiktionary provides pronunciation data can be sensibly separated into sounds with unique ipa. It might just as well be an oversight by me, but without closer examination it's hard to tell.)
Tatu thinks that for
The final issue that prevented es-data from running on (almost) vanilla html-generation code is the structure of sound data:
{'sounds':
[{'phonetic_transcription': ['bɹaʊn'], 'audio': 'En-uk-brown.ogg', 'ogg_url': 'https://commons.wikimedia.org/wiki/Special:FilePath/En-uk-brown.ogg', 'mp3_url': 'https://upload.wikimedia.org/wikipedia/commons/transcoded/7/7b/En-uk-brown.ogg/En-uk-brown.ogg.mp3'},
{'phonetic_transcription': ['bɹaʊn'], 'audio': 'en-us-brown.ogg', 'ogg_url': 'https://commons.wikimedia.org/wiki/Special:FilePath/en-us-brown.ogg', 'mp3_url': 'https://upload.wikimedia.org/wikipedia/commons/transcoded/2/29/En-us-brown.ogg/En-us-brown.ogg.mp3'}
]
}
from https://es.wiktionary.org/wiki/brown
The html-generation code assumes "audio" and "mp3_url" are strings (which breaks when joining them with
Having consistent field name pluralization rules is really handy. There are a couple of nagging exceptions, like "derived", which doesn't have a plural form and should have been "derived_terms" instead, and there are possibly some mistakes or exceptions I don't know or forgot about. EDIT:
ru-wiktionary: in sense data (item in
the list is needed to show the 'hierarchy'. EDIT:
Besides that, I got ru-wiktionary to generate a site, except the glosses were missing because of the above. Good thing it was noticeable, otherwise I wouldn't have even noticed.
The last of the changes needed to make the kaikki HTML generation work with each json output is now done, and I've tested out all of the outputs (well, 10k json object samples of them, which should be enough). Next week, I'll start to work on actually implementing the different websites... The HTML-generating code needs to be made more edition-agnostic (i.e. links to wiktionary should be to "xx.wiktionary.org", not en.wiktionary.org, that sort of thing), and after that I need to tackle some bash scripting. 😢 Tatu said he'd hold my hand with that, so we'll see how it goes. If all goes well, we might soon see individual online dictionaries for each extractor, including error data and the json output mapping stuff. Have a good weekend, I have a hot, hot date with a bowl of soup. The weather has been jumping between +2 Celsius and -26, and now it's back to a balmy -10.
I think it is important for downstream usability of the data that the editions be as consistent as possible - same fields, same parts-of-speech, same tags - as much as possible. Yes, there are things in some editions that are not present in others. In these cases, we can define additional fields, tags, or even parts-of-speech - but this should only be done when the data cannot be reasonably described using existing mechanisms.
I noticed that the English translations have the key "code" for the items of the translation array, whereas other languages such as Spanish have the key "lang_code". (I think lang_code would maybe be more consistent, as it is also used for the language code of the entry.)
I didn't notice that the en edition uses "code" for translation data when I was writing the new extractor code, so I use the same
Tatu says: "Because the output from the original extractor is being used with other projects that rely on it being "stable", changes to it need to be minimized, while all the outputs from all the projects need to be as close as possible."
I've started trying to use our html-generation code to create websites from the data extracted with the other extractors, so I will just be posting issues here as they come along:
The first breaking difference is just that "lang_name" in word base data is different from "lang". Because "lang" is used in the original output and there is nothing wrong with it (not really, it is perfectly fine as is) the direction of change here is "lang_name" -> "lang". This change should be pretty simple.
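Until the extractors themselves emit "lang", the rename can be applied as a post-processing step. This is only a sketch: the function names are hypothetical, and it assumes the extractor output is one JSON object per line.

```python
import json
from typing import Iterable, Iterator


def rename_lang_name(entry: dict) -> dict:
    """Return a shallow copy of entry with "lang_name" renamed to "lang".

    An existing "lang" value is left alone rather than overwritten.
    """
    fixed = dict(entry)
    if "lang_name" in fixed and "lang" not in fixed:
        fixed["lang"] = fixed.pop("lang_name")
    return fixed


def convert_lines(lines: Iterable[str]) -> Iterator[str]:
    """Apply the rename to each JSON line (assumed one object per line)."""
    for line in lines:
        if line.strip():
            entry = rename_lang_name(json.loads(line))
            # ensure_ascii=False keeps non-Latin headwords readable.
            yield json.dumps(entry, ensure_ascii=False)
```

Feeding an open file object to `convert_lines` and writing each yielded string followed by a newline would convert a whole output file without loading it into memory.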
I will continue trying to make html generation work with the other extracted data and will post things here as they come along.