English translation of examples sometimes missing, included in the original text #604
Yeah, this is buggy, I'll take a look at it tomorrow.
Sorry, I got stuck trying to figure out a possible bug with our logging system, so this might have to wait a while.
No worries at all, I know how it is :)
Found part of the issue, and it's a silly one; the examples I looked at (ani and vzít) had a word that was blacklisted from our "what is an English word" set. Actually, looking at vzít, the example that is broken doesn't get through because it's two-thirds Czech names with one English word. I am writing this message as I'm going through examples, and hlavní is a completely different issue: the example isn't in a template that we accept as an 'example'. I'm going to commit these, and hopefully most if not all of the issues will be addressed. If you find that some weren't fixed, just point them out. Unless we add a bunch of Czech names (which is the simplest way), the examples that are mostly Czech names will keep being rejected.
Unfortunate that wiktionary doesn't tag the translation of examples with a language! If I understood correctly, this is why wiktextract has to implement these heuristics. Thanks for the fixes already, I'll have a look at the JSON within the next week!
I'm looking at the most recent JSON, and some problematic words like ani, čas and žít are fixed, while others aren't, e.g. hlavní, názor, osobní. I think many of the remaining ones share a cause (in case it helps, here's a pastebin of all of them).
Many of those are just that they contain words that are not in nltk.corpus.brown; words like "cellphone", "mousetrap", "He’s" (with a unicode apostrophe or other character), "dumbfuck", "peppermint"... Hrm, many of these could be fixed if we somehow could cheaply detect compound words. If anyone has an idea of how to do this super-cheaply ("peppermint" -> contains "pepper" and "mint" smooshed together)... Another category of problem is translations that are basically stuff like "common noun" or "adjective", phrases that will get classified as tags.
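One super-cheap approach along those lines would be to try every split point of an unknown word against the same known-word set. A hypothetical sketch (`KNOWN` here is a tiny stand-in for the real corpus vocabulary, e.g. the nltk.corpus.brown word set):

```python
# Hypothetical sketch: detect "smooshed together" compounds by trying every
# split point against the known-word set. KNOWN stands in for the real
# corpus vocabulary.
KNOWN = {"pepper", "mint", "mouse", "trap", "cell", "phone"}

def is_compound(word: str, vocab=KNOWN) -> bool:
    """True if word splits into two vocabulary words of 2+ letters each."""
    word = word.lower()
    return any(word[:i] in vocab and word[i:] in vocab
               for i in range(2, len(word) - 1))
```

This costs only O(len(word)) set lookups per unknown word, so it stays cheap even across a full dump.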
Can't we use template arguments or expanded HTML tags? The "ux" and "coi" templates put the translation text in the third argument; reading that argument should be easier and more reliable than checking whether words are in English.
I was thinking of that, yeah. We can check whether the arguments map onto the template's expanded output and exit early if they conform to the formatting of examples. The problem is that there might be some pitfalls with this approach, for example if example templates are used for other things, but in context it might be fine.
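As a rough illustration of the template-argument idea discussed above (the function name and the argument layout are assumptions for this sketch, not wiktextract's actual API):

```python
# Rough sketch, assuming template args arrive as a dict keyed by position or
# name, as in wikitext: "ux"/"coi" put the translation in the third
# positional argument, with "t"/"translation" as possible named aliases.
def translation_from_template(name: str, args: dict) -> "str | None":
    if name in ("ux", "uxi", "coi"):
        return args.get("3") or args.get("t") or args.get("translation")
    return None  # not a recognized example template; fall back to heuristics
```

The appeal is that this sidesteps language classification entirely for the common templates, and the heuristics only run as a fallback.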
EDIT: This is a post that was left unwritten earlier today, posting it here just for completeness. "cause of death" slipped through because decode_tags classified it as tags: "of death" is parsed as a tag (a space-including tag, so not in valid_tags), and "cause" is classified as a "topic" for some reason, and there's a small piece of boolean logic that says that if any of the parts count as tags, the whole string does. EDIT: This edgecase should be fixed with the template-arguments fix.
That last PR fixed most of the issues! Here's a pastebin of the remaining ones, from a JSON downloaded today: https://pastebin.com/hMyZBXnh

EDIT: If those remaining ones are due to issues on the wiktionary side, let me know how and I'll fix them one by one.
I'll take a look at these later, thanks for keeping your eye on the output!
Issue #604, Czech translations (continued)

In translations like

```
lví ucho ― Leonotis nepetifolia (literally, “lion's ear”)
```

the translation part starting with "Leonotis" has its classification returned as "taxonomic" due to the heuristics used in classify_desc(). I've been trying to kludge something better here, but for this specifically the right call is to change it so that if a description is classified as either "english" or "taxonomic", that counts as English. There is no meaningful distinction between the two when trying to figure out which part of an example is the translation. The heuristics could be better, which is what I tried to figure out, but this works fine for now...
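The change described above amounts to widening what counts as English after classification. A hypothetical sketch of that check (the helper is a stand-in, not wiktextract's real code):

```python
# Hypothetical sketch of the "taxonomic counts as English" change.
def counts_as_english(classification: str) -> bool:
    # Taxonomic names like "Leonotis nepetifolia" occur inside otherwise
    # English translations, so treat both labels as English here.
    return classification in ("english", "taxonomic")
```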
Sorry that this lapsed, I've kludged something small for:
These above examples should soon be fine, as soon as kaikki.org updates. The change made was to accept "taxonomic" text as "english". As for these:
There's too much non-English in these. Sometimes a couple of taxonomic names don't trigger the heuristics too much, but here we have 4/6 non-English words.
This breaks because of the extra long hyphen inside the quote. We end up with "škola" and "dílna lidskosti [...]".
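One way the dash problem above could be tamed (a sketch under assumptions, not the actual fix): split only on the horizontal bar U+2015 that Wiktionary examples use as the text/translation separator, and only once, so that ordinary dashes inside the quote are left alone.

```python
# Sketch: separate original text from translation on U+2015 ("―") only,
# splitting once so dashes inside the quote itself don't cause a bad split.
HORIZONTAL_BAR = "\u2015"

def split_example(line: str):
    if HORIZONTAL_BAR in line:
        original, translation = line.split(HORIZONTAL_BAR, 1)
        return original.strip(), translation.strip()
    return line.strip(), None  # no translation part found
```

This still mis-splits if the quote itself contains a U+2015, but it at least ignores en dashes (U+2013) and hyphens.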
Too little English.
Don't yet know what the problem here is. Suspect it might be punctuation. The rest seem to contain just rare words that the small corpus we use doesn't recognize. I'll add them to the dictionary manually.
The issue with esemeska was the use of a template argument. There was also an issue with "physical property", because "physical property" was classified as tags by classify_desc(): both of those words are actual tag data in some language or other. However, this was fixed by improving the same block of conditions that was also used in the esemeska kludge above, which had to do with template arguments. In this case, example templates have a third argument for translations, which is usually just unnamed (and referred to by its position). I'll check out some of the issues that are left (when there are too many long hyphens) and maybe the liška stuff.
I've committed some more fixes. The only ones left are the "does not look like English to the classify_desc heuristics" cases.
I've just spent HOURS trying to figure out why a regex isn't working with a specific line, and it turns out there seems to be a rendering bug (or generation bug?) with i + combining acute accent on my computer in all the programs I've tried. Suffice it to say, the issue here is that classify_desc uses a simple regex to gatekeep what text to let through to be checked for whether it's English or not, and that regex doesn't allow combining characters. Anyhow, this is more of a memo to me to explain to myself wtf is going on in my local branch when I get back from vacation.
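The confusion above is classic combining-character trouble: `i` followed by U+0301 renders identically to the precomposed `í` (U+00ED) but is a different code-point sequence, and a character-class regex matches only one of them. A small demonstration (the character class below is illustrative, not the actual one in classify_desc):

```python
import re
import unicodedata

decomposed = "vi\u0301"                              # "ví" as i + combining acute (NFD)
composed = unicodedata.normalize("NFC", decomposed)  # "ví" with precomposed U+00ED

simple_word = re.compile(r"^[a-zA-Z\u00c0-\u024f]+$")  # no combining marks allowed

assert decomposed != composed            # same rendering, different code points
assert simple_word.match(composed)       # precomposed form passes the class
assert not simple_word.match(decomposed) # U+0301 falls outside the class
```

Normalizing input to NFC before applying the regex (or adding the U+0300–U+036F combining range to the class) sidesteps the mismatch.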
Some examples in some Czech words have an issue where, instead of having both `text` and `english` in the example object, there's only `text`, containing a concatenation of both, e.g. "the original text ― the translated text". Some examples:
If it's a problem in the page markup, I can fix it there, but I didn't see what could cause it.