Large amount of substring ("str left") templates in etymology, possibly in relation to "lite" templates #611
Comments
I think the cause is the code at wiktextract/src/wiktextract/extractor/en/page.py, lines 3077 to 3083 in f4fd8c9
uses the |
seems to indicate Tatu meant to capture nested etymology templates and to ignore unwanted templates with the blacklist. In this case, I guess the culprit is |
I've added |
Here is the wikitext for Etymology 1 of bot#Old_Javanese:
Here is the parsed wikitext in the latest version (2024/05/01):
https://gist.github.com/myrriad/24429fe70924a39d27cfae7a692979a2
There is an excessive number of "str left" and "str right" templates, which repeatedly take substrings of strings (often extracting only one character at a time). The etymology_text appears fine. I suppose these are the affected templates: url
I detected these examples by sorting entries by number of etymology templates. Accordingly, here are all entries with >= 90 etymology templates. These templates empirically appear in conjunction with "-lite" templates.
For debugging purposes, here is a list of filtered entries with >100 etymology templates:
https://gist.github.com/myrriad/f676ea15c5e0da4022473f790d5432c9
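For anyone who wants to reproduce the detection step described above (sorting entries by their number of etymology templates), here is a minimal sketch. It assumes the wiktextract output is JSON Lines with one entry per line, and that each entry may carry `word`, `lang`, and an `etymology_templates` list whose items have a `name` key. The helper names (`iter_entries`, `template_counts`, `heavy_entries`) and the default threshold are my own, not part of wiktextract, and this is not necessarily how the gists above were produced.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed entry per non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def template_counts(entry: dict) -> Counter:
    """Count etymology templates by name; entries without the key count as empty."""
    return Counter(t.get("name", "") for t in entry.get("etymology_templates", []))


def heavy_entries(lines: Iterable[str], threshold: int = 90) -> list:
    """Return (total, lang, word, top-3 template names) for entries with
    at least `threshold` etymology templates, largest total first."""
    rows = []
    for entry in iter_entries(lines):
        counts = template_counts(entry)
        total = sum(counts.values())
        if total >= threshold:
            rows.append(
                (total, entry.get("lang", ""), entry.get("word", ""), counts.most_common(3))
            )
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows
```

To run it against a dump, pass an open file handle, for example `heavy_entries(open(path, encoding="utf-8"))`, where `path` is wherever your extracted data lives. The top-3 column makes it easy to see whether "str left" and "str right" dominate a given entry.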