All non-en extractors use a pydantic model; it's more reliable than adding data to a plain Python dictionary. We use this library to validate the extracted page data and to generate a JSON schema. The root model is `WordEntry`. How pages are parsed depends on the page layout; most editions start with language titles. The fr edition uses the title template parameter, other editions check the section title text. Unit tests are in the `tests` folder and have the edition language code in the file name, for example: `tests/test_fr_translation.py`. Tests usually check that the code can extract the expected data from some wikitext; a test should also use real or simplified wikitext, with the original page link in comments (or put it in the …). You have to use a bz2 dump file, don't use the latest 20240520 files (see the pinned discussion). Testing a single page can be done with the …
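To illustrate the idea, here is a minimal sketch of such a pydantic model. The field names (`word`, `lang_code`, `senses`, …) are simplified assumptions for illustration, not the project's actual `WordEntry` definition:

```python
from pydantic import BaseModel

class Sense(BaseModel):
    glosses: list[str] = []
    tags: list[str] = []

class WordEntry(BaseModel):
    """Root model: one entry per word/language/POS combination (illustrative)."""
    word: str
    lang_code: str
    lang: str
    pos: str = ""
    senses: list[Sense] = []

# Validation happens at construction time; bad data raises ValidationError.
entry = WordEntry(word="chien", lang_code="fr", lang="Français", pos="noun")
print(entry.model_dump_json(exclude_defaults=True))

# The same model can emit a JSON schema describing the extracted data.
schema = WordEntry.model_json_schema()
print(sorted(schema["properties"]))  # ['lang', 'lang_code', 'pos', 'senses', 'word']
```

This is the benefit over a plain dict: a typo'd key or a wrong type fails loudly at extraction time instead of silently producing malformed JSON.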
---
Thank you for your answer. Second round of questions:
---
First of all, I'm not a contributor. At the moment this is mainly for discussion with those who are.
This is a discussion about a possible roadmap for contributing in a new language. If it is deemed useful, maybe it could be moved into a wiki section for further reference. It could also be used as a template for future PRs.
I will try to place my questions at the bottom of this message. I apologize in advance for most of them being very basic.
Ideally this message will be updated with the contents of the discussion below. This first version is bound to be very defective, for I still know very little about the project. Feel free to correct every wrong piece of information you can find.
Since my target language is Greek, I will (maybe?) use that for the practical examples.
## Notes

## Roadmap

### Prelude
- Create a new directory `wiktextract/src/extractor/LANG_CODE`.
- Copy `page.py` and `model.py` (required for `WordEntry`) from another extractor (I went with the Spanish version).
- Adapt `parse_page()` and `parse_section()`. (NOTE: maybe only a simplified version of `model.py` could be used REF)

### Page layout
- Implement `parse_page()` in the corresponding `page.py` file. Examples: fr, es, de, ru.
- Write a `parse_section()` function that deals with the section structure.
- Write `test_LANG_CODE_page.py` (where LANG_CODE is your target language code) in `tests` for some rough validation.

### Page sections
- Extend `parse_section()` with the corresponding `.py` module. The relevant modules are: etymology, example, gloss, linkage, pronunciation and translation.
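To make the "one module per section" idea concrete, here is a rough sketch of how a `parse_section()` dispatcher might route section titles to such modules. The handler names, the Greek section titles, and the use of plain strings instead of a parsed wikitext tree are all my own illustrative assumptions, not actual extractor code:

```python
# Hypothetical dispatcher: maps a section title to a handler function.
# Real extractors receive a WikiNode tree from wikitextprocessor, not strings.

def extract_etymology(data: dict, text: str) -> None:
    data["etymology_text"] = text.strip()

def extract_gloss(data: dict, text: str) -> None:
    data.setdefault("senses", []).append({"glosses": [text.strip()]})

# Section titles as they might appear in the target edition (Greek here).
SECTION_HANDLERS = {
    "Ετυμολογία": extract_etymology,  # etymology
    "Ουσιαστικό": extract_gloss,      # noun (a POS section)
}

def parse_section(data: dict, title: str, text: str) -> None:
    handler = SECTION_HANDLERS.get(title)
    if handler is not None:
        handler(data, text)

entry: dict = {"word": "σκύλος"}
parse_section(entry, "Ετυμολογία", "από το αρχαίο...")
parse_section(entry, "Ουσιαστικό", "dog")
print(entry)
```

Each handler would live in its own module (etymology.py, gloss.py, …), which is presumably what keeps the per-section code independent.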
## Questions

### About `parse_page()`

- What should I do with the `WordEntry` object before even writing `parse_page()`? Is it supposed to work if I don't change anything, or should I get rid of it at the start?
- What should the `parse_page()` function look like? What unit tests should it be able to pass? Are only these three lines subject to change: