All non-en extractors use a pydantic model; it's more reliable than adding data to a plain Python dictionary. We use this library to validate the extracted page data and to generate a JSON schema. The root model is `WordEntry`. How pages are parsed depends on the page layout; most editions start with language titles. The fr edition uses the title template parameter, other editions check the section title text. Unit tests are in the `tests` folder and have the edition language code in the file name, for example: `tests/test_fr_translation.py`. Tests usually check that the code can extract the expected data from some wikitext; a test should also use real or simplified wikitext, with the original page link in comments (or put it in the …). You have to use a bz2 dump file, don't use the latest 20240520 files (see the pinned discussion). Testing a single page can be done with the …
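To illustrate the idea, here is a minimal sketch of such a pydantic model. The field names (`word`, `lang_code`, `senses`, …) are simplified assumptions for illustration, not the project's actual `WordEntry` definition:

```python
from pydantic import BaseModel

class Sense(BaseModel):
    glosses: list[str] = []
    tags: list[str] = []

class WordEntry(BaseModel):
    """Root model: one entry per word/language/POS combination (illustrative)."""
    word: str
    lang_code: str
    lang: str
    pos: str = ""
    senses: list[Sense] = []

# Validation happens at construction time; bad data raises ValidationError.
entry = WordEntry(word="chien", lang_code="fr", lang="Français", pos="noun")
print(entry.model_dump_json(exclude_defaults=True))

# The same model can emit a JSON schema describing the extracted data.
schema = WordEntry.model_json_schema()
print(sorted(schema["properties"]))  # ['lang', 'lang_code', 'pos', 'senses', 'word']
```

This is the benefit over a plain dict: a typo'd key or a wrong type fails loudly at extraction time instead of silently producing malformed JSON.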
---
Thank you for your answer. Second round of questions:
---
First of all, I'm not a contributor. At the moment this is mainly for discussion with those who are.
This is a discussion about a possible roadmap for contributing in a new language. If it is deemed useful, maybe it could be moved into a wiki section for further reference. It could also be used as a template for future PRs.
I will try to place my questions at the bottom of this message. I apologize in advance for most of them being very basic.
Ideally this message will be updated with the contents of the discussion below. This first version is bound to be very defective, for I still know very little about the project. Feel free to correct every wrong piece of information you can find.
Since my target language is Greek, I will (maybe?) use that for the practical examples.
## Notes

## Roadmap

### Prelude
- Create a new directory `wiktextract/src/extractor/LANG_CODE`.
- Copy `page.py` and `model.py` (required for `WordEntry`) from another extractor (I went with the Spanish version).
- Adapt `parse_page()` and `parse_section()`. (NOTE: maybe only a simplified version of `model.py` could be used REF)

### Page layout
- Implement `parse_page()` in the corresponding `page.py` file. Examples: fr, es, de, ru.
- Write a `parse_section()` function that deals with the section structure.
- Write `test_LANG_CODE_page.py` (where LANG_CODE is your target language code) in `tests` for some rough validation.

### Page sections
- Extend `parse_section()` with the corresponding `.py` module. The relevant modules are: etymology, example, gloss, linkage, pronunciation and translation.
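To make the "one module per section" idea concrete, here is a rough sketch of how a `parse_section()` dispatcher might route section titles to such modules. The handler names, the Greek section titles, and the use of plain strings instead of a parsed wikitext tree are all my own illustrative assumptions, not actual extractor code:

```python
# Hypothetical dispatcher: maps a section title to a handler function.
# Real extractors receive a WikiNode tree from wikitextprocessor, not strings.

def extract_etymology(data: dict, text: str) -> None:
    data["etymology_text"] = text.strip()

def extract_gloss(data: dict, text: str) -> None:
    data.setdefault("senses", []).append({"glosses": [text.strip()]})

# Section titles as they might appear in the target edition (Greek here).
SECTION_HANDLERS = {
    "Ετυμολογία": extract_etymology,  # etymology
    "Ουσιαστικό": extract_gloss,      # noun (a POS section)
}

def parse_section(data: dict, title: str, text: str) -> None:
    handler = SECTION_HANDLERS.get(title)
    if handler is not None:
        handler(data, text)

entry: dict = {"word": "σκύλος"}
parse_section(entry, "Ετυμολογία", "από το αρχαίο...")
parse_section(entry, "Ουσιαστικό", "dog")
print(entry)
```

Each handler would live in its own module (etymology.py, gloss.py, …), which is presumably what keeps the per-section code independent.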
## Questions

### About `parse_page()`

- What should I do with the `WordEntry` object before even writing `parse_page()`? Is it supposed to work if I don't change anything, or should I get rid of it at the start?
- What should the `parse_page()` function look like? What unit tests should it be able to pass? Are only these three lines subject to change: