FYI: non-English editions data now available on https://kaikki.org (currently de, fr, es, ru, zh, ja, pl) #534

tatuylonen · 2024-03-10T21:07:22Z

As you may have noticed, data from non-English editions is now available on kaikki.org (links on the front page).

The non-English edition extractions need more work, but please report problems as issues.

brishtibheja · 2024-05-17T13:07:37Z

Hey there, are you aware of kaikki-to-yomitan (KTY)? It uses the data extracted here to create dictionaries for Yomitan. Yomitan, a fork of Yomichan, lets language learners instantly look up the meanings of words. It also creates Anki cards for you, so you can just focus on studying. I contacted one of the people maintaining KTY to ask about the eng-jp versions, but was told the data isn't available on kaikki. It would benefit me greatly if that data could be extracted for KTY. More specifically, he said:

> yes, kaikki only supports 6 wikt editions currently

I'm sorry if this is not the right place to ask, I'm not a dev and only use Github for reporting issues to the apps I use. I am aware of there being a place called "discussions" and I tried to find it but in vain.

xxyzz · 2024-05-20T01:07:34Z

We currently don't have plans to extract the Japanese Wiktionary, but I think an en-to-ja dictionary could be created using the translation data from the English Wiktionary.

And the discussions feature for this repo is not enabled.
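As a rough illustration of the en-to-ja idea above, here is a minimal sketch that collects Japanese translations from an English-edition JSONL dump, one JSON object per line. The field names (`word`, `translations`, `senses`, and `lang_code` or `code` for the language code) are assumptions about the dump layout and should be checked against the actual data; the sample lines are made up for the example.

```python
import io
import json
from collections import defaultdict


def build_translation_dict(lines, target_code="ja"):
    """Map each headword to its translations in one target language."""
    result = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        word = entry.get("word")
        if not word:
            continue
        # Assumption: translations may sit at the entry level or under senses.
        translations = list(entry.get("translations", []))
        for sense in entry.get("senses", []):
            translations.extend(sense.get("translations", []))
        for tr in translations:
            # Assumption: the language code is stored as "lang_code" or "code".
            code = tr.get("lang_code") or tr.get("code")
            target = tr.get("word")
            if code == target_code and target and target not in result[word]:
                result[word].append(target)
    return dict(result)


# Made-up sample lines standing in for a real dump file.
sample = io.StringIO(
    '{"word": "cat", "pos": "noun", "translations": '
    '[{"lang_code": "ja", "word": "猫"}, {"lang_code": "fi", "word": "kissa"}]}\n'
    '{"word": "dog", "pos": "noun", "senses": '
    '[{"translations": [{"code": "ja", "word": "犬"}]}]}\n'
)
en_ja = build_translation_dict(sample)
print(en_ja)
```

For a real dump, pass an open file object instead of the `io.StringIO` sample; the function reads line by line, so the whole file never has to fit in memory.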

kristian-clausal · 2024-05-20T04:38:51Z

To extract a wiktionary project, we need someone who is able to interpret and understand the original Wiktionary articles (that is, they should know at least some Japanese), and then they'd need to do the same kind of work with it that @xxyzz has done with the others, which is definitely not trivial as he can attest.

daxida · 2024-05-27T08:03:03Z

Hello,

I'm considering contributing towards the Greek version. I understand that there are no current plans for other non-English languages, and I recognize that this would require a considerable amount of work, which I may not be able to fully commit to. Nonetheless, I would like to give it a try if that's acceptable.

I have only spent about an hour browsing the repository, but I couldn't find a clear roadmap or wiki outlining the necessary steps for this process. I'll wait for your response before diving in further. If it's feasible, I would appreciate some guidance on the direction I should take.

xxyzz · 2024-05-27T08:27:03Z

You could take a look at the code in the extractor folder. Every extractor starts from the parse_page() function in its page.py file. How the extractor code is written depends on the Wiktionary edition's wikitext page layout: https://el.wiktionary.org/wiki/Βικιλεξικό:Δομή_λημμάτων

Once you have figured out the general page layout, the next step is to create a file that contains the section title data, and then start extracting the data in each section.
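The two steps above can be sketched roughly as follows. This is a self-contained toy, not the project's code: the real extractors work on a parsed wikitext tree rather than regular expressions, and the `parse_page` signature, the section titles, and the table names here are all hypothetical. The Greek titles are plain-text placeholders; the actual el.wiktionary layout may use templates in its headings.

```python
import re

# Hypothetical section-title data: localized title -> canonical name.
POS_TITLES = {"Ουσιαστικό": "noun", "Ρήμα": "verb"}
OTHER_SECTIONS = {"Ετυμολογία": "etymology", "Μεταφράσεις": "translations"}

# Matches "== Title ==" style headings with balanced "=" runs.
HEADING_RE = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")


def parse_page(title, wikitext):
    """Split a page on headings and route each section by its title."""
    entries = []
    entry = None  # entry for the most recent part-of-speech section
    field = None  # key that body lines are currently collected under
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            section = match.group(2)
            if section in POS_TITLES:
                entry = {"word": title, "pos": POS_TITLES[section]}
                entries.append(entry)
                field = "glosses"
            else:
                # Unknown titles map to None, so their text is skipped.
                field = OTHER_SECTIONS.get(section)
        elif entry is not None and field and line.strip():
            # Drop list markers ("#", "*") and surrounding whitespace.
            entry.setdefault(field, []).append(line.lstrip("#* ").strip())
    return entries


page = """== Ουσιαστικό ==
# a small domesticated feline
=== Μεταφράσεις ===
* English: cat
=== Σημειώσεις ===
this section is not in the table and is ignored
"""
result = parse_page("γάτα", page)
print(result)
```

Keeping the titles in a separate table means that adding a section type is mostly a data change, which is why the section title file comes before the per-section extraction code.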

daxida · 2024-05-27T11:10:52Z

Thank you. This is a lot of new information for me to digest. As I hinted before, are there any plans to establish a roadmap for contributing to new languages? I'm having trouble figuring out what to do, in which order, and how to test the progress.

Also, I'm not sure if I should continue this conversation here or if I should start a new issue or discussion.

kristian-clausal · 2024-05-27T11:18:21Z

We don't have a roadmap, because the process so far has mostly been Tatu creating the original extractor for en.wiktionary, work that continued bit by bit for ages and still goes on, and now xxyzz creating the extractors for other languages. We should use his process as the basis of a roadmap, but that's been on the back burner.

Go ahead with a new discussion (now that we've got them, Tatu enabled them recently).

brishtibheja · 2024-05-27T11:19:37Z

> As I hinted before, are there any plans to establish a roadmap for contributing to new languages?

This would be really helpful for all of us. I also expect others would contribute to the project, including to the Japanese Wiktionary.

empiriker · 2024-05-27T12:27:22Z

You might also want to take a look at the commits and PRs where I laid the groundwork for the Spanish, German and Russian extractors. (Just filter by my user name.)

This at least could give you a good idea of where to start and how to break it down into small steps.

Take this advice with a grain of salt though. While the "steps" should still be actionable to some extent, the code from back then almost certainly is not. As I understand, @xxyzz has done quite some work to improve and align the different extractors since then.

daxida · 2024-05-27T12:32:09Z

I actually already started doing that :D

I'm slowly trying to consume enough commits to get a clearer idea of the process.

If you would be so kind as to chime in on the related discussion and give a bit of feedback, I (and I hope others in the same situation) would greatly appreciate it.

xxyzz · 2024-09-05T02:48:28Z

Japanese edition and Polish edition data are now available on kaikki.org.

xxyzz · 2024-11-04T08:55:21Z

Dutch and Korean editions have been added.

kristian-clausal · 2024-11-04T08:58:22Z

I forgot to update the frontpage again, I'll do that now...

ngoclt · 2024-11-04T16:41:31Z

Hi guys,
Thank you so much for your effort. Are there any plans to add Finnish?

kristian-clausal · 2024-11-05T05:51:20Z

@ngoclt at some point in the future. I'm currently slogging through Greek wiktionary, but after that I'll pick up the Finnish one. But I'm much, much slower at this than @xxyzz so it will take some time.

daxida · 2024-11-07T11:54:34Z

> @ngoclt at some point in the future. I'm currently slogging through Greek wiktionary, but after that I'll pick up the Finnish one. But I'm much, much slower at this than @xxyzz so it will take some time.

I'm sorry, I really intended to give it a try but I was (and still am) not familiar enough with the project. It was a very overwhelming experience. The fault is only mine.

I may consider contributing to the Greek version, but I am unsure where it is being developed. I see no branches here or on your page. Should I wait?

And thank you as well for supporting the Greek version. I wasn't aware that someone was doing work on that front.

kristian-clausal · 2024-11-07T11:56:53Z

I haven't made a branch yet because it's still bad. If you could take a look at it (and later the output on kaikki), that would be grand; I've tried to translate stuff the best I can.

daxida · 2024-11-07T13:49:30Z

Sure, let me know where I should look.

kristian-clausal · 2024-11-08T06:27:12Z

> Sure, let me know where I should look.

I will make a PR at some point, when I'm happy with what I've got. I'll try to remember to message you.

ngoclt · 2024-11-11T22:31:46Z

Thank you so much for your effort. Can you tell me where to start learning the project so that I can contribute?

xxyzz · 2024-11-12T00:57:28Z

First you need to learn wikitext. Then you could start reading any extractor's page.py file in the extractor folder, except the "en" edition code. The template folder could be used as a guide.

If you have any questions while reading the code please post them to GitHub discussions.

StefanVukovic99 mentioned this issue May 26, 2024

Utilize translations info yomidevs/kaikki-to-yomitan#46

Closed

StefanVukovic99 mentioned this issue May 31, 2024

ko-zh & zh-ko yomidevs/kaikki-to-yomitan#48

Closed

xxyzz pinned this issue Sep 27, 2024

xxyzz changed the title ~~FYI: non-English editions data now available on https://kaikki.org (currently de, fr, es, ru, zh)~~ FYI: non-English editions data now available on https://kaikki.org (currently de, fr, es, ru, zh, ja, pl) Sep 27, 2024
