Macedonian dataset #4
Comments
@stefan-it, hi. Recently we managed to disambiguate the 1984 Macedonian corpus on the level of morphosyntax, but it is still not officially published, as some improvements to the annotation are being made as we speak (and they are not progressing very fast). We were eager to start experimenting with this notoriously under-resourced language and also wanted to add at least basic support for it to our CLASSLA pipeline, so we performed a train:dev:test split of the preliminary data here on babushka-bench. As far as I know, the corpus still does not contain any NE annotations. We would like to add some for Macedonian in general (not sure whether the 1984 corpus is the best for this task), inter alia, to add Macedonian NER support to CLASSLA. If you happen to know people interested in the task, we have some decent annotation guidelines from other South Slavic languages and quite probably funding available as well. I would not mind hearing about your wider motivation for working on Macedonian, as we are eager to improve support for it on all levels. We will also do so in the MaCoCu project, which starts this June (crawling top-level domains of different South-Eastern-European countries, Turkey included, curating / selecting data, building pre-trained language models). Nice work with the dbmdz models btw, we use them primarily for processing German data. We recently published BERTić, if you happen to be in need of processing Croatian, Serbian etc. Nikola
Hi Nikola, thanks for your detailed answers! Sorry for my misunderstanding, the dataset of course has no NE annotations 😅 But talking about NE, you may have noticed that the recent spaCy version comes with (better) support for Macedonian, including a trained model for NER. The author sent me that dataset (see https://twitter.com/_inesmontani/status/1356280197746606099). They plan to release it publicly, so maybe it could also be integrated here for benchmarking. My colleague and I are working on Macedonian-focussed LMs, so we're primarily looking for datasets for our evaluations. E.g. WikiANN as a silver standard is ok, but better datasets are badly needed :) I just had a look at the BERTić model, the results are looking really good! Have you considered working on an ELECTRA model as well 🤔 For monolingual models, I could clearly see a performance boost (did a lot of ELECTRA pre-training for our DBMDZ models recently). However, I tried to train multilingual ELECTRA models (same languages as mBERT), but the performance was not really good, so I'm not sure if this would also be the case for 4 languages 🤔
Stefan, hi. Busy period. I just contacted Borijan. I will motivate him to publish the dataset if it is CC-BY-SA. No need for all the e-mail writing. :-) What textual data do you use for building the Macedonian LM? I have a ~320M-token crawl of the .mk domain (used for building these static embeddings https://www.clarin.si/repository/xmlui/handle/11356/1359), if you could benefit from that. For evaluating the model, I guess part-of-speech tagging on our dataset might be a suitable task? I would be very much interested in hearing details of what exactly you are doing for Macedonian, to coordinate efforts as much as possible. We can also switch to e-mail (nikola tod ljubesic ta jsi tod si). Regarding BERTić, these are officially four languages, but purely linguistically speaking, these are variants of a single pluricentric language. In other words, I actually did not do any multilingual pre-training via Electra. Good to know that your results were not that good, in case the need for multilingual training ever arises!
Hi @nljubesi ,
as far as I understand this commit message:
841c47d#diff-fd8b5fda8a45abe08c7b3247d4abb7b1395dd3bf6008738388f42ff052bef9fe
The Macedonian dataset comes from the 1984 Multext-east data, but I still have some questions 😅
Many thanks,
Stefan ❤️