ES-CT: translations and data update #777

matyaskopp · 2023-09-19T10:16:19Z

No description provided.

Data-main

sync with ParlaMint data branch

matyaskopp · 2023-09-20T12:38:57Z

@rjzevallos, should we use the listPerson that was pushed with a pull request for the whole corpus in the release?

rjzevallos · 2023-09-20T13:30:36Z

@matyaskopp, yes, you can use this new listPerson.

rjzevallos · 2023-09-21T07:39:41Z

@matyaskopp, I see that we have a lot of form and syntax warnings. How can we fix them?

TomazErjavec · 2023-09-21T11:46:57Z

Well, maybe @matyaskopp has a better idea, but in short, you should get a better parser, because the one you are using produces a great many illegal UD parses.
That said, it is probably too late now anyway for 3.1.

matyaskopp · 2023-09-21T12:36:22Z

The problem is that ES-CT feeds UDPipe input that is already tokenized and split into sentences, and UDPipe handles this kind of input quite badly.
If you leave tokenization to UDPipe, everything works, and a sentence never ends up with multiple roots. I guess that the sentence inside <s> in fact contains multiple non-overlapping trees, so some postprocessing can probably solve it.
It is too late to fix it now. Frankly, you have known about the problems with linguistic annotation for a few months:

ES-CT: linguistics annotations #639
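A minimal sketch for locating the affected sentences, assuming the parser output is available as CoNLL-U text (an assumption; the thread does not say which format the ES-CT pipeline inspects). The function name `count_roots` and the sample annotations are hypothetical and only illustrate the multiple-roots case:

```python
def count_roots(conllu_sentence: str) -> int:
    """Count the tokens whose HEAD column is 0 in one CoNLL-U sentence block."""
    roots = 0
    for line in conllu_sentence.splitlines():
        # Skip blank lines and comment lines such as "# sent_id = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 7:
            continue
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Column 7 (index 6) is HEAD; the value 0 marks a root.
        if cols[6] == "0":
            roots += 1
    return roots


# Illustrative rows only: one pre-segmented "sentence" that holds two trees.
rows = [
    ["1", "Hola", "hola", "INTJ", "_", "_", "0", "root", "_", "_"],
    ["2", "!", "!", "PUNCT", "_", "_", "1", "punct", "_", "_"],
    ["3", "Com", "com", "ADV", "_", "_", "4", "advmod", "_", "_"],
    ["4", "estàs", "estar", "VERB", "_", "_", "0", "root", "_", "_"],
    ["5", "?", "?", "PUNCT", "_", "_", "4", "punct", "_", "_"],
]
sample = "\n".join("\t".join(row) for row in rows)

print(count_roots(sample))  # prints 2
```

Any sentence for which this returns more than 1 is a candidate for the postprocessing mentioned above.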

This is probably the reason for the multiple roots:
Your sentences end only with . or ?, which I guess is not the complete list of sentence-final characters in Catalan.
https://github.com/IULATERM-TRL-UPF/ParlaMint_ES-CT/blob/e99e7bf9b7e43d2b30fd473e7ee2fe31540f8c86/src/util_freeling.py#L57
In your implementation, Hola! Com estàs? is one sentence, but UDPipe sees it as two sentences (= two roots).
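To illustrate the mismatch, here is a rough sketch of a splitter with a configurable set of sentence-final characters. It is not the code from `util_freeling.py`; the function `split_sentences` and the default terminator set `.?!…` are assumptions made for this example:

```python
import re


def split_sentences(text: str, terminators: str = ".?!…") -> list[str]:
    """Split text at whitespace that directly follows a terminator character."""
    # The lookbehind keeps the terminator attached to the sentence it closes.
    boundary = re.compile(r"(?<=[" + re.escape(terminators) + r"])\s+")
    return [part for part in boundary.split(text.strip()) if part]


# With only "." and "?" as terminators, the greeting stays inside one sentence.
print(split_sentences("Hola! Com estàs?", terminators=".?"))

# With "!" included, the split matches the two sentences that UDPipe sees.
print(split_sentences("Hola! Com estàs?"))
```

A purely character-based rule like this would also split after abbreviations that end in a period, so it is only meant to show the effect of the terminator list, not to replace a real segmenter.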

matyaskopp and others added 5 commits September 19, 2023 16:30

Merge pull request #766 from clarin-eric/data

643f902

Data-main

Merge pull request #5 from clarin-eric/data

3975b5e

sync with ParlaMint data branch

new taxonomies

6395407

Update ParlaMint-ES-CT-listPerson.xml

71d7b9e

Update ParlaMint-ES-CT-listPerson.xml

acc4238

matyaskopp mentioned this pull request Sep 20, 2023

Update - taxonomies ES-CT #780

Closed

matyaskopp merged commit 623f4e1 into clarin-eric:data Sep 20, 2023
8 checks passed

github-actions bot pushed a commit that referenced this pull request Sep 20, 2023

action: generating ParlaMint-[ES-CT] sample files with #777

27b82f0
