ES-CT: translations and data update #777

Merged
merged 5 commits into from
Sep 20, 2023

matyaskopp
Collaborator

No description provided.

@matyaskopp matyaskopp merged commit 623f4e1 into clarin-eric:data Sep 20, 2023
8 checks passed
@matyaskopp
Collaborator Author

@rjzevallos, should we use listPerson, which has been pushed with a pull request for the whole corpus in the release?

@rjzevallos
Contributor

rjzevallos commented Sep 20, 2023

@matyaskopp, yes, you can use this new listPerson.

@rjzevallos
Contributor

@matyaskopp, I see that we have a lot of form and syntax warnings. How can we fix them?

@TomazErjavec
Collaborator

Well, maybe @matyaskopp has a better idea, but, in short, you should get a better parser, because the one you are using produces a great many invalid UD parses.
That said, it is probably too late for 3.1 anyway.

@matyaskopp
Collaborator Author

The problem is that ES-CT feeds UDPipe pre-tokenized and pre-segmented (already sentence-split) input, and UDPipe handles this kind of input quite badly.
If you leave tokenization to UDPipe, everything works, and a sentence never ends up with multiple roots. I guess that the sentence inside <s> in fact contains multiple trees that do not overlap, so some postprocessing can probably solve it.
It is too late to fix it. Frankly, you have known about problems with linguistic annotation for a few months:

This is probably the reason for the multiple roots:
Your sentences end only with "." or "?", which I guess you take to be the complete list of sentence-final characters in Catalan.
https://github.com/IULATERM-TRL-UPF/ParlaMint_ES-CT/blob/e99e7bf9b7e43d2b30fd473e7ee2fe31540f8c86/src/util_freeling.py#L57
In your implementation, "Hola! Com estàs?" is one sentence, but UDPipe sees it as two sentences (= two roots).
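
To make the mismatch concrete, here is a minimal Python sketch, not the code from util_freeling.py. It assumes "!" should also count as sentence-final, and it assumes the parser output is in the standard 10-column CoNLL-U layout, where HEAD is column 7 and a root has HEAD 0. The names split_sentences and count_roots and the sample rows are hypothetical; the sample is hand-written for illustration and is not real UDPipe output.

```python
import re

# Assumption: ".", "?" and "!" are all sentence-final. A terminator run only
# closes a sentence when it is followed by whitespace or the end of the text,
# so "3.5" is not split. Abbreviations such as "Sr. Puig" would still be split.
SENTENCE_RE = re.compile(r"\S.*?[.?!]+(?=\s|$)|\S.*$", re.S)


def split_sentences(text: str) -> list[str]:
    """Split text into sentences on '.', '?' and '!'."""
    return SENTENCE_RE.findall(text)


def count_roots(conllu_sentence: str) -> int:
    """Count tokens with HEAD == 0 in one CoNLL-U sentence block."""
    roots = 0
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, head = cols[0], cols[6]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        if head == "0":
            roots += 1
    return roots


# Hand-written illustration of what a two-root parse of one <s> could look like.
ROWS = [
    ["1", "Hola", "hola", "INTJ", "_", "_", "0", "root", "_", "_"],
    ["2", "!", "!", "PUNCT", "_", "_", "1", "punct", "_", "_"],
    ["3", "Com", "com", "ADV", "_", "_", "4", "advmod", "_", "_"],
    ["4", "estàs", "estar", "VERB", "_", "_", "0", "root", "_", "_"],
    ["5", "?", "?", "PUNCT", "_", "_", "4", "punct", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

print(split_sentences("Hola! Com estàs?"))  # ['Hola!', 'Com estàs?']
print(count_roots(SAMPLE))                  # 2
```

A check like count_roots could flag the affected <s> elements before validation, while a splitter along the lines of split_sentences would keep the pre-segmented input aligned with what UDPipe treats as one sentence.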

3 participants