RO Feedback #626
Comments
Changed the capitalization of surnames with commit 51787f7. |
Sorted component files in commit be08d9a. |
Changed note type to |
Converted notes into more specific elements within segments with commit cc386af. |
Spaces around notes
You have removed the spaces around notes, which can cause trouble in tokenization: a note can end up inside a token (unexpected behaviour of my annotation script). <seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru?<vocal type="shouting"><desc>(Vociferără în partea dreaptă a sălii).</desc></vocal>Vă <!-- ... --> confuzie.</seg> should be: <seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru? <vocal type="shouting">
<desc>(Vociferără în partea dreaptă a sălii).</desc>
</vocal> Vă <!-- ... --> confuzie.</seg> |
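Such cases can be listed by looking at the text immediately before and after each inline note element. The following is only a rough sketch, not the check used by any ParlaMint tooling; the function name `notes_without_spaces` and the set of element names in `NOTE_TAGS` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
# Assumption: these are the inline note-like elements worth checking.
NOTE_TAGS = {TEI + t for t in ("note", "vocal", "kinesic", "incident", "gap")}


def notes_without_spaces(seg):
    """Return the tag names of note-like children of `seg` that touch adjacent text."""
    problems = []
    children = list(seg)
    for i, child in enumerate(children):
        if child.tag not in NOTE_TAGS:
            continue
        # Text before the note: seg.text for the first child,
        # otherwise the tail of the previous sibling.
        before = seg.text if i == 0 else children[i - 1].tail
        after = child.tail
        glued_before = bool(before) and not before[-1].isspace()
        glued_after = bool(after) and not after[0].isspace()
        if glued_before or glued_after:
            problems.append(child.tag.replace(TEI, ""))
    return problems
```

Running it over every `seg` in a component file would give a list of notes that still need surrounding spaces.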
Added spaces around notes with commit 79b08b1. |
Can you please provide an example? I ran |
Oh, sorry - your <teiCorpus xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en" xml:id="ParlaMint-RO"> This is the only corpus that has it. I implicitly expected that it has To search language context of java -cp /usr/share/java/saxon.jar net.sf.saxon.Query -xi:off \!method=adaptive -qs:'//*[name()="term" and ./ancestor::*[@xml:lang][1]/@xml:lang="ro"]' -s:ParlaMint-RO/ParlaMint-RO.xml
<term xmlns="http://www.tei-c.org/ns/1.0">Legislatură</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Unități geo-politice sau administrative</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Legislatură națională</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Organizație politică</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Camere</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Parlament bicameral</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Senat</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Camera deputaților</term> The majority language in teiCorpus is usually English, so you have it correctly according to the documentation:
but it is common to have the corpus language... @TomazErjavec Can English be preserved in |
Normalized Should resolve:
|
In practice I'd much rather not have an exception. So, |
Changed language of the |
Duplicate person
Every person should have one record in listPerson: <person xml:id="Augustin-Lucian-Bolcas">
<persName>
<forename>Lucian</forename>
<forename>Augustin</forename>
<surname>Bolcaș</surname>
</persName>
<sex value="M"/>
<affiliation ana="#RoParl.51" ref="#RoParl" role="member" from="2000-12-15" to="2004-11-30"/>
</person>
<person xml:id="Lucian-Augustin-Bolcas">
<persName>
<forename>Lucian</forename>
<forename>Augustin</forename>
<surname>Bolcaș</surname>
</persName>
<sex value="M"/>
<affiliation ana="#RoParl.51" ref="#RoParl" role="member" from="2000-12-15" to="2004-11-30"/>
<affiliation ana="#RoParl.52" ref="#RoParl" role="member" from="2004-12-19" to="2008-12-13"/>
</person>
|
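A possible way to find further candidates is to group the `person` records by their name parts, ignoring the order of forenames. This is a sketch under the assumption that identical name parts indicate the same person, which will not always hold (namesakes exist), so the output is a list to review by hand. The function name `duplicate_persons` is made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_persons(list_person):
    """Return groups of person ids that share the same forenames and surnames."""
    by_name = defaultdict(list)
    for person in list_person.iter(TEI + "person"):
        pers_name = person.find(TEI + "persName")
        if pers_name is None:
            continue
        # Sort the name parts so that forename order does not matter.
        forenames = tuple(sorted(f.text or "" for f in pers_name.findall(TEI + "forename")))
        surnames = tuple(sorted(s.text or "" for s in pers_name.findall(TEI + "surname")))
        by_name[(forenames, surnames)].append(person.get(XML_ID))
    return [ids for ids in by_name.values() if len(ids) > 1]
```

For the two records above it would report `Augustin-Lucian-Bolcas` and `Lucian-Augustin-Bolcas` as one group.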
As suggested by @TomazErjavec, added |
Fixed duplicate person with commit ac9a2bc. |
Included corpus timespan in |
Included corpus span in corpus subtitle with commit df3879b. |
As discussed in the meeting on April 12, we cannot provide the presence list in time for this version because this requires changes in the crawlers of the session transcripts. I will try to include this data into a future version of the corpus. |
Extended meeting elements with term and sitting information with commit 75affa9. |
@RePierre, you include unannotated files (TEI) in annotated (TEI.ana) root file: <xsi:include xmlns:xsi="http://www.w3.org/2001/XInclude" href="ParlaMint-RO_2015-09-29-id7560.xml"/> should be <xsi:include xmlns:xsi="http://www.w3.org/2001/XInclude" href="ParlaMint-RO_2015-09-29-id7560.ana.xml"/> |
Included proper component files in commit 90da93b. |
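A small check along these lines could catch the same mistake in the future. It is a sketch that assumes the component files are included as direct children of the root element, while includes nested in the header (taxonomies and similar) are left alone; `non_ana_includes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def non_ana_includes(root):
    """Return hrefs of top-level XInclude elements that do not end in .ana.xml."""
    return [
        inc.get("href")
        for inc in root.findall(XI_INCLUDE)  # direct children of the root only
        if not (inc.get("href") or "").endswith(".ana.xml")
    ]
```

An empty result for the TEI.ana root file means every component include points at an annotated file.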
@RePierre, thanks for the progress. I have spotted an issue in the TEI.ana version of the files: wrongly placed notes. TEI:
<seg xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8">Cred <!--
...
--> salariile. <vocal type="noise"><desc>(Aplauze.)</desc></vocal> Însă<!--
...
--> toţi.</seg> TEI.ana: <seg xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8"><vocal type="noise"><desc>(Aplauze.)</desc></vocal> Însă<!--
...
-->toţi.<s xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8.1">
<w xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8.1.1" lemma="Cred" pos="Vmip1s" msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin">Cred</w>
<!--... -->
</s>
<!--... -->
</seg>
|
Unrecognized full-paragraph note
<seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg8">Mulţumesc.</seg>
<seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg9">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</seg>
</u> should be: <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg8">Mulţumesc.</seg>
</u>
<note type="narrative">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</note> Other occurrences in sample data:
|
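Candidates of this kind can be searched for with a simple heuristic: a segment with no child elements whose whole text is one parenthesised remark. This is a sketch only; the heuristic and the name `note_like_segs` are assumptions, and real speech that happens to be fully parenthesised would be flagged as well.

```python
import re
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Assumption: a segment that is nothing but one "(...)" remark is a transcriber note.
FULL_PAREN = re.compile(r"^\([^()]*\)$")


def note_like_segs(root):
    """Return ids of seg elements whose entire content is a parenthesised remark."""
    ids = []
    for seg in root.iter(TEI + "seg"):
        if len(seg) > 0:
            # Segments with child elements (e.g. already marked-up notes) are skipped.
            continue
        if FULL_PAREN.match((seg.text or "").strip()):
            ids.append(seg.get(XML_ID))
    return ids
```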
U+0096 (SPA) Unicode Character
This character is allowed in ParlaMint, but it causes problems in linguistic annotation, so I suggest removing it from the text: https://github.com/romanian-parlamint/ParlaMint/blob/a510c149ba04407fe6df77414b3a2aaec6f47022/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.xml#L148 <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5">După <!--
...
--> urgie � 1940. Dar n-a fost să fie aşa.</seg> <w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.29" lemma="�" pos="Ncm--n" msd="UPosTag=NOUN|Definite=Ind|Gender=Masc">�</w> |
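U+0096 belongs to the C1 control block (U+0080 to U+009F), so it may be worth checking for the whole block at once. The sketch below shows one way to locate and drop such characters from a text string; the function names are invented for this example, and replacing the character with a space before collapsing whitespace is just one possible choice.

```python
import re

# C1 control characters U+0080..U+009F; U+0096 (SPA) is one of them.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def find_c1_controls(text):
    """Return (offset, code point label) for every C1 control character in `text`."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in C1_CONTROLS.finditer(text)]


def strip_c1_controls(text):
    """Replace C1 control characters with a space, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", C1_CONTROLS.sub(" ", text)).strip()
```

Byte 0x96 is an en dash in Windows-1252, so if the source pages were in that encoding (a guess, not verified), mapping the character to an en dash instead of removing it might preserve the original punctuation.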
Named entities
I guess you are using a model that labels not only named entities from PER/LOC/ORG/MISC set but also DATE and probably other labels. Something like this: https://huggingface.co/dumitrescustefan/bert-base-romanian-ner <name type="MISC">
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.23" lemma="acel" pos="Dd3msr---e" msd="UPosTag=DET|Case=Acc,Nom|Gender=Masc|Number=Sing|Person=3|Position=Prenom|PronType=Dem">acel</w>
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.24" lemma="an" pos="Ncms-n" msd="UPosTag=NOUN|Definite=Ind|Gender=Masc|Number=Sing">an</w>
</name> or <name type="MISC">
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.30" lemma="1940" pos="Mc-s-d" msd="UPosTag=">1940</w>
</name> The year 1940 is not a proper name, so it shouldn't be surrounded by
We are under time pressure, so I suggest using option (1) for ParlaMint3.0, and you can possibly improve it in ParlaMint3.1 (create RO special taxonomy, use proper elements and add |
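To get a feeling for how many such annotations there are, one could list the `name` elements whose tokens are all digits, as in the 1940 example. This is a sketch for exploration only; `numeric_only_names` is a hypothetical helper, and it says nothing about how the elements should be re-encoded.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def numeric_only_names(root):
    """Return (type, text) for name elements that contain only numeric tokens."""
    flagged = []
    for name in root.iter(TEI + "name"):
        tokens = [w.text or "" for w in name.iter(TEI + "w")]
        # A name made up solely of digit tokens is unlikely to be a proper name.
        if tokens and all(t.isdigit() for t in tokens):
            flagged.append((name.get("type"), " ".join(tokens)))
    return flagged
```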
shifted NEs?
In this paragraph (ParlaMint-RO_2000-10-24-id4980.u2.seg8.2), NEs seem to be shifted. <s xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg8.2">
atitudinea autorităţilor ucrainene faţă de delegaţiile judeţului Suceava şi
<name type="MISC">Botoşani</name>
, la festivitatea dezvelirii
<name type="LOC">statuii</name>
lui
<name type="LOC">Eminescu</name>
, la Cernăuţi, în ziua de 15 iunie
<name type="LOC">2000</name>
; constrângerile
<name type="MISC">aduse în şcolile româneşti;</name>
coborârea unicului steag românesc de
<name type="MISC">pe</name>
clădirea sediului
<name type="LOC">redacţiei ziarului"</name>
Lumea"
<name type="MISC">;</name>
prezenţa la
<name type="MISC">manifestările româneşti a unor</name>
reprezentanţi gălăgioşi ai organizaţiilor
<name type="MISC">extremiste</name>
ucrainene; oprirea tinerilor etnici români,
<name type="MISC">în</name>
număr de
<name type="PER">200, de</name>
a veni la studii
<name type="MISC">în</name>
România, cu burse din partea statului
<name type="LOC">român</name>
şi altele.
</s> |
Voci din sală: in utterance
<note type="speaker">Domnul Vasile Lupu:</note>
<u ana="#chair" who="#Vasile-Lupu" xml:id="ParlaMint-RO_2000-10-24-id4980.u37">
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg1">Să vedem cine îl face. <vocal type="murmuring"><desc>(Rumoare în partea stângă a sălii)</desc></vocal> </seg>
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg2">Dar, iată, se pare că nu s-a terminat şedinţa Biroului permanent.</seg>
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg3">Voci din sală:</seg>
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg4">S-a terminat de mult!</seg>
</u> should be: <note type="speaker">Domnul Vasile Lupu:</note>
<u ana="#chair" who="#Vasile-Lupu" xml:id="ParlaMint-RO_2000-10-24-id4980.u37">
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg1">Să vedem cine îl face. <vocal type="murmuring"><desc>(Rumoare în partea stângă a sălii)</desc></vocal> </seg>
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg2">Dar, iată, se pare că nu s-a terminat şedinţa Biroului permanent.</seg>
</u>
<note type="speaker">Voci din sală:</note>
<!-- no who attribute, ana is regular - expecting MP interrupting -->
<u ana="#regular" xml:id="ParlaMint-RO_2000-10-24-id4980.u38">
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u38.seg1">S-a terminat de mult!</seg>
</u> |
person - affiliation - organization
I guess you are aware of this. I just wanted it to be recorded.
|
Fixed with commit 6662ec4. |
Removed in commit 69a116e. |
strange UPosTag
|
No
|
As RO won't be a part of 3.1, moving this to "future" milestone. |
meeting element (#parla.term, #parla.sitting)
I haven't found any information about terms or sittings in the meeting elements. This is how other corpora implement it:
ParlaMint/Data/ParlaMint-UA/ParlaMint-UA_2014-12-02-m0.xml
Lines 11 to 13 in 197e5ec
I was not able to find term info on Romanian parliament websites - I believe the information is there.
And if a single file contains one sitting, then add sitting identification.
Missing speech content
In some files there is no speech content:
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-09-04-id4959.xml
but the source contains speech contents:
https://www.cdep.ro/pls/steno/steno2015.stenograma?ids=4959&idl=1#S0
Chairman note type: narrative or president
According to the doc, narrative or president fits better in this case: https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-09-04-id4959.xml#L125
not recognized notes
Notes are in italics in the source, so they are easy to recognize...
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L474
should be: (https://clarin-eric.github.io/ParlaMint/#TEI.vocal)
presence list
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L510-L513
corpus timespan (bibl, setting)
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L72
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L252
setting element
The root file setting element should correspond to the component ones (missing country): https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L249-L253
vs:
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L97-L101
capitalize surname
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L384
should be
sort component files
The component files should be ordered according to the contents' date.
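Since the component file names carry the date (for example `ParlaMint-RO_2000-04-14-id4927.xml`), the order of the include list can be checked from the names alone. The sketch below assumes that naming pattern and compares the ISO dates as strings; `file_date` and `out_of_order` are made-up names.

```python
import re

# Assumption: the date follows the first underscore in the file name, as YYYY-MM-DD.
DATE_IN_NAME = re.compile(r"_(\d{4}-\d{2}-\d{2})")


def file_date(href):
    """Return the YYYY-MM-DD date found in a component file name, or None."""
    match = DATE_IN_NAME.search(href)
    return match.group(1) if match else None


def out_of_order(hrefs):
    """Return (previous, current) pairs where `current` is dated before `previous`."""
    dated = [(file_date(h), h) for h in hrefs if file_date(h)]
    # ISO dates sort correctly as plain strings.
    return [(prev[1], cur[1]) for prev, cur in zip(dated, dated[1:]) if cur[0] < prev[0]]
```

An empty result means the files are listed in non-decreasing date order.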
taxonomies
xml:lang="ro"