This repository has been archived by the owner on May 8, 2024. It is now read-only.

Protocol sections to <div> #391

Merged
merged 13 commits into from
Oct 26, 2023

Conversation

BobBorges
Collaborator

Here I use existing code (scripts/split_into_sections.py) to divide up the unicameral period protocols into sections (based on the § character), delimited by <div> elements.
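A minimal sketch of the idea, not the contents of scripts/split_into_sections.py: it regroups the children of the existing <div> elements so that a new <div> starts at every <note> that opens with a § number. The regular expression, the function name, and the namespace-free tags are assumptions made for illustration; the real protocols are namespaced TEI files.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a section starts at a <note> whose text opens with "§ 12", "12 §" or "3§".
SECTION_START = re.compile(r"^\s*(§\s*\d+|\d+\s*§)")


def split_into_sections(body):
    """Regroup the children of every <div> under `body` into one <div> per § section."""
    # Collect all content in document order, then drop the old wrapper divs.
    children = [child for div in list(body) for child in list(div)]
    for div in list(body):
        body.remove(div)

    current = ET.SubElement(body, "div")
    for child in children:
        starts_section = child.tag == "note" and SECTION_START.match(child.text or "")
        # Open a new div at each § note, unless the current div is still empty.
        if starts_section and len(current) > 0:
            current = ET.SubElement(body, "div")
        current.append(child)
    return body
```

Everything before the first § note ends up in its own leading <div>, which matches the untyped first section discussed later in this thread.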

I also sneak in a script (scripts/git-add_diff-sample.py) which should work in tandem with @ninpnin's sample-git-diffs, in order to quickly git add the files that were sampled from the diff.
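As a rough illustration of what such a helper might do (this is not the actual scripts/git-add_diff-sample.py, and the output format of sample-git-diffs is assumed from the sample posted further down in this PR): pull the protocol paths out of the sample text and pass them to a single `git add`. The names `sampled_paths` and `git_add` are hypothetical.

```python
import re
import subprocess

# Assumption: each sampled file is named on its own line as a path under corpus/protocols/.
PATH_LINE = re.compile(r"^(corpus/protocols/\S+\.xml)[ \t]*$", re.MULTILINE)


def sampled_paths(sample_text):
    """Return the unique protocol paths named in a diff sample, in order of appearance."""
    return list(dict.fromkeys(PATH_LINE.findall(sample_text)))


def git_add(paths, dry_run=True):
    """Build one `git add` command for the sampled files and run it unless dry_run is set."""
    command = ["git", "add", "--", *paths]
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

The dry-run default only returns the command, so the list of files can be inspected before anything is staged.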

Sample for quality assessment to follow.


@BobBorges BobBorges requested review from ninpnin and MansMeg October 18, 2023 14:06
organized diff-sampling stuff into subdir
@MansMeg
Collaborator

MansMeg commented Oct 18, 2023

The unit tests are failing?

@BobBorges
Collaborator Author

It's the schema test. Some of the 202122 protocols are empty. I found it just before I went home, so I'm not really sure what the cause is yet.

@MansMeg
Collaborator

MansMeg commented Oct 18, 2023

Seems like it captures page divs:
corpus/protocols/201617/prot-201617--53.xml
This should be easy to fix, I think.

@MansMeg
Collaborator

MansMeg commented Oct 18, 2023

Also, commentSection does not really make sense semantically. I would go with debateSection and otherSection for now.

Edit: I saw this is the standard in ParlaMint, but it hurts my eyes. I would create our own sections here anyway, simply because I think we will want more elaborate sectioning further down the line.

@MansMeg
Collaborator

MansMeg commented Oct 18, 2023

Also. ParlaMint states that the first note after should be a header, so maybe add that as well?

@ninpnin
Collaborator

ninpnin commented Oct 19, 2023

ParlaMint is the more restrictive version of the two, a strict subset of ParlaClarin. I think we should use it as a suggestion.

In practice: sometimes the header is not available in our data, so I think we shouldn't put too much effort into following that rule.

@BobBorges
Collaborator Author

I think we should decide on a preliminary idea of how to adjust the divs now and I can implement it before we commit changes to the whole unicameral period. My thoughts:

  • first <div> element under <body> should probably not be tagged as a debate section
  • debate section divs have a type attrib with debate_ as a general value, and we can specify further as we go, e.g., debate_interpellationDebate and debate_interpellationQuestion
  • commentSection should probably be other or something generic for the time being to signal !debate

@BobBorges
Collaborator Author

I just talked it over with @ninpnin -- we'll leave the commentSection/debateSection for now. It's easy enough to change later. ParlaClarin specifies a subtype attribute, so that solves my main issue about classifying types of debates.

I see one check mark on an incorrect <div> -- who should check the rest so we can get on with this?

@MansMeg
Collaborator

MansMeg commented Oct 19, 2023

Fair enough. Long term we probably want this information in tables anyways. Hence we should add IDs to the div tags just as we have for the notes and utterances.

I suggest we just use uuid there as well.

@BobBorges
Collaborator Author

That's reasonable -- do you want to check the divs are correct enough first? I think it's a short script to add an id to the div tags -- we have a uuid generator function in the pyriksdagen module.
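A short sketch of what adding the ids could look like. The `new_id` helper below is only a stand-in built on the standard library's `uuid` module; it is not the generator from the pyriksdagen module, and the real ID format may differ. Tags are written without the TEI namespace for brevity.

```python
import uuid
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def new_id():
    # Stand-in for the project's own generator; the real ID format may differ.
    return "i-" + uuid.uuid4().hex


def add_div_ids(body):
    """Give every <div> under `body` an xml:id unless it already has one."""
    added = 0
    for div in body.iter("div"):
        if XML_ID not in div.attrib:
            div.set(XML_ID, new_id())
            added += 1
    return added
```

Skipping divs that already carry an id keeps the script safe to rerun, since existing ids are never regenerated.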

@BobBorges
Collaborator Author

the unit test fails because of a couple protocols in 2021/22 with no body. They're on the riksdag open data, will fix this in a separate PR.
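To locate the affected files, a check along these lines might help. It is a sketch only: it treats a protocol as empty when it has no <body> or when the <body> contains no text, which is an assumed definition, and it ignores the TEI namespace.

```python
import xml.etree.ElementTree as ET


def has_empty_body(root):
    """True if the document has no <body>, or a <body> without any non-whitespace text."""
    body = root.find(".//body")
    if body is None:
        return True
    return not any(text.strip() for text in body.itertext())
```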

@MansMeg
Collaborator

MansMeg commented Oct 19, 2023

Having thought about this a little longer: if we remove type from the tags later, we actually change the API. So we should try to avoid that and fix this right away. I also think MetaSolutions was quite clear that the data should just include IDs to simplify linking and adding metadata.

Hence, we should do this right away. I don't think it's much work. This would mean:

  1. Create a CSV file (called record_divisors.csv?) with columns div_id and type. I'm not sure which folder we should store it in.
  2. Add an id to every div
  3. Move the "type" attribute to the CSV file

Does this make sense?
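For illustration, steps 1 and 3 could look roughly like this, assuming step 2 has already given every div an xml:id. The function name is hypothetical, the column names follow the proposal above, and the tags are written without the TEI namespace.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def move_types_to_csv(body, out):
    """Write (div_id, type) rows for all typed <div>s and drop `type` from the XML."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["div_id", "type"])
    for div in body.iter("div"):
        div_id = div.get(XML_ID)
        if div_id is None or "type" not in div.attrib:
            continue  # divs without an id keep their type; ids must be added first
        writer.writerow([div_id, div.attrib.pop("type")])
```

Passing an `io.StringIO()` as `out` makes it easy to inspect the table before deciding where the file should live.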

@ninpnin
Collaborator

ninpnin commented Oct 20, 2023

I think this is a fundamentally different approach than what we have done so far.

So far, we have had a lot of annotations in the XML files. That's what ParlaClarin is for. Otherwise we would use tabular data, e.g. CSVs, for text too.

My current gut feeling is that our current approach works better with git.

Either way, I don't think we should add a new CSV now. Either we continue with our current approach, or change to a tabular structure later after more planning.

@MansMeg
Collaborator

MansMeg commented Oct 20, 2023

That is true. I think we have some conflicting best practices here: ParlaClarin as a format, and MetaSolutions' recommendations on using ids and linked data.

I agree with MetaSolutions long term, but you are right. Let's keep this as small as possible. We still need to add an id to all elements anyway, since we are going to need to take samples of sections.

@MansMeg
Collaborator

MansMeg commented Oct 20, 2023

the unit test fails because of a couple protocols in 2021/22 with no body. They're on the riksdag open data, will fix this in a separate PR.

I'm hesitant to merge a PR that doesn't pass the tests, so we should try to fix that ASAP.

@BobBorges
Collaborator Author

Here comes a new sample with id attribs in the div and the 'empty' protocols in the 202122 year curated. Let's hope the unit tests pass :D

@BobBorges
Collaborator Author

BobBorges commented Oct 25, 2023

Sampled changes

corpus/protocols/1972/prot-1972--24.xml

Diff starting from line 3172

@@ -3150,6 +3172,8 @@
           <note xml:id="i-HHDhpAANZJrmYUDqzPhmCK">
             Denna anhållan bordlades.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-X33R3qea3RbqeNGwrLd1mh">
           <note xml:id="i-KzgEJ9DWhqzRBZbWpYnuwj">
             § 12 Anmäldes och bordlades Kungl. Maj:ts propositioner:
           </note>
  • Correct
  • Incorrect

corpus/protocols/1973/prot-1973--120.xml

Diff starting from line 65

@@ -65,7 +65,7 @@
         </div>
       </front>
       <body>
-        <div>
+        <div type="commentSection" xml:id="i-7Hg1De5Po567941hoEN5Eb">
           <pb facs="https://betalab.kb.se/prot-1973--120/prot_1973__120-000.jp2/_view"/>
           <note xml:id="i-6MnUiXYLfspTq7tqvktbyu">
             Riksdagens protokoll
  • Correct
  • Incorrect

corpus/protocols/197879/prot-197879--79.xml

Diff starting from line 62

@@ -62,7 +62,7 @@
         </div>
       </front>
       <body>
-        <div>
+        <div type="commentSection" xml:id="i-WhnS8hWbaziWUxiyVsjRu">
           <pb facs="https://betalab.kb.se/prot-197879--79/prot_197879__79-000.jp2/_view"/>
           <note xml:id="i-PGiGmFUjFqxeozFogDjSPY">
             Riksdagens protokoll
  • Correct
  • Incorrect

corpus/protocols/197879/prot-197879--90.xml

Diff starting from line 3430

@@ -3400,18 +3430,26 @@
             betänkande 1978/79:14 Jordbruksutskottets betänkande 1978/79:17
             Näringsutskottets betänkanden 1978/79:19-21
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-GDV2NxHAuiYua1uC4LgXNa">
           <note xml:id="i-BUGefzQkzxYYkaXkrsZx3X">
             § 19 Föredrogs och bifölls Interpellationsframställning 1978/79:149
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-QYro5141WoAx8z7uYXVSpa">
           <note xml:id="i-WQtoWEzf5SCNFmFSB7tkmg">
             § 20 Talmannen meddelade att på föredragningslistan för morgondagens
             sammanträde skulle finansutskottets betänkande nr 20 och skatteutskottets
             betänkande nr 29 uppföras främst bland två gånger bordlagda ärenden.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-NbesPokTdQ24SVVR7fSbwU">
           <note xml:id="i-UoJDppyqfyMhSS23nZhNgg">
             § 21 Anmäldes och bordlades Proposition 1978/79:89 om lokalhyra
           </note>
           <pb facs="https://betalab.kb.se/prot-197879--90/prot_197879__90-041.jp2/_view"/>
+        </div>
+        <div type="debateSection" xml:id="i-G7rCj1kjqCHQDawYn99ejp">
           <note xml:id="i-N97S5BHKP6kAXoaYbEo9ea">
             § 22 Anmälan av interpellation
           </note>
  • Correct
  • Incorrect

corpus/protocols/197980/prot-197980--41.xml

Diff starting from line 5419

@@ -5389,6 +5419,8 @@
               av Lysekilsbanan kan genomföras utan dröjsmål?
             </seg>
           </u>
+        </div>
+        <div type="commentSection" xml:id="i-V3hXj8qo6S2Pzzt3L2yZpA">
           <note xml:id="i-9YeCE5sNnJRu34tVFknkgb">
             § 17 Kammaren åtskildes kl. 15.01. In fidem
           </note>
  • Correct
  • Incorrect

corpus/protocols/197980/prot-197980--56.xml

Diff starting from line 8043

@@ -8023,6 +8043,8 @@
           <note xml:id="i-K4E5su54KdtTu5pacFps41">
             Mom. 2-7 Kammaren biföll vad utskottet i dessa moment hemställt.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-UNjYVpvV3BHw9yDy8yT9Dg">
           <note xml:id="i-8EHo2t5WvyzppWZyb8gyLb">
             § 12 Invandrarundervisning m. m.
           </note>
  • Correct
  • Incorrect

corpus/protocols/198182/prot-198182--31.xml

Diff starting from line 64

@@ -64,7 +64,7 @@
         </div>
       </front>
       <body>
-        <div>
+        <div type="commentSection" xml:id="i-YKJQ9t6vq2g91Se14ztFjn">
           <pb facs="https://betalab.kb.se/prot-198182--31/prot_198182__31-000.jp2/_view"/>
           <note xml:id="i-5fKzSMmN1WerpNSyAhwcXV">
             Riksdagens protokoll
  • Correct
  • Incorrect

corpus/protocols/198283/prot-198283--111.xml

Diff starting from line 353

@@ -353,6 +353,8 @@
           <note xml:id="i-PnUNJn84bxmRD9K6GUAbZf">
             suppleant i utbildningsutskottet Sonia Thomasson (vpk)
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-S1pessFPXLM6rWYzEH1QjS">
           <note xml:id="i-PBCqXtv8qLCcE4P18gCuVF">
             3§ Talmannen meddelade att Ingemar Konradsson (s) denna dag återtagit
             sin plats i riksdagen, varigenom Ulla-Britt Carlssons uppdrag
  • Correct
  • Incorrect

corpus/protocols/198384/prot-198384--100.xml

Diff starting from line 3183

@@ -3175,15 +3183,21 @@
           <note xml:id="i-9YZ5JzboDwZ4z63771xjK3">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-BsCQM62oikNZ3ioKPCXuVk">
           <note xml:id="i-DbjQrUu8GsGVNnAbuZjLbi">
             11 § På förslag av talmannen beslöt kammaren kl. 11.10 att ajournera
             sina förhandlingar till kl. 14.00, då de till dagens bordläggning
             anmälda utskottsbetänkandena väntades föreligga.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-BdgMumfDWQE9NA242JctvF">
           <note xml:id="i-BGwYLbyW36NMRKdpzngTAC">
             12 § Förhandlingarna återupptogs kl. 14.00 under ledning av förste
             vice talmannen.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-NDrQvVL2ZQjE8AcX4cttKZ">
           <note xml:id="i-7upkPfaSsBkcRFxuFV6S8a">
             13 § Anmäldes och bordlades Proposition 1983/84:128 Förslag till
             lag om företagshypotek m. m.
  • Correct
  • Incorrect

corpus/protocols/198384/prot-198384--155.xml

Diff starting from line 3523

@@ -3519,6 +3523,8 @@
           <note xml:id="i-FSkMitL3nfGSkXpQC11GRs">
             Övriga moment Utskottets hemställan bifölls.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-CFNt3WaMCeUbAjPuAUL5o2">
           <note xml:id="i-A4tx6KGBJRvjLq9Hk5NkQ9">
             5 §&amp; Arbetsmiljöfrågor, m. m.
           </note>
  • Correct
  • Incorrect

corpus/protocols/198586/prot-198586--110.xml

Diff starting from line 819

@@ -817,6 +819,8 @@
           <note xml:id="i-HVqcDkZHt748vre4KMfoYj">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-Vgx1LYaDsrZqDRSozjEjtA">
           <note xml:id="i-Vsy74coahMz5bjoq4eZm48">
             3 § Svar på interpellation 1985/86:146 om åtgärder mot radioaktiva
             utsläpp från engelsk upparbetningsanläggning
  • Correct
  • Incorrect

corpus/protocols/198687/prot-198687--73.xml

Diff starting from line 366

@@ -366,6 +366,8 @@
           <note xml:id="i-7TqiRkjb2yAuoPYePo7jTv">
             18 Justerades protokollet för den 9 innevarande månad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-Cc9TJgYWfzZsaGybgG1n5W">
           <note xml:id="i-AmpTwAcF3WPs17Dhk2iYVZ">
             2 § Svar på interpellation 1986/87:96 om åtgärder för att förenkla
             och effektivisera socialförsäkringen
  • Correct
  • Incorrect

corpus/protocols/199091/prot-199091--78.xml

Diff starting from line 59

@@ -59,13 +59,15 @@
         </div>
       </front>
       <body>
-        <div>
+        <div type="commentSection" xml:id="i-Fkv3bwf9PRu2aakA4TSb2n">
           <note xml:id="i-H8Se86iLdeznpxoeEpkSnk">
             1 § Justering av protokoll
           </note>
           <note xml:id="i-Ab5VjJsL9Tzm9HS2Cmrk8y">
             Justerades protokollet för den 8 mars.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-R3U6NZvuQwJefao1SVMMji">
           <note xml:id="i-rTHHpwPubrfwA13NDtzyZ">
             2 § Bordläggning
           </note>
  • Correct
  • Incorrect

corpus/protocols/199192/prot-199192--121.xml

Diff starting from line 10376

@@ -10328,6 +10376,8 @@
             Kammaren beslöt att ärendebehandlingen skulle fortsättas vid
             arbetsplenum måndagen den 1 juni.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-WhuAoTJaCdoaqvPzkqS64Z">
           <note xml:id="i-QhYGThrMHD1PiKTY1Rfbi7">
             26 § Bordläggning
           </note>
  • Correct
  • Incorrect

corpus/protocols/199293/prot-199293--71.xml

Diff starting from line 6983

@@ -6939,6 +6983,8 @@
           <note xml:id="i-23akbUZ546t1WCuf495VAR">
             1992/93:AU7, AU9 och AU15
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-4QcG6NWqhcZT5x5JbpkprZ">
           <note xml:id="i-AFZCMQVmoGJBNbdzg9Exyu">
             24 § Bordläggning
           </note>
  • Correct
  • Incorrect

corpus/protocols/199394/prot-199394--124.xml

Diff starting from line 10470

@@ -10456,6 +10470,8 @@
           <note xml:id="i-6PvWsiN7QtC2TzTu8BVf8V">
             Förhandlingarna återupptogs kl. 15.00.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-34i1ky7Vu8UVS6qhcYpwRQ">
           <note xml:id="i-Xf9w3uZ9NXgfNKVLbXEFHp">
             9 § Avsägelse
           </note>
  • Correct
  • Incorrect

corpus/protocols/199495/prot-199495--40.xml

Diff starting from line 518

@@ -504,6 +518,8 @@
           <note xml:id="i-Y7rzZ3AgF24sFWMunRbJqS">
             (Beslut skulle fattas den 14 december.)
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-LMooFNh7qWfKgy2ZXrbKj3">
           <note xml:id="i-EW8VcGYMQSdCvE1iB7TWzZ">
             8 § Oskäliga avtalsvillkor m.m.
           </note>
  • Correct
  • Incorrect

corpus/protocols/199495/prot-199495--76.xml

Diff starting from line 69

@@ -69,6 +69,8 @@
               ________________________________________________________________________
             </seg>
           </u>
+        </div>
+        <div type="commentSection" xml:id="i-3guoHd4Xv1BTuGxzwzcJK8">
           <note xml:id="i-DtVyPH8Joqh7Nki1jYXfPp">
             1 § Avsägelse
           </note>
  • Correct
  • Incorrect

corpus/protocols/199899/prot-199899--17.xml

Diff starting from line 4845

@@ -4825,6 +4845,8 @@
             Interpellationerna redovisas i bilaga som fogas till riksdagens
             snabbprotokoll tisdagen den 24 november.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-DhjpCNZzuFzXNSkuENgUJ3">
           <note xml:id="i-TJo1UeQVntvrUtBpFVTXJZ">
             11 § Anmälan om fråga för skriftligt svar
           </note>
  • Correct
  • Incorrect

corpus/protocols/199899/prot-199899--38.xml

Diff starting from line 336

@@ -320,6 +336,8 @@
             AU1 samt näringsutskottets betänkanden NU1, NU2 och NU3 skulle
             avgöras i ett sammanhang efter avslutad debatt.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-3F8a7BPWWUveovCAFDHV9G">
           <note xml:id="i-VURidbF1UszbSTjSQCsGf6">
             9 § Ekonomisk trygghet vid arbetslöshet samt arbetsmarknad och
             arbetsliv
  • Correct
  • Incorrect

corpus/protocols/19992000/prot-19992000--112.xml

Diff starting from line 3395

@@ -3379,6 +3395,8 @@
           <note xml:id="i-9u95apYeYffGQ4b6dy4Tx2">
             (Beslut fattades under 11 §.)
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-ETQD59HnRUTg3rFxjeuGda">
           <note xml:id="i-PWc4tbZvVhN9ySwwtvfgtV">
             9 § Tillträde till internationella instrument mot penningförfalskning
           </note>
  • Correct
  • Incorrect

corpus/protocols/200001/prot-200001--35.xml

Diff starting from line 188

@@ -180,6 +188,8 @@
           <note xml:id="i-NXmeydYpGAxgtrgtxNxWGo">
             Ingegerd Wärnersson
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-kajQ6vJPDaaWnAuVDJ2mF">
           <note xml:id="i-QWTK52XQPGju38N82HDtuS">
             5 § Svar på interpellation 2000/01:97 om verksamheten vid Lunds
             universitets historiska museum
  • Correct
  • Incorrect

corpus/protocols/200001/prot-200001--56.xml

Diff starting from line 15540

@@ -15490,6 +15540,8 @@
           <note xml:id="i-R8atM2aknihnLFwqQbxcyn">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-KKR4LSNNrvaH2uD441gxzR">
           <note xml:id="i-QeJXhHMkjXxLrtGf9BSpCd">
             26 § Svar på interpellation 2000/01:188 om tomträtter
           </note>
  • Correct
  • Incorrect

corpus/protocols/200001/prot-200001--64.xml

Diff starting from line 59

@@ -59,7 +59,7 @@
         </div>
       </front>
       <body>
-        <div>
+        <div type="commentSection" xml:id="i-Up1rjAzqTpCVu3qiGxjv5w">
           <pb facs="http://data.riksdagen.se/fil/EAEC16F1-80A8-4F8B-AAC0-1C8AE4993D01#page=1"/>
           <note xml:id="i-T2gmCNM3HbFu4pDN2mUgFP">
             Det justerade protokollet beräknas utkomma om 3 veckor
  • Correct
  • Incorrect

corpus/protocols/200102/prot-200102--65.xml

Diff starting from line 70

@@ -70,12 +70,16 @@
           <note xml:id="i-K8LC1ZsLfRNnC4XHiXkvKP">
             -------------------------------------------------------------------
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-WWSVeFS8UUeZxUawjjzX4z">
           <note xml:id="i-PHZWq5QuLYvuQUWAARAvJ5">
             1 § Justering av protokoll
           </note>
           <note xml:id="i-YLarRBtfkqisttE3Xm3QLh">
             Justerades protokollet för den 1 februari.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-XggoQqtkUGi3Mb2DdxJX5T">
           <note xml:id="i-TtPsCZTzEyDsSRKGgV28nc">
             2 § Meddelande om utrikespolitisk debatt
           </note>
  • Correct
  • Incorrect

corpus/protocols/200102/prot-200102--79.xml

Diff starting from line 4816

@@ -4798,6 +4816,8 @@
           <note xml:id="i-D2WnFNisRbsJ1fv78yRRpb">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-LVVXvDFV784HLHitman5w1">
           <note xml:id="i-721gyy9MBKdF8MueZJeuyS">
             10 § Svar på interpellation 2001/02:243 om
           </note>
  • Correct
  • Incorrect

corpus/protocols/200304/prot-200304--25.xml

Diff starting from line 9060

@@ -9018,6 +9060,8 @@
               tisdagen den 18 november.
             </seg>
           </u>
+        </div>
+        <div type="commentSection" xml:id="i-LDfA3KtpJomEeCCu9ZhQ6u">
           <note xml:id="i-4amEiRs2H6QD57J1pBz3tV">
             22 § Kammaren åtskildes kl. 21.51.
           </note>
  • Correct
  • Incorrect

corpus/protocols/200405/prot-200405--101.xml

Diff starting from line 1840

@@ -1828,6 +1840,8 @@
           <note xml:id="i-WfUyr6bDix8pQydRBvYZnc">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-AfTQ1twiztLp9zz5FWX2xC">
           <note xml:id="i-3rc7DmGdCMBNkKsQzR6Moo">
             7 § Kommunal demokrati och kompetens
           </note>
  • Correct
  • Incorrect

corpus/protocols/200405/prot-200405--49.xml

Diff starting from line 12097

@@ -12081,6 +12097,8 @@
           <note xml:id="i-WPQNZ4ZexSQjgLEEmFjrKz">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-2wFWqZkSPkaRicGQUS4QkA">
           <note xml:id="i-7Exh5dsrheozxMigKJXvWH">
             9 § Jord- och skogsbruk, fiske med anslutande näringar
           </note>
  • Correct
  • Incorrect

corpus/protocols/200607/prot-200607--105.xml

Diff starting from line 6734

@@ -6704,6 +6734,8 @@
               tisdagen den 15 maj.
             </seg>
           </u>
+        </div>
+        <div type="commentSection" xml:id="i-7XoJcB6wGnkbkZnvzWbZKS">
           <note xml:id="i-E7Md5uEf3X81TWwqgJcmDs">
             16 § Kammaren åtskildes kl. 13.37.
           </note>
  • Correct
  • Incorrect

corpus/protocols/200607/prot-200607--111.xml

Diff starting from line 8544

@@ -8514,6 +8544,8 @@
           <note xml:id="i-DJApmBGuyE7ZCZvgJpKzXq">
             Förste vice talmannen konstaterade att ingen talare var anmäld.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-AcokcFEjeByJPJ5v8MqBM1">
           <note xml:id="i-WWx9WS9xnkDb7MA6Usg6Jq">
             16 § Avskaffande av åldersgräns
           </note>
  • Correct
  • Incorrect

corpus/protocols/200708/prot-200708--112.xml

Diff starting from line 11085

@@ -11061,6 +11085,8 @@
           <note xml:id="i-5MtwaUpEnbdqsHgpMgfiWA">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-5yLTXSMvfTtW3QKsgJ2xE9">
           <note xml:id="i-Pc8RzFZwcmC7gf6GuvGYTy">
             13 § Ny instansordning för arbetsmiljöärenden
           </note>
  • Correct
  • Incorrect

corpus/protocols/200708/prot-200708--138.xml

Diff starting from line 538

@@ -530,6 +538,8 @@
           <note xml:id="i-EBL7QFRPax7LF4C5dLpdD2">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-27dDYET3WdPSjHCEj5wq8W">
           <note xml:id="i-Vny6bvLYnMyxjjBur38FNC">
             5 § Svar på interpellation 2007/08:837 om kommunernas ekonomi
           </note>
  • Correct
  • Incorrect

corpus/protocols/200809/prot-200809--46.xml

Diff starting from line 561

@@ -545,6 +561,8 @@
           <note xml:id="i-PizfC89y4Rb3dBCzerUGoH">
             Punkterna 37
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-3cJLnXtTKXxgthXAvtP9et">
           <note xml:id="i-BG5doLJXVjx69Mq5s9PmTf">
             9 § Beslut om ärenden som slutdebatterats den 8 december
           </note>
  • Correct
  • Incorrect

corpus/protocols/200910/prot-200910--11.xml

Diff starting from line 239

@@ -225,6 +239,8 @@
           <note xml:id="i-EYKHweT77ozmbLM9zB1ZRr">
             Anmäldes och bordlades
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-Qe4pDnCh4V6Et9tfDMbjHt">
           <note xml:id="i-HrBXgaTLy5sdWanYYzqpmJ">
             8 § Anmälan om interpellationer
           </note>
  • Correct
  • Incorrect

corpus/protocols/200910/prot-200910--145.xml

Diff starting from line 14647

@@ -14601,6 +14647,8 @@
           <note xml:id="i-Ef7v3QnAjmwK3YPfyo8VjZ">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-8oP5aKFZKeWhhwcqaG6ZsT">
           <note xml:id="i-Brxy4sUCXTSQckPAnmZZSK">
             24 § Svar på interpellation 2009/10:451 om en allmän och solidarisk
             a-kassa
  • Correct
  • Incorrect

corpus/protocols/201213/prot-201213--110.xml

Diff starting from line 2565

@@ -2547,6 +2565,8 @@
           <note xml:id="i-2dFGfFENRyQ6MWqrsy7X91">
             Förhandlingarna återupptogs kl. 14.00.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-YSpr6xrU1VxqY1v3b4WudA">
           <note xml:id="i-WypqQ9kBen4z8vFrur8ntj">
             10 § Statsministerns frågestund
           </note>
  • Correct
  • Incorrect

corpus/protocols/201314/prot-201314--106.xml

Diff starting from line 9972

@@ -9936,6 +9972,8 @@
           <note xml:id="i-3ZF5gNg6DqSkKLoAZJxXWq">
             Hans Hoff
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-NtkHTW2aNp1shs4gWNa1Xe">
           <note xml:id="i-RZmLCJhytDFJw8X2PxTMN">
             19 § Anmälan om skriftliga svar på frågor
           </note>
  • Correct
  • Incorrect

corpus/protocols/201314/prot-201314--92.xml

Diff starting from line 1773

@@ -1759,6 +1773,8 @@
           <note xml:id="i-3hzU8zPGcz1a6wyV3CXm4U">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-9xbf4UDfdTmJGeWN1QcHwN">
           <note xml:id="i-A8vb8aPdq5xaRWjGShVR8f">
             8 § Svar på interpellation 2013/14:265 om nedsättningen av arbetsgivaravgiften
             för unga
  • Correct
  • Incorrect

corpus/protocols/201415/prot-201415--121.xml

Diff starting from line 6277

@@ -6221,51 +6277,83 @@
           <note xml:id="i-HYL11c4ALwi7G75cDWejwC">
             Innehållsförteckning
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-LF6ocLKPvDi2aGX9xim9m1">
           <note xml:id="i-BENKRLk9gwSs8AfMNGX1FZ">
             § 1 Justering av protokoll
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-C3AeoTJtidce8KFFhihRE6">
           <note xml:id="i-CyvwfEQE78ZviPMQBDnUJy">
             § 2 Anmälan om interpellationer
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-H7WqmxQ1Rx4Rue67FjRGFB">
           <note xml:id="i-qwqsUrvwJ43QJtE3tGHZ7">
             § 3 Anmälan om skriftliga frågor och svar
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-17NvSH25pNM9mtfwF6wPyE">
           <note xml:id="i-5L8gCSf1Xvu1Vb1oByo2Jz">
             § 4 Anmälan om ny riksdagsledamot
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-Cuhqodpmrr5niztLQwofHm">
           <note xml:id="i-KHjec4Capk2BqQs3yxgvfi">
             § 5 Anmälan om återtagande av plats i riksdagen
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-Ft99LcFwpf2HuKzgG5fFDu">
           <note xml:id="i-22p2sBTuZ4Hd9TLRZktw1t">
             § 6 Avsägelser
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-TKV2MKo6JrMzT3tR3JP96o">
           <note xml:id="i-K4NXZNhjqedhLT2mgspPYb">
             § 7 Anmälan om ersättare
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-37TeapwriAtaYsUehZSTaF">
           <note xml:id="i-pjaWvS52ctLq5NctfmAyj">
             § 8 Anmälan om ersättare för statsråd
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-SXudBx9J8gZeiRQxAwAboC">
           <note xml:id="i-TCZELaX43MvEdJr9MeM7ZQ">
             § 9 Anmälan om ersättare för talman
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-AGTuEeZLnF9p6AKLFNYmx7">
           <note xml:id="i-CGRLxVnpW4fHNcQRKciDct">
             § 10 Anmälan om kompletteringsval
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-XbHjeFUgBGmAeSP9R78yqU">
           <note xml:id="i-9RbJy9TKZEqKnirLvvSQat">
             § 11 Anmälan om ny ledamot i Europaparlamentet
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-H5CkXN1RYHysvhbC6XCt4s">
           <note xml:id="i-16z4Yxx2H5ufGG7x8pxVkH">
             § 12 Anmälan om fördröjda svar på interpellationer
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-SiPoBr8ZPyL4sSnNG4Yybz">
           <note xml:id="i-MqQLhzqUdZQShQcqj4L4U1">
             § 13 Anmälan om faktapromemorior
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-8tFw1bSgPtUBXxKRj6kfzL">
           <note xml:id="i-HqAj21UMDMwhq3nyv7wfBU">
             § 14 Anmälan om granskningsrapporter
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-NCDvCP3GQ5xEbC7goubgA9">
           <note xml:id="i-L9H5AtAJ1pNtv6HE8Xp4wV">
             § 15 Anmälan och omedelbar hänvisning av ärenden till utskott
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-XRMzZeUZqBvbSsCGZn8dbc">
           <note xml:id="i-Sjmy9DsGJ9KmfAZivikbrh">
             § 16 Svar på interpellation 2014/15:629 om Öresundssamarbete
           </note>
  • Correct
  • Incorrect

corpus/protocols/201516/prot-201516--102.xml

Diff starting from line 8679

@@ -8595,6 +8679,8 @@
           <note xml:id="i-Lg4UUSYZEaM7zEpP2DMg34" type="speaker">
             Anf. 80 Utbildningsminister GUSTAV FRIDOLIN (MP)
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-5ZQs9Uyf91eac6x2jvDbrP">
           <note xml:id="i-FtDtadi4A1CgxwPy42nFz4">
             § 17 Svar på interpellation 2015/16:597 om digitala verktyg till
             nyanlända elever
  • Correct
  • Incorrect

corpus/protocols/201516/prot-201516--118.xml

Diff starting from line 7422

@@ -7400,6 +7422,8 @@
           <note xml:id="i-8ZWcNpAkEtWdvnMdExgNR2">
             (Beslut skulle fattas den 15 juni.)
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-LuTkQsGxeFj6DaaX6y3SEB">
           <note xml:id="i-TRLzdicc649cdM9zHBbU6U">
             § 4 Övergångsstyre och utjämning vid ändrad kommun- och landstingsindelning
           </note>
  • Correct
  • Incorrect

corpus/protocols/201617/prot-201617--132.xml

Diff starting from line 5507

@@ -5413,6 +5507,8 @@
           <note xml:id="i-DsfKwnirY3ayry8TLiQZLk" type="speaker">
             Anf. 70 Statsrådet ANNA EKSTRÖM (S)
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-WHuUJvLk4MBvaTWbKZDEXz">
           <note xml:id="i-NuKiNa3oRbRx1UwtC5JFyq">
             § 22 Svar på interpellation 2016/17:567 om psykisk ohälsa i gymnasiet
           </note>
  • Correct
  • Incorrect

corpus/protocols/201617/prot-201617--26.xml

Diff starting from line 2246

@@ -2230,6 +2246,8 @@
           <note xml:id="i-KN2TqXyaGbYBWfTMo8dmtb">
             Överläggningen var härmed avslutad.
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-UTP5wAXm5cv23Ug2tC7mQW">
           <note xml:id="i-4WbDPFnVBDUWCixCXgUJ93">
             § 9 Svar på interpellation 2016/17:68 om svenska kommuners skatteintäkter
           </note>
  • Correct
  • Incorrect

corpus/protocols/201617/prot-201617--29.xml

Diff starting from line 6744

@@ -6706,6 +6744,8 @@
             investeringsprodukter för icke-professionella investerare (Priip-produkter)
             vad gäller förordningens tillämpningsdag
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-VzJ1qH8cMTdndKv4wkVYBC">
           <note xml:id="i-SWLCKs9UiqCqTLYQTNZxLC">
             § 20 Anmälan om interpellationer
           </note>
  • Correct
  • Incorrect

corpus/protocols/201617/prot-201617--71.xml

Diff starting from line 59

@@ -59,14 +59,18 @@
         </div>
       </front>
       <body>
-        <div>
+        <div type="commentSection" xml:id="i-YFtCuWMiDDWu6Uop9FZMJt">
           <pb facs="http://data.riksdagen.se/fil/FEB488CD-C695-4AC6-BF50-03E8DD992394#page=1"/>
+        </div>
+        <div type="commentSection" xml:id="i-BMV2VyzC6umMTWsPP1uyaw">
           <note xml:id="i-RHzFCz3FUyw7P9THXX4PPd">
             § 1 Justering av protokoll
           </note>
           <note xml:id="i-HkQCKHagHXAHV4tTTmBQLY">
             Protokollet för den 31 januari justerades.
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-MnPga3aEKRTDK14nvituAw">
           <note xml:id="i-XeK4DbvNUKiAio9JknQBBb">
             § 2 Anmälan om ny riksdagsledamot
           </note>
  • Correct
  • Incorrect

corpus/protocols/201718/prot-201718--16.xml

Diff starting from line 2097

@@ -2087,6 +2097,8 @@
           <note xml:id="i-VprfTiJcobS5BY1aFLWkUu">
             till statsrådet Tomas Eneroth (S)
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-9AJtcwMQdjThYeF4js2jhp">
           <note xml:id="i-42sh1hHRcpx3F8E3zerAEG">
             § 6 Anmälan om frågor för skriftliga svar
           </note>
  • Correct
  • Incorrect

corpus/protocols/201819/prot-201819--29.xml

Diff starting from line 896

@@ -882,6 +896,8 @@
             RiR 2018:32 Förvaltningen av premiepensionssystemet – kostnadseffektivitet
             för spararnas bästa?
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-URPXaDxaHtMWPbrpGJo7Nn">
           <note xml:id="i-AamRW2k5tVJHzPJHHx3khK">
             § 8 Ärende för hänvisning till utskott
           </note>
  • Correct
  • Incorrect

corpus/protocols/201819/prot-201819--81.xml

Diff starting from line 12005

@@ -12005,7 +12005,7 @@
             Anf. 74 Statsminister STEFAN LÖFVEN (S)
           </note>
         </div>
-        <div type="debateSection">
+        <div type="debateSection" xml:id="i-9SAZJyKhY4oW94uYaAuTm6">
           <note xml:id="i-7PskJ8BQ6tG5KGB9E3NL4M">
             § 8 (forts. från § 6) Kriminalvårdsfrågor (forts. JuU13)
           </note>
  • Correct
  • Incorrect

corpus/protocols/202021/prot-202021--12.xml

Diff starting from line 930

@@ -908,21 +930,33 @@
           <note xml:id="i-4zgVw8CEi2V7WRGwEPDeUT">
             Innehållsförteckning
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-2yzxffrPR29NXGwzQyf65k">
           <note xml:id="i-TFmWCfnWAdd55cQ87ESsaG">
             § 1 Avsägelser
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-3X51zpqMQnPRasbrdG4qEZ">
           <note xml:id="i-V2SEZX2HhFhFB6rg2iR2Jq">
             § 2 Anmälan om kompletteringsval
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-KzbhxJEVuoCbBNVf2jZWU9">
           <note xml:id="i-DUfcrF46i39FucKjtYMUNQ">
             § 3 Anmälan om subsidiaritetsprövning
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-Xvxh559FpJiAFQb7HF7ygT">
           <note xml:id="i-Q2an59QFjwVGMYVhqT2Qm9">
             § 4 Anmälan om fördröjt svar på interpellation
           </note>
+        </div>
+        <div type="commentSection" xml:id="i-7BDA4TFYeNEFgy8PCrZJDQ">
           <note xml:id="i-LZDTcAcrZpcLuqjtAxcXyf">
             § 5 Ärenden för hänvisning till utskott
           </note>
+        </div>
+        <div type="debateSection" xml:id="i-SVdbsBgArW7NeiGihZY8HX">
           <note xml:id="i-NRBVBusRwHf7TJjewfHLwx">
             § 6 Svar på interpellation 2019/20:451 om bistånd till stater
             som inte respekterar mänskliga rättigheter
  • Correct
  • Incorrect

@MansMeg
Collaborator

MansMeg commented Oct 25, 2023

Any ideas how we formally know if it is correct or not?

@MansMeg
Collaborator

MansMeg commented Oct 25, 2023

Still the problem that <pb> tags become a section. This should be easy to fix?

Also, the innehållsförteckning (table of contents) seems to incorrectly end up split into a large number of sections. Is this easy to fix?

@BobBorges
Collaborator Author

Any ideas how we formally know if it is correct or not?

I guess if the div is not empty, doesn't contain multiple sections, and has the type+id attribs.
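These criteria could be checked mechanically. Below is a minimal sketch, not the project's actual test code. It assumes that a section head is a `<note>` whose text starts with `§` and that a div holding nothing but `<pb>` counts as empty; `check_div` and `local` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local(tag):
    """Strip any namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def check_div(div):
    """Return a list of problems found in one section <div> (hypothetical helper)."""
    problems = []
    if not div.get("type") or not div.get(XML_ID):
        problems.append("missing type or xml:id")
    # Assumption: page breaks alone do not make a section non-empty.
    children = [c for c in div if local(c.tag) != "pb"]
    if not children:
        problems.append("empty (nothing but <pb>)")
    # Assumption: a section head is a <note> whose text starts with the § character.
    heads = [
        c for c in children
        if local(c.tag) == "note" and (c.text or "").strip().startswith("§")
    ]
    if len(heads) > 1:
        problems.append("contains multiple § heads")
    return problems
```

An empty list means the div passes all three checks; anything else names the criterion that failed, which could be used to pre-fill the Correct/Incorrect marks in a sample.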

@BobBorges
Collaborator Author

Still the problem that <pb> tags become a section. This should be easy to fix?

I don't follow.

Also, the innehållsförteckning (table of contents) seems to incorrectly end up split into a large number of sections. Is this easy to fix?

After merging this it's what I wanted to do first after taking a first crack at identifying the interpellation debates. I don't think it would be too difficult, but you never know until you actually start doing it.

@BobBorges
Collaborator Author

At this stage -- given it's the first kind of attempt at creating sections -- unless there is something really bad, i.e. that worsens the quality of the data/work we've already done (which I don't see in the sample or in other edits), then we should accept this round of div additions.

I see many things that could be better, but I don't think we will get it all right at once. Some incorrect section delimitation is an improvement over no section delimitation.

  • moving solo <pb> elems (that's what I didn't get before) into adjacent divs
  • joining stray solo sections under a unified table of contents
  • finding additional section delimitation
    • by missing nrs in the sequence, and / or
    • finding the end of a real section before the end of the div, e.g.:
      [image]

...can all be done in steps (minimal PR!), but if we sit on this for too long it blocks me from categorizing the debates
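As an illustration of the first item, moving solo <pb> elems into an adjacent div could look roughly like the sketch below. It is not the code in this PR: it assumes the section divs are direct children of `<body>` and that a lone `<pb>` belongs with the div that follows it; `merge_solo_pb_divs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def merge_solo_pb_divs(body):
    """Move the contents of <div>s holding only <pb> into the next <div>, in place."""
    divs = [el for el in body if local(el.tag) == "div"]
    for div, nxt in zip(divs, divs[1:]):
        children = list(div)
        if children and all(local(c.tag) == "pb" for c in children):
            # Prepend the page breaks to the following div, keeping their order.
            for i, pb in enumerate(children):
                div.remove(pb)
                nxt.insert(i, pb)
            body.remove(div)
    return body
```

A `<pb>`-only div at the very end of `<body>` has no following div and is left untouched in this sketch; it would need a separate rule.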

@MansMeg
Collaborator

MansMeg commented Oct 25, 2023

I fully agree.

  1. I fully agree that we should do minimal PRs. That said, e.g. fixing the <pb> tag seems so small that it is just a quick fix (a couple of lines of code). Then we might just fix it, right? The other issues seem to need some additional work.
  2. The revision control: we need to check that these divs are correct, and that includes both debateSection and commentSection. I guess we only check that a debateSection contains a real debate (or a section of a debate), right?

@BobBorges
Collaborator Author

No it's not that much work to fix stray <pb> elems in a section, but...

2.1. We wanted to do this quality control before committing edits to the whole set of protocols for reasons of economy. So either we approve what's here and I can commit it, then fix the pb thing with another commit (before merging the PR), or I can fix it now in the already modified files, but then we conflate 'types' of edits in one piece of the revision history.

2.2. Debate sections have intros, comment sections don't -- it seems like a reasonable criterion for evaluation. Should I check that in the sample? I'd like to be able to take this a step or two forward today.
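The intro criterion could be expressed as a small check. This is only a sketch under one loud assumption: that an "intro" means a `<note type="speaker">` element, as seen in the diff samples above. The thread does not define the term, and `intro_criterion_ok` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def intro_criterion_ok(div):
    """True if a debateSection has an intro and any other section type has none."""
    # Assumption: an intro is a <note type="speaker"> anywhere inside the div.
    has_intro = any(
        local(el.tag) == "note" and el.get("type") == "speaker"
        for el in div.iter()
    )
    return has_intro == (div.get("type") == "debateSection")
```

Running this over the sampled divs would flag both debate sections without any intro and comment sections that contain one.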

@MansMeg
Collaborator

MansMeg commented Oct 26, 2023

  1. I was thinking of fixing all <pb>, not just the stray ones. An estimate is that roughly 2% of all edits are due to this problem?

2.a. I'm not sure I followed. I just checked for obvious errors and found those. If we fix those, we can get a new sample to assess. That should not conflate anything or be problematic?

2.b. Great. I just wanted to know. Then it seems good to just check the debates based on this definition, and check that the commentSections are not incorrect and that no incorrect divs are introduced.

But this raises the issue that we need to start defining divs in a better way, because this sits somewhere between an analytic decision and a data-authentic one, and we want to be as close to the latter as possible.

@BobBorges
Collaborator Author

I've gone through them now: mostly they're ok. Marked correct if:

  • div elem has id
  • schematically correct
  • debates have intros/comment sections have no intros

It looks like 6 are incorrect by those criteria and the incorrect ones are due to lone <pb> elems in a div or the content of the table of contents section getting tagged as section head and intros. I'll commit the rest of the protocols, then let's merge and I'll open issues for these two problems.

@MansMeg
Collaborator

MansMeg commented Oct 26, 2023

Great! Will you open an issue?

@BobBorges
Collaborator Author

#405

@BobBorges
Collaborator Author

@MansMeg will you merge when the tests pass?

@MansMeg MansMeg merged commit 950a38e into dev Oct 26, 2023
3 checks passed