feat: add OCR workflow
BobBorges committed Nov 13, 2024
1 parent 3f88dcf commit 2b0a4b6
Showing 20 changed files with 1,814 additions and 30 deletions.
91 changes: 91 additions & 0 deletions quality/data/ocr-estimation/sample_1860_annotated.csv

Large diffs are not rendered by default.

91 changes: 91 additions & 0 deletions quality/data/ocr-estimation/sample_1870_annotated.csv

Large diffs are not rendered by default.

91 changes: 91 additions & 0 deletions quality/data/ocr-estimation/sample_1880_annotated.csv

Large diffs are not rendered by default.

91 changes: 91 additions & 0 deletions quality/data/ocr-estimation/sample_1890_annotated.csv

Large diffs are not rendered by default.

91 changes: 91 additions & 0 deletions quality/data/ocr-estimation/sample_1900_annotated.csv

Large diffs are not rendered by default.

91 changes: 91 additions & 0 deletions quality/data/ocr-estimation/sample_1910_annotated.csv

Large diffs are not rendered by default.

94 changes: 94 additions & 0 deletions quality/data/ocr-estimation/sample_1920_annotated.csv

Large diffs are not rendered by default.

91 changes: 91 additions & 0 deletions quality/data/ocr-estimation/sample_1930_annotated.csv

Large diffs are not rendered by default.

112 changes: 112 additions & 0 deletions quality/data/ocr-estimation/sample_1940_annotated.csv

Large diffs are not rendered by default.

178 changes: 178 additions & 0 deletions quality/data/ocr-estimation/sample_1950_annotated.csv

Large diffs are not rendered by default.

178 changes: 178 additions & 0 deletions quality/data/ocr-estimation/sample_1960_annotated.csv

Large diffs are not rendered by default.

103 changes: 103 additions & 0 deletions quality/data/ocr-estimation/sample_1970_annotated.csv

Large diffs are not rendered by default.

79 changes: 79 additions & 0 deletions quality/data/ocr-estimation/sample_1980_annotated.csv

Large diffs are not rendered by default.

30 changes: 0 additions & 30 deletions quality/docs/qe-ocr.md

This file was deleted.

57 changes: 57 additions & 0 deletions quality/docs/qe_ocr-estimation.md
@@ -0,0 +1,57 @@
# Optical Character Recognition (OCR) error

## Summary

The goal is to estimate the total OCR error in the corpus. OCR quality may differ between years and between document types, so we take a stratified sample by year and document type.


## What is the problem

In this quality dimension we want to estimate the total OCR error in the corpus, which corresponds to the textual representation error described in [Hurtado Bodell, Magnusson & Mützel (2022)](https://raw.githubusercontent.com/swerik-project/swerik-reference-list/refs/heads/main/bibfiles/HurtadoBodellMagnussonMutzel2022.bib). OCR quality matters in many research applications that rely on the text being correct.


## Estimation procedure

This is a stratified cluster sample, where the page is the cluster and the strata are combinations of year and document type.


### Sampling plan

We take a stratified sample of two pages per year and document type. On each page, the annotator counts the rows in the body text, records that number, and takes a random sample of three rows.
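
The page-level step of this plan can be sketched as follows. This is a minimal illustration and not the code used in this commit: the page records and the field names `year` and `doc_type` are hypothetical, and the sketch assumes that pages are drawn uniformly at random within each stratum.

```python
import random
from collections import defaultdict


def sample_pages(pages, pages_per_stratum=2, seed=None):
    """Draw up to `pages_per_stratum` pages from each (year, doc_type) stratum.

    `pages` is a list of dicts with the hypothetical keys "year" and "doc_type".
    """
    rng = random.Random(seed)

    # Group the pages by stratum.
    strata = defaultdict(list)
    for page in pages:
        strata[(page["year"], page["doc_type"])].append(page)

    # Sample without replacement inside every stratum.
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = min(pages_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Passing a fixed `seed` makes the draw reproducible, which helps when the sample has to be regenerated later.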

If the page has two columns, count each row in each column as a separate row and double the sample size to six rows, i.e. the equivalent of three full lines per page.
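
The row-level step might look like the sketch below. It assumes that `nrows` already counts row-column combinations on a two-column page, as described above; the function name and its parameters are illustrative and do not come from the repository.

```python
import random


def sample_rows(nrows, ncols=1, rows_per_column=3, seed=None):
    """Pick the rows to annotate on one page.

    `nrows` is the total number of rows (row-column combinations on a
    multi-column page). Three rows are drawn per column, so a two-column
    page yields six rows. Row numbers are 1-based and returned in order.
    """
    rng = random.Random(seed)
    k = min(rows_per_column * ncols, nrows)
    return sorted(rng.sample(range(1, nrows + 1), k))
```

For example, `sample_rows(32)` returns three distinct row numbers between 1 and 32, and `sample_rows(80, ncols=2)` returns six.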


### Annotation guidelines

Annotators receive a CSV file that lists each sampled page together with a link to it, with three rows per page.

- Start by counting the total number of rows (or row-column combinations) in the main text, ignoring marginal notes. Enter the total in the `NROWS` column.
- Then sample three rows (or six rows if the page has two columns) and indicate the sampled rows in the CSV.
- Write down the row number and the content of each sampled row in the CSV file (one row per line), in order (i.e. the first row first).
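
A simple consistency check on a finished annotation file could be sketched as below. The column names `NROWS`, `row_to_check` and `annotation` are taken from the TSV files in this commit; that the annotated CSV files use the same names is an assumption, and the checks themselves are only suggestions.

```python
import csv
import io


def check_annotations(csv_text):
    """Return a list of (line_number, message) problems in an annotated CSV."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            nrows = int(row["NROWS"])
            checked = int(row["row_to_check"])
        except (KeyError, TypeError, ValueError):
            problems.append((line_no, "NROWS and row_to_check must be integers"))
            continue
        # A sampled row cannot lie outside the counted rows.
        if not 1 <= checked <= nrows:
            problems.append((line_no, "row_to_check is outside 1..NROWS"))
        # Every sampled row needs a transcription.
        if not (row.get("annotation") or "").strip():
            problems.append((line_no, "annotation is empty"))
    return problems
```

An empty result means that no problems of these kinds were found; it says nothing about the quality of the transcription itself.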


## Other comments

The quality data has been annotated by students at Uppsala University.


## References

[Hurtado Bodell, Magnusson & Mützel 2022](https://raw.githubusercontent.com/swerik-project/swerik-reference-list/refs/heads/main/bibfiles/HurtadoBodellMagnussonMutzel2022.bib)

```bibtex
@article{HurtadoBodellMagnussonMuntzel2022,
author = {Hurtado Bodell, Miriam AND Magnusson, Måns AND Mützel, Sophie},
title ={From Documents to Data: A Framework for Total Corpus Quality},
journal = {Socius},
volume = {8},
pages = {23780231221135523},
year = {2022},
doi = {10.1177/23780231221135523},
URL = {https://doi.org/10.1177/23780231221135523},
eprint = {https://doi.org/10.1177/23780231221135523},
abstract = { As large corpora of digitized text become increasingly available, researchers are rediscovering textual data’s potential fruitfulness for inquiries into social and cultural phenomena. Although textual corpora promise to enrich our knowledge of the social world, avoiding problems related to data quality remains a challenge to related empirical research. Hence, evaluating the quality of a corpus will be pivotal for future social scientific inquiries. The authors propose a conceptual framework for total corpus quality, incorporating three crucial dimensions: total corpus error, corpus comparability, and corpus reproducibility. These dimensions affect the validity and reliability of inferences drawn from textual data. In addition, the authors’ framework provides insights toward evaluating and improving studies on the basis of large-scale textual analyses. After outlining this framework, the authors then illustrate an application of the total corpus quality framework by an example case study using digitized newspaper articles to study topic salience over 75 years. }
}
```
19 changes: 19 additions & 0 deletions quality/docs/template.md
@@ -0,0 +1,19 @@
# Title

## Summary
Short description of the quality dimension.

## What is the problem
What do we want to estimate? What is the problem, and why has it been included as a quality dimension?

## Estimation procedure
How is the estimation conducted (in words)? What dataset has been used for the estimate?

### Sampling plan [if applicable]
How has the sampling been done?

### Annotation guidelines [if applicable]
What is the change that we're proposing and/or doing?

## Previous experiences
E.g. how long does it take to annotate?
3 changes: 3 additions & 0 deletions quality/estimates/ocr-estimation/metrics.csv
@@ -0,0 +1,3 @@
decade,lev_mean,lev_first_q,lev_third_q,wer_mean,wer_first_q,wer_third_q,cer_mean,cer_first_q,cer_third_q,perfect_match
1860,0.2,0.0,0.0,0.03261904865503311,0.0,0.06250000186264515,0.002866502866502867,0.0,0.0,0.8
1940,0.5,0.0,0.75,0.12853535413742065,0.0,0.08901515416800976,0.0215792646172393,0.0,0.0189873417721519,0.7
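
The per-decade summary columns can be reproduced with a short sketch like the one below. It is not the code from this commit: it assumes that the quartiles use linear interpolation (the `inclusive` method of `statistics.quantiles`) and that `perfect_match` is the share of rows with `lev` equal to 0. Under those assumptions, the ten `lev` values listed for the 1940s in `mpl_lev+.tsv` give the `lev_mean`, `lev_first_q`, `lev_third_q` and `perfect_match` entries of the 1940 row.

```python
from statistics import mean, quantiles


def summarise(values):
    """Mean and first/third quartile of one metric within a decade."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"mean": mean(values), "first_q": q1, "third_q": q3}


def perfect_match_share(lev_values):
    """Share of rows whose Levenshtein distance is zero."""
    return sum(1 for v in lev_values if v == 0) / len(lev_values)


# `lev` column of the ten 1940s rows in mpl_lev+.tsv.
lev_1940 = [0, 0, 0, 0, 2, 1, 0, 0, 0, 2]
print(summarise(lev_1940))
print(perfect_match_share(lev_1940))
```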
21 changes: 21 additions & 0 deletions quality/estimates/ocr-estimation/mpl_lev+.tsv
@@ -0,0 +1,21 @@
prot	annotation	NROWS	NCOLS	row_to_check	most_probable_line	lev	year	decade	wer	cer
data/1867/prot-1867--ak--0130.xml	Utskottet; och den torde nu rättast böra till ett tillfälligt Utskott hän-	32	1	17	Utskottet; och den torde nu rättast böra till ett tillfälligt Utskott hänv	1	1867	1860	0.0833333358168602	0.013513513513513514
data/1867/prot-1867--ak--0130.xml	Utskott möjligen ega anspråk på handläggningen af ämnet	32	1	25	Utskott möjligen ega anspråk på handläggningen af ämnet	0	1867	1860	0.0	0.0
data/1867/prot-1867--ak--0130.xml	det för tvistens slitande vara lämpligast	32	1	26	det för tvistens slitande vara lämpligast	0	1867	1860	0.0	0.0
data/1867/prot-1867--ak--0309.xml	att man snarare kan säga	49	1	8	att man snarare kan säga	0	1867	1860	0.0	0.0
data/1867/prot-1867--ak--0309.xml	bland annat	49	1	24	bland annat	0	1867	1860	0.0	0.0
data/1867/prot-1867--ak--0309.xml	anstalter utöfvade en skadlig inverkan på folket. Under åberopande af	49	1	46	anstalter utöfvade en skadlig inverkan på folket. Under åberopande af	0	1867	1860	0.0	0.0
data/1867/prot-1867--ak--0412.xml	jag deremot tror mig vara viss uppå	44	1	13	Jag deremot tror mig vara viss uppå	0	1867	1860	0.1428571492433548	0.0
data/1867/prot-1867--ak--0412.xml	att något fullständiga Herr Björcks yttrande. Jag kan nemligen be-	44	1	24	att något fullständiga Herr Björcks yttrande. Jag kan nemligen bes	1	1867	1860	0.10000000149011612	0.015151515151515152
data/1867/prot-1867--ak--0412.xml	bankoreglemente skall oförändradt bibehållas	44	1	38	bankoreglemente skall oförändradt bibehållas	0	1867	1860	0.0	0.0
data/1867/prot-1867--ak--0417.xml	Vidare anfördes ej; och Utskottets hemställande afslogs af Kammaren,	42	1	25	Vidare anfördes ej; och Utskottets hemställande afslogs af Kammaren,	0	1867	1860	0.0	0.0
data/1940/prot-1940--ak--026.xml	då vill jag i de svenska arbetarkvinnornas namn förklara	46	1	19	då vill jag i de svenska arbetarkvinnornas namn förklara	0	1940	1940	0.0	0.0
data/1940/prot-1940--ak--026.xml	Föredrogs	46	1	35	Föredrogs	0	1940	1940	0.0	0.0
data/1940/prot-1940--ak--026.xml	§ 10.	46	1	41	§ 10.	0	1940	1940	0.0	0.0
data/1940/prot-1940--fk--003.xml	hjälp. En internationell vädjan I denna riktning har under svensk medverkan	54	1	16	hjälp. En internationell vädjan i denna riktning har under svensk medverkan	0	1940	1940	0.09090909361839294	0.0
data/1940/prot-1940--fk--003.xml	på ett något tidigare stadium. Vidare skulle jag vilja hemställa till vederbö-	54	1	33	på ett något tidigare stadium. Vidare skulle jag vilja hemställa till vederböra	2	1940	1940	0.0833333358168602	0.02531645569620253
data/1940/prot-1940--fk--003.xml	ställan	54	1	38	mställan	1	1940	1940	1.0	0.14285714285714285
data/1941/prot-1941--ak--007.xml	Förslag till lag om dyrtidstillägg å folkpensioner	53	1	2	Förslag till lag om dyrtidstillägg å folkpensioner	0	1941	1940	0.0	0.0
data/1941/prot-1941--ak--007.xml	det andra anser jag	53	1	22	det andra anser jag	0	1941	1940	0.0	0.0
data/1941/prot-1941--ak--007.xml	föreliggande propositionen väckt en motion	53	1	43	föreliggande propositionen väckt en motion	0	1941	1940	0.0	0.0
data/1941/prot-1941--fk--029.xml	så sent som den 17 januari i år beslutande	53	1	14	så sent som den 17 januari i år beslutade,	2	1941	1940	0.1111111119389534	0.047619047619047616
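
For readers unfamiliar with the `lev`, `wer` and `cer` columns, the sketch below uses the textbook definitions: Levenshtein distance between the annotation and the OCR line, word error rate as the word-level distance divided by the number of annotated words, and character error rate as the character-level distance divided by the length of the annotation. The commit's own implementation is not shown here and may differ in details such as case handling or normalisation, so treat this as an assumption-laden illustration.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # delete r from the reference
                current[j - 1] + 1,      # insert h into the reference
                previous[j - 1] + cost,  # substitute r by h (or keep it)
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference (annotation)."""
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate relative to the reference, splitting on whitespace."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates divide by the size of the reference, so an empty annotation raises `ZeroDivisionError` and should be filtered out beforehand.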
21 changes: 21 additions & 0 deletions quality/estimates/ocr-estimation/mpl_lev.tsv
@@ -0,0 +1,21 @@
prot	annotation	NROWS	NCOLS	row_to_check	most_probable_line	lev
data/1867/prot-1867--ak--0130.xml	Utskottet; och den torde nu rättast böra till ett tillfälligt Utskott hän-	32	1	17	Utskottet; och den torde nu rättast böra till ett tillfälligt Utskott hänv	1
data/1867/prot-1867--ak--0130.xml	Utskott möjligen ega anspråk på handläggningen af ämnet	32	1	25	Utskott möjligen ega anspråk på handläggningen af ämnet	0
data/1867/prot-1867--ak--0130.xml	det för tvistens slitande vara lämpligast	32	1	26	det för tvistens slitande vara lämpligast	0
data/1867/prot-1867--ak--0309.xml	att man snarare kan säga	49	1	8	att man snarare kan säga	0
data/1867/prot-1867--ak--0309.xml	bland annat	49	1	24	bland annat	0
data/1867/prot-1867--ak--0309.xml	anstalter utöfvade en skadlig inverkan på folket. Under åberopande af	49	1	46	anstalter utöfvade en skadlig inverkan på folket. Under åberopande af	0
data/1867/prot-1867--ak--0412.xml	jag deremot tror mig vara viss uppå	44	1	13	Jag deremot tror mig vara viss uppå	0
data/1867/prot-1867--ak--0412.xml	att något fullständiga Herr Björcks yttrande. Jag kan nemligen be-	44	1	24	att något fullständiga Herr Björcks yttrande. Jag kan nemligen bes	1
data/1867/prot-1867--ak--0412.xml	bankoreglemente skall oförändradt bibehållas	44	1	38	bankoreglemente skall oförändradt bibehållas	0
data/1867/prot-1867--ak--0417.xml	Vidare anfördes ej; och Utskottets hemställande afslogs af Kammaren,	42	1	25	Vidare anfördes ej; och Utskottets hemställande afslogs af Kammaren,	0
data/1940/prot-1940--ak--026.xml	då vill jag i de svenska arbetarkvinnornas namn förklara	46	1	19	då vill jag i de svenska arbetarkvinnornas namn förklara	0
data/1940/prot-1940--ak--026.xml	Föredrogs	46	1	35	Föredrogs	0
data/1940/prot-1940--ak--026.xml	§ 10.	46	1	41	§ 10.	0
data/1940/prot-1940--fk--003.xml	hjälp. En internationell vädjan I denna riktning har under svensk medverkan	54	1	16	hjälp. En internationell vädjan i denna riktning har under svensk medverkan	0
data/1940/prot-1940--fk--003.xml	på ett något tidigare stadium. Vidare skulle jag vilja hemställa till vederbö-	54	1	33	på ett något tidigare stadium. Vidare skulle jag vilja hemställa till vederböra	2
data/1940/prot-1940--fk--003.xml	ställan	54	1	38	mställan	1
data/1941/prot-1941--ak--007.xml	Förslag till lag om dyrtidstillägg å folkpensioner	53	1	2	Förslag till lag om dyrtidstillägg å folkpensioner	0
data/1941/prot-1941--ak--007.xml	det andra anser jag	53	1	22	det andra anser jag	0
data/1941/prot-1941--ak--007.xml	föreliggande propositionen väckt en motion	53	1	43	föreliggande propositionen väckt en motion	0
data/1941/prot-1941--fk--029.xml	så sent som den 17 januari i år beslutande	53	1	14	så sent som den 17 januari i år beslutade,	2
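
The column name `most_probable_line` suggests that each annotation is matched to the OCR line that resembles it most. One plausible way to do this, sketched below, is to pick the candidate line with the smallest Levenshtein distance to the annotation. This is an assumption about the workflow rather than a description of the code in this commit; the function names and the way candidate lines are supplied are hypothetical.

```python
def levenshtein(a, b):
    """Character-level edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def most_probable_line(annotation, ocr_lines):
    """Return the OCR line closest to the annotation and its distance.

    Ties are resolved in favour of the earliest candidate line.
    """
    distances = [levenshtein(annotation, line) for line in ocr_lines]
    best = min(range(len(ocr_lines)), key=distances.__getitem__)
    return ocr_lines[best], distances[best]
```

With the annotation `ställan` and the candidate lines `Föredrogs`, `mställan` and `§ 10.`, the sketch returns `mställan` with distance 1.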