This workflow describes the steps performed to convert a letter index into CMIF format. Tools used:
- Transkribus
- OpenRefine
- VS Code
- Python (Jupyter Notebooks)
step | comment |
---|---|
define regions manually | |
run line detection automatically | |
run text recognition | |
export the document as simple text | |
OpenRefine is free to download and use | |
it's a Java web server (currently Java 8) | |
launch OpenRefine from your shell | |
the web service is ready on your local machine at port 3333 | |
Firefox and Chrome both work as browsers | |
browse to your input file | |
left: the original scan | |
right: the Transkribus output in simple text format | |
click Next | |
adjust the import options; I kept the defaults | |
click Create Project >> | |
your project is now started | |
apply operations to shape your data | |
if you have done this before, you can apply the same operations again | |
assuming you saved those operations | |
open Undo/Redo | |
copy the previous operations to reapply them | |
paste the previous operations in and click Perform Operations | |
check the result | |
the data is shaped as before | |
OpenRefine allows editing | |
OCR recognition isn't error-free | |
edit manually | |
the edition uses [ ] to flag uncertain data; human-readable dates may be converted into a more machine-friendly form | |
let a human make sense of the data | |
OCR produced the year 190, which is way off | |
when you are done, export the data as JSON using templating | |
at this step, the defaults are fine | |
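
The date handling described in the steps above (bracket flags for uncertain data, human-readable dates turned into a machine-friendly form) can be sketched in Python. This is only an illustration, not the OpenRefine operations actually used: the function name `normalize_date` is hypothetical, the rule "any square bracket marks the date as uncertain" is an assumption, and the field names simply follow the JSON shown later in this document.

```python
import re
from datetime import date

# Assumed input shape: "day. month. year", e.g. "28. 12. 1898".
DATE_PATTERN = re.compile(r"^(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})$")


def normalize_date(date_text):
    """Return a record with date_text, date_when (ISO string or None) and uncertain."""
    # Assumption: any square bracket in the text marks the date as uncertain.
    uncertain = "[" in date_text or "]" in date_text
    cleaned = date_text.replace("[", "").replace("]", "").strip()

    date_when = None
    match = DATE_PATTERN.match(cleaned)
    if match:
        day, month, year = (int(part) for part in match.groups())
        try:
            date_when = date(year, month, day).isoformat()
        except ValueError:
            # Impossible calendar date; leave it for a human to resolve.
            date_when = None

    return {"date_text": date_text, "date_when": date_when, "uncertain": uncertain}


print(normalize_date("28. 12. 1898"))
print(normalize_date("Sommer 1889"))
```

Texts that do not match the pattern, such as a season with a year or a truncated year like 190, keep `date_when` empty so that they stand out for manual editing.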
The basic conversion is finished; now it is a good idea to validate your work before going further. A high-level programming language is useful here. I personally like Python very much; in conjunction with Jupyter Notebook it becomes even simpler to perform an infrequently used workflow. Notes help to explain what is going on, and debugging is simple too.
On CentOS 8, installing the package looks like this:
pip3 install jupyterlab
After the package has been installed,
jupyter notebook
launches the Jupyter web service on your local machine.
validateDates.ipynb
reshapes the JSON created by OpenRefine.
Names spanning multiple lines are combined into one, and the listed dates are
validated; chronological order is assumed.
OCR text recognition might not be error-free. You glance at the Transkribus result and all looks good; nevertheless, a semantic check of the result is useful.
Our data is an index from a book. Each entry starts with a name (the receiver of the letter), possibly spanning multiple lines, followed by dates sorted from oldest to newest.
Friedmann, Ernst
13. 9. 1912
Friese, Carl
13. 2. 1907
20. 2. 1907
Fulda, Ludwig
28. 11. 1898
28. 12. 1898
4. 1. 1899
10. 3. 1890
25. 4. 1899
20. 6. 1900
23. 6. 1900
7. 7. 1900
17. 1. 1901
22. 3. 1901
8. 6. 1901
23. 7. 1904
25. 8. 1904
Glümer, Marie
Sommer 1889
4. 8. 1889
5. 8. 1889
With OpenRefine, a JSON file was created:
}, {
"recv": "Fulda, Ludwig",
"date_when": "1898-12-28",
"date_text": "28. 12. 1898",
"uncertain": false
}, {
"recv": "Fulda, Ludwig",
"date_when": "1899-01-04",
"date_text": "4. 1. 1899",
"uncertain": false
}, {
"recv": "Fulda, Ludwig",
"date_when": "1890-03-10",
"date_text": "10. 3. 1890",
"uncertain": false
}, {
"recv": "Fulda, Ludwig",
"date_when": "1899-04-25",
"date_text": "25. 4. 1899",
"uncertain": false
}, {
Only a close look at this data, whether the raw text or the refined data, reveals the recognition error at 10. 3. 1890 (1890-03-10); seen in context, it is supposed to be 10. 3. 1899. However, we don't want to replace humans, so it's up to a human to decide.
The goal of this notebook is to find suspicious dates, that is, dates that are out of order. If a newer date appears in the previous line, one of the two is probably wrong.
}, {
"recv": "Fulda, Ludwig",
"date_when": "1898-12-28",
"date_text": "28. 12. 1898",
"uncertain": false
}, {
"recv": "Fulda, Ludwig",
"date_when": "1899-01-04",
"date_text": "4. 1. 1899",
"uncertain": false,
"suspicious": "true"
}, {
"recv": "Fulda, Ludwig",
"date_when": "1890-03-10",
"date_text": "10. 3. 1890",
"uncertain": false,
"suspicious": "true"
}, {
"recv": "Fulda, Ludwig",
"date_when": "1899-04-25",
"date_text": "25. 4. 1899",
"uncertain": false
}, {
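
The flagging shown in the excerpt above can be sketched as follows. This is not the notebook's actual code: the function name `flag_out_of_order` is hypothetical, and the behaviour (mark both neighbours with `"suspicious": "true"` when a receiver's date is older than the one before it) is inferred from the excerpt only.

```python
def flag_out_of_order(records):
    """Mark neighbouring records of one receiver whose dates go backwards."""
    previous = None  # last record that carried a full date
    for record in records:
        current_when = record.get("date_when")
        if not current_when:
            # No full date available; nothing to compare against.
            continue
        # ISO dates with four-digit years sort correctly as plain strings.
        if (
            previous is not None
            and previous.get("recv") == record.get("recv")
            and current_when < previous["date_when"]
        ):
            # One of the two is probably wrong; a human decides which.
            previous["suspicious"] = "true"
            record["suspicious"] = "true"
        previous = record
    return records


rows = [
    {"recv": "Fulda, Ludwig", "date_when": "1898-12-28", "date_text": "28. 12. 1898", "uncertain": False},
    {"recv": "Fulda, Ludwig", "date_when": "1899-01-04", "date_text": "4. 1. 1899", "uncertain": False},
    {"recv": "Fulda, Ludwig", "date_when": "1890-03-10", "date_text": "10. 3. 1890", "uncertain": False},
    {"recv": "Fulda, Ludwig", "date_when": "1899-04-25", "date_text": "25. 4. 1899", "uncertain": False},
]
for row in flag_out_of_order(rows):
    print(row["date_text"], row.get("suspicious", ""))
```

Comparing only neighbouring records keeps the check simple, but it cannot tell which of the two dates is the wrong one; that decision stays with the human editor.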
At this point we have successfully shaped the text input into a table, manually corrected errors, generated a JSON representation and validated the dates. Now we'd like to get a CMIF JSON file.
We are almost there. Just open CMIF Creator