Workflow Arthur Schnitzler Briefe

This workflow describes the steps performed to convert a letter index into CMIF format. Tools used: Transkribus, OpenRefine, Jupyter Notebook and CMIF Creator.

| step | comment |
| --- | --- |
| 1 | regions are defined manually |
| 2 | line detection runs automatically<br>text recognition |
| 3 | export the document as simple text |
| 4 | OpenRefine is free to download and use<br>it is a Java web server (currently Java 8) |
| 5 | launch OpenRefine in your shell |
| 6 | the web service is ready on your local machine at port 3333<br>Firefox and Chrome both work as browser |
| 7 | browse to your input file |
| 8 | left is the original scan<br>right is the Transkribus output in simple text format |
| 9 | click Next |
| 10 | adjust the import options, I left the defaults<br>click Create Project >> |
| 11 | now your project is started<br>apply operations to shape your data |
| 12 | if you have done so previously, it is possible to apply the operations again<br>assuming you preserved those operations<br>open Undo/Redo |
| 13 | copy the previous operations to reapply them |
| 14 | paste the previous operations in and click Perform Operations |
| 15 | notice the result<br>the data is shaped as before |
| 16 | OpenRefine allows editing |
| 17 | OCR recognition isn't error free |
| 18 | edit manually |
| 19 | the edition uses [ ] to flag uncertain data; human-readable dates may be converted into a more machine-friendly form |
| 20 | use a human to bring sense into the data |
| 21 | OCR led to the year 190, which is way off |
| 22 | when you are done, export the data as JSON using templating |
| 23 | at this step, the defaults are good |
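
The conversion of human-readable dates into a machine-friendly form mentioned above can be sketched in Python. This is only an illustration, not the operation used in the OpenRefine project: the function name `to_iso` is hypothetical, and the handling of `[ ]` (brackets are stripped and the entry is marked uncertain) is an assumption about how the edition's flags should be read.

```python
import re
from datetime import date

# Matches dates written like "28. 12. 1898" (day. month. four-digit year).
DATE_RE = re.compile(r"^(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})$")

def to_iso(text):
    """Return (iso_date_or_None, uncertain) for one date string.

    Assumption: square brackets anywhere in the text mark the entry
    as uncertain; they are removed before parsing.
    """
    uncertain = "[" in text or "]" in text
    cleaned = text.replace("[", "").replace("]", "").strip()
    match = DATE_RE.match(cleaned)
    if match is None:
        # Free-text dates ("Sommer 1889") or OCR damage (year "190")
        # are left for a human to resolve.
        return None, uncertain
    day, month, year = (int(group) for group in match.groups())
    try:
        return date(year, month, day).isoformat(), uncertain
    except ValueError:
        # Impossible calendar dates such as "31. 2. 1899".
        return None, uncertain

print(to_iso("28. 12. 1898"))
print(to_iso("[4. 8. 1889]"))
print(to_iso("13. 9. 190"))
```

Entries for which the function returns `None` are exactly the ones that still need a manual look.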

Jupyter Notebooks

The basic conversion is finished; now it is a good idea to validate your work before going further. A high-level programming language is useful here. I personally like Python very much, and in conjunction with a Jupyter notebook it is even simpler to perform an infrequently used workflow. Notes help to explain what is going on, and debugging is also simple.

On CentOS 8, installing the package looks like this:

pip3 install jupyterlab

After the package has been installed,

jupyter notebook

will launch the Jupyter web service on your local machine.

validateDates.ipynb reshapes the JSON created by OpenRefine. Names spanning multiple lines are combined into one, and the listed dates are validated; a chronological order is assumed.

| step | comment |
| --- | --- |
| 1 | easy installation using pip (pip3 for Python 3.x) |
| 2 | navigate to a path (repo) and launch Jupyter Notebook<br>enter jupyter notebook in your shell |
| 3 | navigate to the notebook, D_dbs/3_py/validateDates.ipynb<br>run each step (click the play button in the toolbar)<br>jumping between steps is also possible and permitted<br>at step In [7] the output file is generated |
| 4 | check the output file with your editor of choice |
| 5 | in VS Code, you may want to reshape the JSON file (optional) |
| 6 | the VS Code extension Beautify generates nice-looking JSON<br>open the context menu and pick Format Document |
| 7 | the result is much more human friendly |

validateDates

OCR text recognition might not be error free. You take a glance at the Transkribus result and all looks good. Nevertheless, a semantic check of the result is useful.

Our data is an index from a book. It starts with names (receiver of the letter), possibly spanning multiple lines, followed by dates, which are sorted from older to newer.

Friedmann, Ernst
13. 9. 1912
Friese, Carl
13. 2. 1907
20. 2. 1907
Fulda, Ludwig
28. 11. 1898
28. 12. 1898
4. 1. 1899
10. 3. 1890
25. 4. 1899
20. 6. 1900
23. 6. 1900
7. 7. 1900
17. 1. 1901
22. 3. 1901
8. 6. 1901
23. 7. 1904
25. 8. 1904
Glümer, Marie
Sommer 1889
4. 8. 1889
5. 8. 1889
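
The structure of the index (name lines followed by date lines) can be turned into records with a few lines of Python. This is a minimal sketch, not the code of validateDates.ipynb: the function name `parse_index` is hypothetical, and it assumes that any line containing a digit is a date line and that every other line belongs to a name. The split of "Friese, Carl" across two lines is made up to show how multi-line names are combined.

```python
def parse_index(lines):
    """Turn index lines into a list of {"recv", "date_text"} records.

    Assumption: a line containing at least one digit is a date line,
    all other non-empty lines are parts of a receiver name.
    """
    records = []
    name_parts = []
    in_dates = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if any(ch.isdigit() for ch in line):
            # Date line: attach it to the name collected so far.
            records.append({"recv": " ".join(name_parts), "date_text": line})
            in_dates = True
        else:
            if in_dates:
                # First name line after a run of dates starts a new receiver.
                name_parts = []
                in_dates = False
            name_parts.append(line)
    return records

sample = [
    "Friedmann, Ernst",
    "13. 9. 1912",
    "Friese,",        # hypothetical line break inside a name
    "Carl",
    "13. 2. 1907",
    "20. 2. 1907",
    "Glümer, Marie",
    "Sommer 1889",
]
for record in parse_index(sample):
    print(record)
```

Note that a free-text date such as "Sommer 1889" is still recognized as a date line because it contains a year.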

With OpenRefine, a JSON file was created:

}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1898-12-28",
    "date_text": "28. 12. 1898",
    "uncertain": false
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1899-01-04",
    "date_text": "4. 1. 1899",
    "uncertain": false
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1890-03-10",
    "date_text": "10. 3. 1890",
    "uncertain": false
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1899-04-25",
    "date_text": "25. 4. 1899",
    "uncertain": false
}, {

Only a focused look at this data, text or refined data, will reveal the recognition error at 10. 3. 1890 (1890-03-10). Seen in context, it is supposed to be 10. 3. 1899; however, we don't want to replace humans, so it's up to a human to decide.

The goal of this notebook is to find suspicious, out-of-order dates. If a newer date is present in the previous line, one of those two is probably wrong.

}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1898-12-28",
    "date_text": "28. 12. 1898",
    "uncertain": false
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1899-01-04",
    "date_text": "4. 1. 1899",
    "uncertain": false,
    "suspicious": "true"
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1890-03-10",
    "date_text": "10. 3. 1890",
    "uncertain": false,
    "suspicious": "true"
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1899-04-25",
    "date_text": "25. 4. 1899",
    "uncertain": false
}, {
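
The check described above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual code: the function name `flag_suspicious` is hypothetical, records are compared only with their direct predecessor for the same receiver, and both entries of an out-of-order pair are flagged, as in the output shown above.

```python
def flag_suspicious(records):
    """Mark both records of a pair whose dates are out of order.

    ISO dates (YYYY-MM-DD, zero padded) sort correctly as plain
    strings, so no date parsing is needed for the comparison.
    """
    for prev, cur in zip(records, records[1:]):
        if prev["recv"] != cur["recv"]:
            continue  # chronology restarts with every receiver
        if not prev.get("date_when") or not cur.get("date_when"):
            continue  # nothing to compare, e.g. "Sommer 1889"
        if prev["date_when"] > cur["date_when"]:
            # Same string value "true" as in the sample output.
            prev["suspicious"] = "true"
            cur["suspicious"] = "true"
    return records

fulda = [
    {"recv": "Fulda, Ludwig", "date_when": "1898-12-28"},
    {"recv": "Fulda, Ludwig", "date_when": "1899-01-04"},
    {"recv": "Fulda, Ludwig", "date_when": "1890-03-10"},
    {"recv": "Fulda, Ludwig", "date_when": "1899-04-25"},
]
for record in flag_suspicious(fulda):
    print(record)
```

The sketch only points at the pair; deciding which of the two dates is wrong remains a human task.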

CMIF

At this point we have successfully shaped the text input into a table, manually corrected errors, generated a JSON representation, and validated the dates. Now we'd like to get a CMIF JSON file.

| step | comment |
| --- | --- |
| 1 | click Open ... |
| 2 | click Browse |
| 3 | find your file |
| 4 | click Next >> |
| 5 | pick a row |
| 6 | click Create Project >> |
| 7 | find the suspicious values now<br>correct them accordingly |
| 8 | when all is good<br>export using Templating ... again |
| 9 | the defaults need to be adjusted |
| 10 | copy the previously researched JSON format in<br>prefix section |
| 11 | row template section |
| 12 | suffix section |
| 13 | check the preview from time to time<br>if you are happy, click Export |
| 14 | remember the filename |
| 15 | locate the exported file |
| 16 | move it to the repo |
| 17 | give it an appropriate name |
| 18 | use your text editor to take a look |
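
How the prefix, row template and suffix sections of a templating export fit together can be sketched in Python. This only illustrates the mechanism: the key `letters` and the two fields are placeholders, not the JSON layout that CMIF Creator expects, and the function names are hypothetical. `json.dumps` plays the role that `jsonize` plays in an OpenRefine row template.

```python
import json

def render_row(row):
    """Row template: one JSON object per table row."""
    # json.dumps quotes and escapes each value safely.
    return (
        "{"
        f'"recv": {json.dumps(row["recv"])}, '
        f'"date_when": {json.dumps(row["date_when"])}'
        "}"
    )

def render_export(rows, prefix, separator, suffix):
    """Prefix once, rows joined by the separator, suffix once."""
    return prefix + separator.join(render_row(row) for row in rows) + suffix

rows = [
    {"recv": "Fulda, Ludwig", "date_when": "1898-12-28"},
    {"recv": "Fulda, Ludwig", "date_when": "1899-01-04"},
]
document = render_export(
    rows,
    prefix='{"letters": [\n',   # placeholder wrapper, not CMIF
    separator=",\n",
    suffix="\n]}",
)
print(document)

# Parsing the result back is a quick check that the three sections
# combine into valid JSON.
parsed = json.loads(document)
```

Loading the exported file with `json.load` in the same way is a cheap test before handing it to the next tool.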

Now we are almost there. Just open CMIF Creator.

| step | comment |
| --- | --- |
| 1 | navigate to CMIF Creator<br>browse for the file |
| 2 | pick the CMIF JSON (or XML) file |
| 3 | inspect the data |
| 4 | complete the GND information |
| 5 | make your choice on the GND information |
| 6 | the GND has been updated |
| 7 | all entries with an identical receiver have been updated as well |
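
The last step, where one GND choice is applied to every entry with the same receiver, behaves roughly like the following Python sketch. It is not CMIF Creator's code: the function name `apply_gnd`, the field name `gnd` and the identifier value are hypothetical placeholders.

```python
def apply_gnd(records, receiver, gnd_id):
    """Set the GND identifier on every record of one receiver.

    Returns the number of records that were updated.
    """
    updated = 0
    for record in records:
        if record["recv"] == receiver:
            record["gnd"] = gnd_id  # "gnd" is a placeholder field name
            updated += 1
    return updated

letters = [
    {"recv": "Fulda, Ludwig", "date_when": "1898-12-28"},
    {"recv": "Fulda, Ludwig", "date_when": "1899-01-04"},
    {"recv": "Friese, Carl", "date_when": "1907-02-13"},
]
# "GND-PLACEHOLDER" stands in for a real identifier chosen by a human.
count = apply_gnd(letters, "Fulda, Ludwig", "GND-PLACEHOLDER")
print(count, letters)
```

The match is an exact string comparison, so spelling variants of a name would have to be unified before this step.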