Workflow Arthur Schnitzler Briefe

This workflow describes the steps performed to convert a letter index into CMIF format. Tools used:

Transkribus
OpenRefine
VS Code
Python (Jupyter Notebooks)

step	comment

	regions defined manually

	line detection automatically
	text recognition

	export document as simple text

	OpenRefine is free to download and use
	it runs as a Java web server (currently Java 8)

	launch OpenRefine from your shell

	the web service is then available on your local machine at port 3333
	Firefox and Chrome both work as browsers

	browse for your input file

	left: the original scan
	right: the Transkribus output in simple text format

	click `Next`

	adjust the import options (I kept the defaults)
	click `Create Project >>`

	your project is now created
	apply operations to shape your data

	if you have shaped this kind of data before, you can apply the same operations again
	provided you saved them
	open `Undo/Redo`

	copy the previous operations to reapply them

	paste the previous operations in and click `Perform Operations`

	check the result
	the data is shaped as before

	OpenRefine allows editing cells

	OCR is not error free

	correct errors manually

	the edition uses `[` `]` to flag uncertain data; human-readable dates can be converted into a more machine-friendly form

	have a human make sense of the data

	here OCR produced the year `190`, which is clearly wrong

	when you are done, export the data as json using templating

	at this step, the defaults are fine
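The date clean-up described in the table above (square brackets flagging uncertain data, human-readable dates turned into a machine-friendly form) was done in OpenRefine. Purely as an illustration, the same idea might look like this in Python. The helper name `normalize_date` is an assumption, and the field names follow the json shown later in this document.

```python
import re
from datetime import date

# Hypothetical helper, not part of the original workflow.
# Matches dates written like "13. 9. 1912" (day, month, four-digit year).
DATE_RE = re.compile(r"^(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})$")


def normalize_date(date_text):
    """Return the original text, an ISO date (or None) and an uncertainty flag."""
    # Square brackets mark uncertain data in the edition.
    uncertain = "[" in date_text or "]" in date_text
    cleaned = date_text.replace("[", "").replace("]", "").strip()

    date_when = None
    match = DATE_RE.match(cleaned)
    if match:
        day, month, year = (int(group) for group in match.groups())
        try:
            date_when = date(year, month, day).isoformat()
        except ValueError:
            # Impossible calendar date: leave it for a human to decide.
            date_when = None

    return {"date_text": date_text, "date_when": date_when, "uncertain": uncertain}
```

With this sketch, `normalize_date("13. 9. 1912")` yields the date `1912-09-13`, while an entry such as `Sommer 1889` or an OCR slip with a three-digit year comes back without a `date_when` and is left for manual review.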

Jupyter Notebooks

The basic conversion is now finished, and it is a good idea to validate the work before going further. A high-level programming language is useful here. I personally like Python very much; combined with Jupyter Notebook, it makes an infrequently used workflow even simpler to carry out. Notes help to explain what is going on, and debugging is simple as well.

On CentOS 8, installing the package looks like this:

pip3 install jupyterlab

After the package has been installed,

jupyter notebook

launches the Jupyter web service on your local machine.

`validateDates.ipynb` reshapes the json created by OpenRefine. Names spanning multiple lines are combined into one, and the listed dates are validated, assuming chronological order.

step	comment



	easy installation using `pip` (`pip3` for Python 3.x)

	navigate to a path (repo) and launch Jupyter Notebook
	enter `jupyter notebook` in your shell

	navigate to the notebook `D_dbs/3_py/validateDates.ipynb`
	run each step (click the play button in the toolbar)
	jumping between steps is also possible
	at step `In [7]` the output file is generated

	check the output file with your editor of choice

	you may want to reformat the json file (optional)

	the `Beautify` extension generates nicely formatted json
	open the context menu and pick `Format Document`

	the result is much more human friendly
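If no editor extension is at hand, Python's standard `json` module can produce a similarly readable file. The following is a small sketch; the function name `beautify_json` and both path parameters are placeholders, not files or code from the repo.

```python
import json


def beautify_json(source_path, target_path):
    """Read a json file and write it back with indentation."""
    with open(source_path, encoding="utf-8") as source:
        data = json.load(source)

    with open(target_path, "w", encoding="utf-8") as target:
        # ensure_ascii=False keeps characters such as "ü" readable
        # instead of writing them as \u escapes.
        json.dump(data, target, indent=4, ensure_ascii=False)
        target.write("\n")
```

Writing to a separate target file keeps the OpenRefine export untouched in case something goes wrong.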

validateDates

OCR text recognition might not be error free. You glance at the Transkribus result and all looks good. Nevertheless, a semantic check of the result is useful.

Our data is an index from a book. Each entry starts with a name (the receiver of the letter), possibly spanning multiple lines, followed by dates sorted from oldest to newest.

Friedmann, Ernst
13. 9. 1912
Friese, Carl
13. 2. 1907
20. 2. 1907
Fulda, Ludwig
28. 11. 1898
28. 12. 1898
4. 1. 1899
10. 3. 1890
25. 4. 1899
20. 6. 1900
23. 6. 1900
7. 7. 1900
17. 1. 1901
22. 3. 1901
8. 6. 1901
23. 7. 1904
25. 8. 1904
Glümer, Marie
Sommer 1889
4. 8. 1889
5. 8. 1889
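The structure of the index (one or more name lines, followed by that receiver's dates) can be turned into records with a few lines of Python. This is a minimal sketch, not the code in `validateDates.ipynb`: `parse_index` is a hypothetical name, and treating every line that contains a four-digit year as a date line is an assumption that happens to fit the sample above.

```python
import re

# Assumption: a line containing a four-digit year is a date line,
# everything else belongs to a receiver name.
YEAR_RE = re.compile(r"\b\d{4}\b")


def parse_index(text):
    """Group index lines into one record per (receiver, date) pair."""
    records = []
    name_parts = []    # name lines collected for the current receiver
    seen_date = False  # True once the current receiver has at least one date

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if YEAR_RE.search(line):
            seen_date = True
            records.append({"recv": " ".join(name_parts), "date_text": line})
        else:
            if seen_date:
                # A name line after dates starts a new receiver.
                name_parts = []
                seen_date = False
            # Consecutive name lines are combined into one name.
            name_parts.append(line)

    return records
```

Entries such as `Sommer 1889` are kept as date lines here, so they stay attached to their receiver even though they are not full dates.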

With OpenRefine, a json file was created:

}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1898-12-28",
    "date_text": "28. 12. 1898",
    "uncertain": false
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1899-01-04",
    "date_text": "4. 1. 1899",
    "uncertain": false
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1890-03-10",
    "date_text": "10. 3. 1890",
    "uncertain": false
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1899-04-25",
    "date_text": "25. 4. 1899",
    "uncertain": false
}, {

Only a close look at this data, raw text or refined, reveals the recognition error at 10. 3. 1890 (1890-03-10). Seen in context, it is supposed to be 10. 3. 1899. However, we don't want to replace humans, so it is up to a human to decide.

The goal of this notebook is to find suspicious, out-of-order dates. If a date is older than the date on the previous line, one of the two is probably wrong, so both records are marked as suspicious:

}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1898-12-28",
    "date_text": "28. 12. 1898",
    "uncertain": false
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1899-01-04",
    "date_text": "4. 1. 1899",
    "uncertain": false,
    "suspicious": "true"
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1890-03-10",
    "date_text": "10. 3. 1890",
    "uncertain": false,
    "suspicious": "true"
}, {
    "recv": "Fulda, Ludwig",
    "date_when": "1899-04-25",
    "date_text": "25. 4. 1899",
    "uncertain": false
}, {
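The check that produces the `suspicious` flag can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: `flag_suspicious` is a hypothetical name, `date_when` is assumed to be an ISO string (so plain string comparison matches chronological order), records are compared only with the preceding dated record of the same receiver, and records without a `date_when` are skipped.

```python
def flag_suspicious(records):
    """Mark both neighbours when a date is older than the one before it."""
    previous = None  # last record that had a usable date

    for record in records:
        current_date = record.get("date_when")
        if not current_date:
            continue  # nothing to compare, leave it to a human

        if (
            previous is not None
            and previous["recv"] == record["recv"]
            and previous["date_when"] > current_date
        ):
            # We cannot tell which of the two is wrong, so flag both.
            previous["suspicious"] = "true"
            record["suspicious"] = "true"

        previous = record

    return records
```

Applied to the four Fulda records above, this flags `1899-01-04` and `1890-03-10` and leaves the other two untouched, which matches the output shown.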

CMIF

At this point we have shaped the text input into a table, manually corrected errors, generated a json representation, and validated the dates. Now we would like to get a CMIF json file.

step	comment



	click `Open ...`

	click `Browse`

	find your file

	click `Next >>`

	pick row

	`Create Project >>`

	find the suspicious values now
	correct them accordingly

	when all is good
	export using `Templating ...` again

	the defaults need to be adjusted

	copy the previously researched json format in
	prefix section

	row template section

	suffix section

	check the preview from time to time
	when you are happy, click `Export`

	remember the filename

	locate the exported file

	move it to the repo

	give it an appropriate name

	use your text editor to take a look
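How the template sections fit together may be easier to see in plain Python. The sketch below only mimics the idea (a prefix, then one rendered template per row joined by a separator, then a suffix). It is not OpenRefine's implementation, and the function names, the `rows` key, and the row layout are illustrative assumptions rather than the CMIF json format.

```python
import json


def render_export(rows, prefix, row_template, separator, suffix):
    """Assemble an export from prefix, per-row output and suffix."""
    rendered_rows = [row_template(row) for row in rows]
    return prefix + separator.join(rendered_rows) + suffix


def row_as_json(row):
    """Hypothetical row template: serialise one row as a json object."""
    return json.dumps(row, ensure_ascii=False)


if __name__ == "__main__":
    rows = [
        {"recv": "Fulda, Ludwig", "date_when": "1898-12-28"},
        {"recv": "Fulda, Ludwig", "date_when": "1899-01-04"},
    ]
    # Placeholder prefix and suffix; the real ones come from the target format.
    print(render_export(rows, '{"rows": [', row_as_json, ", ", "]}"))
```

Because the separator is only placed between rows, the last row is not followed by a comma, which keeps the assembled json valid.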

Now we are almost there. Just open CMIF Creator.

step	comment

	navigate to CMIF Creator
	browse for the file

	pick the CMIF json (or xml) file

	inspect data

	inspect data

	complete GND information

	make your choice on GND information

	GND has been updated

	all entries with an identical receiver have been updated as well
