img2xml

img2xml is a toolchain for linking images of text to transcriptions and annotations. A working demo may be seen at http://docsouth.unc.edu/dusenbery/journal.

Source TIFFs may need to be pre-processed to remove dark margins, using select/fill in a standard image editor such as Photoshop or GIMP. The processed TIFFs are then converted to PNM bitmaps using ImageMagick's convert tool, with a BASH command like:

convert image.tif image.pnm

or in batch with a command like:

for f in *.tif; do convert "$f" "${f%.tif}.pnm"; done
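
Where shell quoting becomes awkward (for example, with file names containing spaces), the same batch conversion could be driven from Python. This is only a minimal sketch and not part of img2xml: the function names `pnm_name` and `convert_all` are hypothetical, and it assumes ImageMagick's `convert` is on the PATH.

```python
import subprocess
from pathlib import Path


def pnm_name(tif_path):
    """Return the .pnm path that sits next to the given .tif file."""
    return Path(tif_path).with_suffix(".pnm")


def convert_all(directory="."):
    """Convert every *.tif in `directory` to PNM with ImageMagick's convert."""
    for tif in sorted(Path(directory).glob("*.tif")):
        # Equivalent to: convert image.tif image.pnm
        subprocess.run(["convert", str(tif), str(pnm_name(tif))], check=True)
```

Calling `convert_all()` with no argument would process the current directory, stopping at the first file that `convert` fails on.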

They are then traced using potrace:

potrace -s -k 0.6 image.pnm

which will produce an SVG file, image.svg. Batch commands may be used, like:

for f in *.pnm; do potrace -s -k 0.6 "$f"; done

If the results are unsatisfactory, the '-k N' parameter may be varied between 0 and 1. It sets the black/white cutoff, and may have to be adjusted for images where the surface is darker, the ink lighter, etc.
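
One way to find a workable cutoff is to trace the same bitmap at several values and compare the results side by side. The sketch below is an illustration under assumptions, not part of the toolchain: the function names and the `_k0.60` output naming scheme are hypothetical, and it relies on potrace's `-o` option to name each output file.

```python
import subprocess
from pathlib import Path


def potrace_command(pnm_path, blacklevel):
    """Build a potrace call that writes one SVG per cutoff value."""
    if not 0 <= blacklevel <= 1:
        raise ValueError("the -k cutoff must lie between 0 and 1")
    pnm = Path(pnm_path)
    # e.g. image.pnm at 0.6 -> image_k0.60.svg (naming scheme is hypothetical)
    out = pnm.with_name(f"{pnm.stem}_k{blacklevel:.2f}.svg")
    return ["potrace", "-s", "-k", str(blacklevel), "-o", str(out), str(pnm)]


def trace_with_cutoffs(pnm_path, cutoffs=(0.5, 0.6, 0.7)):
    """Trace the same bitmap at several cutoffs for side-by-side comparison."""
    for k in cutoffs:
        subprocess.run(potrace_command(pnm_path, k), check=True)
```

The resulting SVGs can then be opened together in an SVG editor to judge which cutoff best separates ink from page.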

Once the SVGs have been created, line detection can be performed using the Python script in the bin/ directory. The script is executed in the following way:

python line_detector.py input.svg output.svg

The result will be an SVG with the paths (representing a tracing of the source image's text) ungrouped, and the lines of text wrapped in svg:rect shapes. These should be checked, and if necessary corrected manually, using an SVG editor like Inkscape.
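
Before opening each file in Inkscape, a quick automated pass can flag pages that probably need attention. The following sketch is an assumption-laden illustration rather than anything shipped in bin/: it assumes the line rects carry plain numeric `x`, `y`, `width` and `height` attributes, and the helper names are hypothetical. It lists the rects top to bottom and reports neighbours whose vertical extents overlap.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def line_rects(svg_text):
    """Return (x, y, width, height) for every svg:rect, sorted top to bottom."""
    root = ET.fromstring(svg_text)
    boxes = []
    for rect in root.iter(SVG_NS + "rect"):
        # Assumes plain numeric attributes; units such as "px" are not handled.
        boxes.append(tuple(float(rect.get(a, 0))
                           for a in ("x", "y", "width", "height")))
    return sorted(boxes, key=lambda b: b[1])


def overlapping_pairs(boxes):
    """Report index pairs of consecutive rects whose vertical extents overlap."""
    return [(i, i + 1) for i in range(len(boxes) - 1)
            if boxes[i][1] + boxes[i][3] > boxes[i + 1][1]]
```

Overlap is not necessarily an error (descenders and ascenders can make adjacent lines touch), so this is a hint about where to look, not a replacement for manual checking.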

Once the SVG tracings, with lines detected, have been QA-ed, they should be scaled and corrected using a pair of stylesheets. /xslt/scale-svg.xsl scales the SVG's coordinate system to match that of the source image. The XSLT takes two parameters, width and height. For convenience's sake, I recommend processing them from the command line, using the Saxon HE processor, which may be obtained from http://saxon.sourceforge.net/. A BASH script like the following will make this process easier:

for f in *.svg
do
  t="${f%.svg}.tif"
  # identify prints "width=W height=H", which Saxon receives as two stylesheet parameters
  java -jar ~/Development/saxonhe9-2-0-2j/saxon9he.jar -s:"$f" -xsl:../scale-svg.xsl -o:../diary_pages_scaled/"$f" $(identify -format "width=%[fx:w] height=%[fx:h]" ../diary_pages_altered/"$t")
done

(Here we assume that the script is being run in the directory containing the SVG files, that the output is going into a parallel directory, 'diary_pages_scaled', and that the TIFFs are in another parallel directory, 'diary_pages_altered'.)
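
The same step can be expressed in Python, which makes the two parts (reading the image size, then passing it to the stylesheet) easier to see. This is a hedged sketch: the function names and default paths are hypothetical, and it assumes ImageMagick's `identify` and a Saxon HE jar are available. It uses identify's `%w` and `%h` escapes rather than the `%[fx:w]` form used in the shell script.

```python
import subprocess


def image_size(tif_path):
    """Ask ImageMagick's identify for the pixel width and height of an image."""
    out = subprocess.run(
        ["identify", "-format", "%w %h\n", tif_path],
        check=True, capture_output=True, text=True,
    ).stdout
    # A multi-page TIFF yields one line per page; use the first.
    width, height = out.splitlines()[0].split()
    return int(width), int(height)


def saxon_command(svg_name, width, height,
                  saxon_jar="saxon9he.jar",
                  stylesheet="../scale-svg.xsl",
                  out_dir="../diary_pages_scaled"):
    """Build the Saxon call that scales one SVG to the image's pixel size."""
    return ["java", "-jar", saxon_jar,
            f"-s:{svg_name}", f"-xsl:{stylesheet}", f"-o:{out_dir}/{svg_name}",
            f"width={width}", f"height={height}"]
```

A caller would combine the two, for example `subprocess.run(saxon_command(name, *image_size(tif)), check=True)`, once per SVG file.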

If any edits made to the SVG in Inkscape involve adding new line rects or deleting superfluous ones, the xslt/order-rects.xsl stylesheet will clean these up. This makes it easier to interface with a JavaScript app that expects sequential @ids like "line1", "line2", etc.
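
To illustrate what sequential ids mean in practice, here is a rough Python equivalent of the renumbering idea. It is not a port of order-rects.xsl, whose exact behaviour is not described here: ordering the rects by their `y` attribute is an assumption, and `renumber_lines` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG)  # serialise without an ns0: prefix


def renumber_lines(svg_text):
    """Give every svg:rect a sequential id (line1, line2, ...), top to bottom."""
    root = ET.fromstring(svg_text)
    rects = sorted(root.iter(f"{{{SVG}}}rect"),
                   key=lambda r: float(r.get("y", 0)))
    for n, rect in enumerate(rects, start=1):
        # Overwrites any existing id, so gaps and duplicates disappear.
        rect.set("id", f"line{n}")
    return ET.tostring(root, encoding="unicode")
```

Note that this only rewrites the ids; the rects keep their original position in the document.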

Finally, the SVG lines can be converted to OpenLayers features using xslt/svg-to-js.xsl.
