It's dejecting to look lower in the global frequency list and see just how much space is wasted by OCR errors versus legitimate-but-rare words. For every hundred tokens of junk, you get one "greyish-olive" or "tetraspilus". Should we explore OCR accuracy estimation methods, so that after the top two million words or so, we can start raising our standards for what counts as a token? We'd be able to dig deeper down the list that way, but I'm not sure if it's a useful endeavor.
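For concreteness, here's a minimal sketch of what "raising our standards" past a rank cutoff could look like. The cutoff value, the function name, and the stricter pattern are all assumptions, not anything implemented here; the idea is just that low-rank tokens have to clear an extra bar that typical OCR junk (stray digits, broken fragments, mixed alphanumerics) fails.

```python
import re

# Hypothetical cutoff: beyond this rank, apply stricter token checks.
STRICT_RANK_CUTOFF = 2_000_000

# Stricter pattern for low-frequency tokens: letters only (any script),
# at least three of them, with optional internal hyphens or apostrophes
# so words like "greyish-olive" still pass.
STRICT_TOKEN_RE = re.compile(r"^[^\W\d_]{3,}(?:[-'][^\W\d_]+)*$")

def keep_token(token, rank):
    """Decide whether a token stays in the frequency list.

    High-frequency tokens pass through unchanged; tokens past the
    cutoff must also match the stricter pattern, which weeds out
    typical OCR junk such as digit-letter mixtures and fragments.
    """
    if rank <= STRICT_RANK_CUTOFF:
        return True
    return bool(STRICT_TOKEN_RE.match(token))
```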
If we had OCR quality estimates, I could see limiting the input texts to those with high quality scores (although there are problems with that).
Topic models or word2vec models might be effective at assigning such scores to documents now.
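As a rough sketch of that idea (not anything implemented here), one cheap proxy is the fraction of a document's tokens that appear in a pretrained word2vec vocabulary: OCR-heavy documents tend to score low because their garbled tokens fall outside the model's vocabulary. The gensim usage is real, but the vectors file path and the threshold are hypothetical placeholders.

```python
from gensim.models import KeyedVectors

# Hypothetical path; any pretrained word2vec-format vectors would do.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def ocr_quality_score(tokens):
    """Crude document-quality proxy: fraction of tokens the pretrained
    model has seen before."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vectors)
    return known / len(tokens)

def passes_quality_filter(tokens, threshold=0.8):
    """Keep only documents above an (arbitrary) quality threshold."""
    return ocr_quality_score(tokens) >= threshold
```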
I think, though, that we could sink a lot of time into refinements like this for fairly low reward. There are clear reasons to keep rare English OCR errors from swamping out common Hebrew words (or whatever), but language detection is already doing that. The low-frequency English words aren't going to produce very good charts anyway.