
OCR quality measures #5

Open
organisciak opened this issue Sep 10, 2016 · 1 comment
Comments

@organisciak
Member

It's dejecting to look lower in the global frequency list and see just how much space is wasted by OCR errors versus legitimate-but-rare words. For every hundred tokens of junk, you get one "greyish-olive" or "tetraspilus". Should we explore OCR accuracy estimation methods, so that after the top two million words or so we can start raising our standards for what counts as a token? We'd be able to dig deeper down the list that way, but I'm not sure it's a useful endeavor.
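As a rough illustration of what "raising our standards" could look like (this is a sketch, not anything implemented in the project; the heuristics and the two-million cutoff are assumptions taken from the comment above): tokens below a rank cutoff could be passed through stricter plausibility checks before being counted.

```python
# Sketch only: heuristic filter for separating OCR junk from
# legitimate-but-rare tokens deep in a frequency list.
# The specific cues (stray non-letters, long repeats, vowel-less
# tokens) are illustrative assumptions, not the project's method.
import re

def looks_like_ocr_junk(token):
    """Flag tokens with patterns that are rare in real English words."""
    if re.search(r"[^a-zA-Z'-]", token):         # stray digits/punctuation
        return True
    if re.search(r"(.)\1{2,}", token):           # 3+ repeated characters in a row
        return True
    if not re.search(r"[aeiouyAEIOUY]", token):  # no vowels at all
        return True
    return False

def filter_rare_tokens(ranked_tokens, rank_cutoff=2_000_000):
    """Keep everything above the cutoff; apply stricter checks below it."""
    for rank, token in enumerate(ranked_tokens):
        if rank < rank_cutoff or not looks_like_ocr_junk(token):
            yield token
```

A legitimate rarity like "greyish-olive" survives the checks, while a vowel-less or digit-laden fragment below the cutoff would be dropped.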

@bmschmidt
Member

If we had OCR quality estimates, I could see limiting the input texts to those with high quality scores. (Although there are problems with that).

Topic models or word2vec models might be effective at assigning such scores to documents now.

I think, though, that we could sink a lot of time into many refinements for fairly low reward. There are clear reasons to keep rare English OCR errors from swamping out common Hebrew words (or whatever), but language filtering is already doing that. The low-frequency English words aren't going to produce very good charts anyway.
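The document-scoring idea above could be sketched very simply: score each text by the share of its tokens found in some trusted vocabulary (a word2vec model's vocab, a dictionary, etc.), then keep only texts that clear a threshold. Everything here (function names, the threshold, the plain-set vocabulary standing in for a model) is an assumption for illustration, not the project's approach.

```python
# Sketch under assumptions: document-level OCR quality as the fraction
# of tokens present in a trusted vocabulary. A real version might use a
# word2vec model's vocabulary; a plain set stands in for it here.
def ocr_quality_score(tokens, vocabulary):
    """Return the fraction of tokens found in the vocabulary (1.0 = all known)."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t.lower() in vocabulary)
    return known / len(tokens)

def high_quality_only(documents, vocabulary, threshold=0.9):
    """Keep only documents whose quality score clears the threshold."""
    return [doc for doc in documents
            if ocr_quality_score(doc, vocabulary) >= threshold]
```

The threshold is the knob that trades corpus size against OCR cleanliness, which is exactly the cost the comment above flags: stricter filtering discards real but messy texts.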
