It's dejecting to look lower in the global frequency list and see just how much space is wasted by OCR errors versus legitimate-but-rare words. For every hundred tokens of junk, you get one "greyish-olive" or "tetraspilus". Should we explore OCR accuracy estimation methods, so that after the top two million words or so, we can start raising our standards for what counts as a token? We'd be able to dig deeper down the list that way, but I'm not sure if it's a useful endeavor.
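For concreteness, here's a minimal sketch of what "raising our standards" past a rank cutoff could look like. The cutoff value, the function name, and the stricter pattern are all assumptions, not anything implemented here; the idea is just that low-rank tokens have to clear an extra bar that typical OCR junk (stray digits, broken fragments, mixed alphanumerics) fails.

```python
import re

# Hypothetical cutoff: beyond this rank, apply stricter token checks.
STRICT_RANK_CUTOFF = 2_000_000

# Stricter pattern for low-frequency tokens: letters only (any script),
# at least three of them, with optional internal hyphens or apostrophes
# so words like "greyish-olive" still pass.
STRICT_TOKEN_RE = re.compile(r"^[^\W\d_]{3,}(?:[-'][^\W\d_]+)*$")

def keep_token(token, rank):
    """Decide whether a token stays in the frequency list.

    High-frequency tokens pass through unchanged; tokens past the
    cutoff must also match the stricter pattern, which weeds out
    typical OCR junk such as digit-letter mixtures and fragments.
    """
    if rank <= STRICT_RANK_CUTOFF:
        return True
    return bool(STRICT_TOKEN_RE.match(token))
```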
If we had OCR quality estimates, I could see limiting the input texts to those with high quality scores (although there are problems with that).
Topic models or word2vec models might be effective at assigning such scores to documents now.
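As a rough sketch of that idea (not anything implemented here), one cheap proxy is the fraction of a document's tokens that appear in a pretrained word2vec vocabulary: OCR-heavy documents tend to score low because their garbled tokens fall outside the model's vocabulary. The gensim usage is real, but the vectors file path and the threshold are hypothetical placeholders.

```python
from gensim.models import KeyedVectors

# Hypothetical path; any pretrained word2vec-format vectors would do.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def ocr_quality_score(tokens):
    """Crude document-quality proxy: fraction of tokens the pretrained
    model has seen before."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vectors)
    return known / len(tokens)

def passes_quality_filter(tokens, threshold=0.8):
    """Keep only documents above an (arbitrary) quality threshold."""
    return ocr_quality_score(tokens) >= threshold
```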
I think, though, that we could sink a lot of time into refinements like this for fairly low reward. There are clear reasons to keep rare English OCR errors from swamping out common Hebrew words (or whatever), but language detection is already doing that. The low-frequency English words aren't going to produce very good charts anyway.