Parse Klamath County Results #74

Open
16 of 17 tasks
dwillis opened this issue Sep 19, 2015 · 8 comments

Comments

@dwillis
Contributor

dwillis commented Sep 19, 2015

Klamath produces image PDFs, so these files will need to be OCRd before any parsing:

@JasonBernert
Contributor

JasonBernert commented Nov 17, 2016

I'm tackling this now. It's the first time I've contributed to Open Elections, but I think I have a good grasp of what I need to do after reading the docs.

I downloaded and OCRd all the PDFs with pypdfocr. I'm extracting the 2002 primary data with Tabula and now cleaning and checking.
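
For anyone following along, here is a rough sketch of the kind of cleaning and checking step that can follow a Tabula export. It is not the script used for this issue; the function names (`clean_votes`, `check_total`) and the character substitutions are assumptions chosen for illustration. It normalizes common OCR digit confusions in a vote-count cell and checks that candidate counts add up to a reported total.

```python
import re

# Assumed mapping of letters that OCR commonly confuses with digits.
# Only safe to apply to cells that are supposed to hold vote counts.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def clean_votes(raw):
    """Return an int from an OCR'd vote cell, or None if it is still not numeric."""
    text = raw.translate(OCR_DIGIT_FIXES)
    text = re.sub(r"[,\s]", "", text)  # drop thousands separators and whitespace
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None  # flag the cell for manual review instead of guessing


def check_total(candidate_votes, reported_total):
    """True only if every count parsed and the counts sum to the reported total."""
    if any(v is None for v in candidate_votes):
        return False
    return sum(candidate_votes) == reported_total
```

Rows where `check_total` returns `False` are the ones worth comparing against the original PDF by eye.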

@dwillis
Contributor Author

dwillis commented Nov 17, 2016

Thanks, @JasonBernert! That sounds like a great approach - let me know if you run into any issues or questions.

@nk9
Contributor

nk9 commented Nov 25, 2016

Hey @JasonBernert, how's it coming? I am having trouble getting pypdfocr working, so I can't OCR things now… Do you have any OCRed files I could work on in the meantime? Also, in case you haven't run across it, I recommend OpenRefine. Once you get the hang of it, it makes cleaning up OCRed data like this much easier.

@JasonBernert
Contributor

Hey @nk9! It's a little messy. pypdfocr is great for batch processing, but doesn't do a great job on PDF images with low DPI. I downloaded the Adobe Acrobat free trial. It's great at OCR, but takes a bit longer. I'm cleaning up 2002, 2004, and 2008 results now. It looks like 2006 will have to be entered in by hand. Want to try Acrobat OCR on Clatsop County results? Or tackle the 2006 results?

@nk9
Contributor

nk9 commented Nov 25, 2016

It turns out I have access to Acrobat myself, so I'm good to go on the OCR front. Unfortunately, it doesn't seem to embed the OCRed text in the document itself… which seems like the primary thing people would want to do. :-( Anyway, I can start on Clatsop County now.

@dwillis
Contributor Author

dwillis commented Nov 26, 2016

@JasonBernert @nk9 if you're running into issues with OCR, I've got Able2Extract which has been pretty good.

dwillis added a commit that referenced this issue Nov 26, 2016
Adding 2008 Klamath County General and Primary Results #74
dwillis closed this as completed Dec 7, 2016
@nk9
Contributor

nk9 commented Mar 11, 2017

Got the last two elections from Klamath, in 2000.

nk9 reopened this Mar 11, 2017
nk9 added a commit that referenced this issue Mar 11, 2017
@dwillis
Contributor Author

dwillis commented May 6, 2017

@nk9 Looks like 2000 primary results are Democratic-only.
