Spider: Illinois Housing Development Authority #857
Comments
I'll take a look at this one. |
@mhartenbower sounds good! |
@pjsier So this one has turned out to be a bit more interesting than I had anticipated. :) Specifically, I'm not sure how to handle the layout of this PDF: https://www.ihda.org/wp-content/uploads/2016/02/2019-Public-Notice-Board-and-Committee-Meeting-Dates.pdf Are there any examples of dealing with the table layout we're seeing here? |
@mhartenbower hmm, not that layout specifically, but one thing I've been considering trying out is to use One thing you could also look into is using the same |
I would be interested in this one if it's free. |
Sounds great! Assigning you now @amlehman |
I was going to say that I'd like to take this issue, but the website seems to be offline. I'm getting |
@sosolidkk this site works for me. Are you still having issues? |
Yeah, I get a |
@sosolidkk are you outside the US? The Illinois state government blocks IPs outside of the country for some reason, we've run into this on a few sites |
Yes, I'm outside the US, that's probably it. |
Hello, wannabe first-time contributor here. I'd like to take this issue under my wing. |
@ianbruyere sure! Assigning you now. Feel free to reach out with any questions |
So as a quick update as to where I am and what I am running into: I've piped the main pdf into pdfminer and tried a couple methods of extraction:
HTML output here: html_output
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer  # also imported the types LTTextBox, LTRect, got same results

for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):  # subbing different pdfminer.layout types here
            print(element.get_text())

The above returns raw_text_output.

Proposed Solutions:
TL;DR I'm hoping for some guidance, given what I've presented and tried, from the more experienced people in the group about what would be viable or possibly something I missed in the html_output or the raw_text output. I'll keep trying stuff and looking into it in the meantime. Thank you ahead of time for your efforts. |
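One thing that might be worth trying on the table layout is grouping text boxes by their vertical position. Below is a minimal sketch, not something tested against the IHDA PDF. It assumes each text box has been reduced to an `(x0, y0, text)` tuple, which with pdfminer could plausibly be built from `element.x0`, `element.y0`, and `element.get_text()` inside the loop above. The function name `group_into_rows`, the `y_tolerance` value, and the sample rows are all made up for illustration, and the approach says nothing about strikethrough formatting.

```python
def group_into_rows(boxes, y_tolerance=3.0):
    """Group (x0, y0, text) tuples into rows, top to bottom, cells left to right.

    A box joins the current row when its y0 is within y_tolerance of the
    y0 of the first box placed in that row.
    """
    rows = []  # each entry: (row_y, [(x0, text), ...])
    # PDF coordinates grow upward, so descending y0 reads top to bottom.
    for x0, y0, text in sorted(boxes, key=lambda b: -b[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text.strip()))
        else:
            rows.append((y0, [(x0, text.strip())]))
    # Order the cells of each row by their left edge.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical sample data, not taken from the real PDF.
sample = [
    (300.0, 700.5, "Example Committee\n"),
    (72.0, 701.0, "January 1\n"),
    (72.0, 680.0, "February 1\n"),
    (300.0, 679.2, "Example Board Meeting\n"),
]
rows = group_into_rows(sample)
for row in rows:
    print(row)
```

Whether this helps depends on how pdfminer splits the table into text boxes. If a whole column comes back as one box, the boxes would need to be split into lines first.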
Thanks for looking into this and doing a comprehensive write-up, @ianbruyere! I think we can rule out options 1 and 3, since 1 seems like a significant amount of work that may not pay off and 3 feels too brittle. In our case, we would rather get less information that we can be more confident in than more information that's likely to be incorrect. The main risk with 2 is that it seems like the strikethrough formatting means cancellation, and we can't reliably parse that from the PDF as-is. I'm also not seeing any way of pulling information from the text or HTML outputs. Overall, it looks like 4 is going to be our best bet. I'm seeing a few agendas that aren't cancellations, including for September 18, so we should still be able to get some information from those documents. Let me know if you have any other questions, and thanks again! |
URL: https://www.ihda.org/about-ihda/public-meetings-and-notices/
Spider Name: il_housing_development
Agency Name: Illinois Housing Development Authority
See the contribution guide for information on how to get started