Spider: Illinois Housing Development Authority #857

Open
pjsier opened this issue Oct 3, 2019 · 15 comments


@pjsier
Collaborator

pjsier commented Oct 3, 2019

URL: https://www.ihda.org/about-ihda/public-meetings-and-notices/
Spider Name: il_housing_development
Agency Name: Illinois Housing Development Authority

See the contribution guide for information on how to get started

@mhartenbower

I'll take a look at this one.

@pjsier
Collaborator Author

pjsier commented Oct 4, 2019

@mhartenbower sounds good!

@mhartenbower

@pjsier So this one has turned out to be a bit more interesting than I had anticipated. :)

Specifically, I'm not sure how to handle the layout of this PDF: https://www.ihda.org/wp-content/uploads/2016/02/2019-Public-Notice-Board-and-Committee-Meeting-Dates.pdf

Are there any examples of dealing with the table layout we're seeing here?

@pjsier
Collaborator Author

pjsier commented Oct 6, 2019

@mhartenbower hmm, not that layout specifically, but one thing I've been considering trying out is to use pdfminer.six instead of PyPDF2. We have an example in a different repo here:

https://github.com/City-Bureau/city-scrapers-akr/blob/589fa96fa0d659f3bb553ee791f1663b92ca87d6/city_scrapers/spiders/summ_planning.py#L47-L62

One thing you could also look into is using the same pdfminer function to write to an HTML string instead, which might make the parsing a little easier. The output won't be clean, and the styles will all be inline, but you might be able to distinguish columns with that. If that doesn't look doable, we can fall back to basing the meetings on the dates of the minutes and agendas posted on the site, with default times.
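
A rough sketch of the HTML idea, assuming the positioned output wraps each text box in a `div` whose inline style carries a `left:NNpx` offset (worth checking against the real output before relying on it). The sample markup and the `group_by_left` helper are made up for illustration and are not taken from the IHDA PDF:

```python
import re
from collections import defaultdict

# Hypothetical markup resembling absolutely positioned HTML output.
SAMPLE_HTML = """
<div style="position:absolute; left:72px; top:100px;"><span>Board</span></div>
<div style="position:absolute; left:300px; top:100px;"><span>Finance</span></div>
<div style="position:absolute; left:74px; top:130px;"><span>January 18</span></div>
<div style="position:absolute; left:301px; top:130px;"><span>January 17</span></div>
"""

# Capture the left offset and the inner markup of each positioned div.
DIV_RE = re.compile(r'<div style="[^"]*?left:(\d+)px;[^"]*"[^>]*>(.*?)</div>', re.S)
TAG_RE = re.compile(r"<[^>]+>")


def group_by_left(html, tolerance=10):
    """Group text snippets by approximate left offset (in px)."""
    columns = defaultdict(list)
    for left_str, inner in DIV_RE.findall(html):
        left = int(left_str)
        text = TAG_RE.sub("", inner).strip()
        if not text:
            continue
        # Reuse an existing column if this offset is within the tolerance.
        key = next((k for k in columns if abs(k - left) <= tolerance), left)
        columns[key].append(text)
    return dict(columns)


columns = group_by_left(SAMPLE_HTML)
```

If the real offsets cluster this cleanly, the first entry in each group could serve as the meeting type for the dates below it. If they don't, this approach probably isn't worth pursuing.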

@amlehman

amlehman commented Jun 6, 2020

I would be interested in this one if it's free.

@pjsier
Collaborator Author

pjsier commented Jun 8, 2020

Sounds great! Assigning you now @amlehman

@sosolidkk
Contributor

I was going to say that I'd like to take this issue, but the website seems to be offline. I'm getting "www.ihda.org's server IP address could not be found." Could someone check for me?

@alderocas
Contributor

@sosolidkk this site works for me. Are you still having issues?

@sosolidkk
Contributor

Yeah, I get a DNS_PROBE_FINISHED_NXDOMAIN message when I try to access it.

@pjsier
Collaborator Author

pjsier commented Sep 11, 2020

@sosolidkk are you outside the US? The Illinois state government blocks IPs outside of the country for some reason, we've run into this on a few sites

@sosolidkk
Contributor

> @sosolidkk are you outside the US? The Illinois state government blocks IPs outside of the country for some reason, we've run into this on a few sites

Yes, I'm outside the US, that's probably it.

@ianbruyere

Hello,

Wannabe first-time contributor here. I'd like to take this issue under my wing.
I've forked the repo, read the contribution guidelines, and sent out a Slack form request just in case.
I've also read the notes above about the potential workarounds for the PDF parsing this one will entail.
I'm doing this via Hacktoberfest, but if all goes well I'd like to keep contributing to this project in the future.

@pjsier
Collaborator Author

pjsier commented Oct 1, 2020

@ianbruyere sure! Assigning you now. Feel free to reach out with any questions

@ianbruyere

ianbruyere commented Oct 3, 2020

A quick update on where I am and what I'm running into:

I've piped the main PDF into pdfminer and tried a couple of extraction methods:

  1. HTML Output
    This method produces output with essentially no usable structure. I might be missing something, but it's mostly spans with absolute positioning, with nothing to link the columns to their corresponding meeting or time. The code I used is below. Note: I downloaded the PDF in question and renamed it to test.pdf. The PDF can be found here

    from io import StringIO
    from pdfminer.high_level import extract_text_to_fp
    
    output_string = StringIO()
    with open('test.pdf', 'rb') as pdf:
        # Experimented with laparams (different margins, boxes_flow) to no avail.
        # I haven't tried every permutation, but it doesn't seem to alter the result:
        # the default below gives the same output as my customizations.
        extract_text_to_fp(pdf, output_string, laparams=None, output_type='html', codec=None)
    
    print(output_string.getvalue())

HTML output here: html_output

  2. Plain-text extraction
    I also tried skipping HTML and pulling the text directly from the layout boxes each date is found in. The result is essentially the same: I can get all the dates, but they are separated from their time and meeting type. The ordering is consistent, so I could correlate each column of dates with the proper meeting type by hard-coding it. This is the code I have been using:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer  # also tried LTTextBox and LTRect, with the same results

    for page_layout in extract_pages("test.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):  # tried other pdfminer.layout types here
                print(element.get_text())

The above returns raw_text_output.
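
One thing that might be worth trying before giving up on the text output: each layout element also exposes a `bbox` of `(x0, y0, x1, y1)`, so the boxes could in principle be grouped into columns by their left edge. Below is a minimal sketch under that assumption. The `group_into_columns` and `boxes_from_pdf` helpers and the sample coordinates are hypothetical, and it won't help if pdfminer merges a whole column of dates into a single box or if the columns don't align cleanly:

```python
from collections import defaultdict


def group_into_columns(boxes, tolerance=15.0):
    """Group (x0, y0, x1, y1, text) tuples into columns by left edge.

    Returns columns ordered left to right, each a list of strings ordered
    top to bottom. PDF y coordinates grow upward, so a larger y1 is higher
    on the page.
    """
    columns = defaultdict(list)
    for x0, y0, x1, y1, text in boxes:
        # Reuse an existing column if this left edge is within the tolerance.
        key = next((k for k in columns if abs(k - x0) <= tolerance), x0)
        columns[key].append((y1, text.strip()))
    return [
        [text for _, text in sorted(items, key=lambda item: -item[0])]
        for _, items in sorted(columns.items())
    ]


def boxes_from_pdf(path):
    """Collect positioned text boxes from a PDF (requires pdfminer.six)."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    return [
        (*element.bbox, element.get_text())
        for page_layout in extract_pages(path)
        for element in page_layout
        if isinstance(element, LTTextContainer)
    ]


# Made-up coordinates standing in for real pdfminer output.
sample = [
    (72.0, 700.0, 150.0, 712.0, "Board\n"),
    (300.0, 700.0, 380.0, 712.0, "Finance\n"),
    (301.5, 670.0, 380.0, 682.0, "January 17\n"),
    (73.0, 670.0, 150.0, 682.0, "January 18\n"),
]
columns = group_into_columns(sample)
```

With the real file this would be `group_into_columns(boxes_from_pdf("test.pdf"))`. For a multi-page PDF the boxes would need to be grouped per page, since y coordinates restart on each page.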

Proposed Solutions:

  1. I've been reading the pdfminer source code and documentation (there is not much), and I think it would be useful for it to group text purely by vertical boxes, ignoring any horizontal interruptions, so the user controls how the data is grouped. This might already be available, but either a) I can't get it to work or b) it isn't actually there, even though the source code does mention 'vertical boxes'. I would have to talk with the maintainers before I could follow up on that. It would be a lot of work, not that I mind, but it might also be unwelcome or uninformed work, i.e. walking blindly forward and hoping for the best.

  2. If you only wanted the standard Board meetings (not the Finance, Audit, and Asset Management submeetings), those are the superset of ALL the dates on the page: no submeeting falls on a date without a regular Board meeting. I like this solution the least, since it is naive and doesn't capture all the info.

  3. As I said somewhere far above in this wall of text, I think it would be possible to simply correlate meetings and times based on the order in which pdfminer always emits the values. Heaven forbid anything ever changes, though (the format of the PDF or how pdfminer orders its output). Another terrible solution.

  4. As @pjsier pointed out above, it might be possible to get the dates from the minutes, but it seems those are only posted AFTER the meeting has taken place. Also, much of the information in the meeting agendas seems to be there because of cancellations due to COVID-19, and might not be posted at any reasonable time before the meeting actually takes place. I don't know for sure, and it's hard to say.

TL;DR: I'm hoping for some guidance from the more experienced people in the group, given what I've presented and tried, on what would be viable, or on anything I missed in the html_output or the raw_text_output.
I want to go with the solution that fits your goals. Sorry for the wall of text, but the previous two people who worked on this didn't share much about what they tried or the roadblocks they ran into. This issue is about a year old, and I'd like to have a hand in closing it.

I'll keep trying stuff and looking into it in the meantime.

Thank you ahead of time for your efforts.

@pjsier
Collaborator Author

pjsier commented Oct 7, 2020

Thanks for looking into this and doing a comprehensive write-up, @ianbruyere! I think we can rule out options 1 and 3, since 1 seems like a significant amount of work that may not pay off and 3 feels too brittle.

In our case, we would rather get less information that we can be more confident in than more information that's likely to be incorrect. The main risk with 2 is that it seems like the strikethrough formatting means cancellation, and we can't reliably parse that from the PDF as-is. I'm also not seeing any way of pulling information from the text or HTML outputs.

Overall, it looks like 4 is going to be our best bet. I'm seeing a few agendas that aren't cancellations, including for September 18, so we should still be able to get some information from those documents. Let me know if you have any other questions, and thanks again!
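
For option 4, the date parsing could look roughly like the sketch below. It assumes the agenda links carry a date such as "September 18, 2020" in their text, which should be checked against the actual page. The link text, the helper names, and the default start time are placeholders, not values confirmed from the IHDA site:

```python
import re
from datetime import datetime, time

# Matches dates written like "September 18, 2020" (assumed link text format).
DATE_RE = re.compile(r"([A-Z][a-z]+ \d{1,2}, \d{4})")

# Placeholder default start time; the real default would need to be confirmed.
DEFAULT_START = time(11, 0)


def parse_agenda_start(link_text):
    """Return a datetime built from the date in link_text, or None."""
    match = DATE_RE.search(link_text)
    if not match:
        return None
    try:
        date = datetime.strptime(match.group(1), "%B %d, %Y").date()
    except ValueError:
        # Something that looked like a date but wasn't, e.g. "Agenda 12, 2020".
        return None
    return datetime.combine(date, DEFAULT_START)


def is_cancelled(link_text):
    """Rough check for cancellation notices based on the link text."""
    return "cancel" in link_text.lower()


start = parse_agenda_start("September 18, 2020 Board Meeting Agenda")
```

Links that fail to parse return `None` and can be skipped rather than guessed at, which fits the preference for less information we can be confident in.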
