Scraping Witness Lists #39
Comments
An update for people interested in this issue: I didn't have much success trying to parse the witness lists with regexes (see my notebook and the output of that process). When you download the bills, be sure to get them from the FTP directory (ftp://ftp.legis.state.tx.us/bills/85R/witlistbill/). I believe the House and Senate lists follow slightly different formats, so you might want to write separate parsers. Even within the same chamber of the legislature, there will be inconsistencies in formatting. The listings are not always consistent about what kind of information goes in the parentheses, and some fields are omitted on some lines. The data is also messy because some witnesses filled out the sign-up form wrong, and their mistakes end up in the documents. There are Python libraries that do named entity recognition, so I suggest looking at those. |
Updated. The notebook should be easier to read and I think the output files are about 95% correct. The notebook still just uses regexes, which will never be perfect where names are involved. |
RE: Witness List issue. I was able to download the files from the SOT web site and run @mscarey’s script. That got me a HouseWitness.csv with 722 entries. I ran a script against that to match the names in the FullText field against the FirstName and LastName fields. That gave me a list of 108 mismatches. I ran some of those against the regex /\w*.\w+,.+(/ and seemed to get good results. I'd like to try it in the script, but I'm not sure how to insert it. |
@jpolache Are you saying you're having trouble adding new lines to the Jupyter notebook and rerunning it? I'm not clear on where you got stuck, but maybe you want the Jupyter documentation? |
Not sure where in the code to insert my regex. Also, not sure how your regexes work. I usually figure out other people's code by stepping through it in a debugger (not that your code has bugs :) ). I'm not good enough with Python's pdb to figure it out. |
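For anyone who wants to reproduce the mismatch check described above, here is a minimal sketch. It is not the script that was actually run. It assumes the CSV has FullText, FirstName, and LastName columns as mentioned in the comment; the sample rows and the `find_mismatches` helper are hypothetical, and the comparison is a plain case-insensitive substring test.

```python
import csv
import io

# Hypothetical sample rows standing in for HouseWitness.csv.
# Only the three column names come from the thread; the data is made up.
SAMPLE_CSV = """FullText,FirstName,LastName
"Smith, John (Self)",John,Smith
"Garcia, Maria (Texas Example Assn.)",Maria,Garcia
"Lee Pat (Self)",,
"""


def find_mismatches(rows):
    """Return rows whose FirstName or LastName is empty or absent from FullText."""
    mismatches = []
    for row in rows:
        full_text = row["FullText"].lower()
        first = row["FirstName"].strip().lower()
        last = row["LastName"].strip().lower()
        # An empty name field counts as a mismatch: the parser found nothing.
        if not first or not last or first not in full_text or last not in full_text:
            mismatches.append(row)
    return mismatches


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
mismatches = find_mismatches(rows)
for row in mismatches:
    print(row["FullText"])
```

A substring test is deliberately loose: it flags rows where a name was not extracted at all, but it will not catch swapped first and last names.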
Fair enough, the code is hard to understand. I just committed a new version of the notebook that hopefully will be a little clearer. I added docstrings for each function. I also changed a few of the functions to use strings rather than lists as inputs and outputs, so I hope that doesn't break any code you've already written. |
Thanks Matt. I'm working through the new code version now. |
@mscarey Getting an error :(
import csv
def export(dir, witList):
    houseDir = 'C:/Users/user/Documents/code3/venv/witness/HouseWitness1205.csv'
NameError Traceback (most recent call last)
NameError: name 'houseRows' is not defined |
@jpolache is it possible you only ran the code in that block, but you didn't run the code earlier in the notebook where a list is assigned to the variable "houseRows"? |
This time, I went to the menu and selected Kernel->Restart and Run All. Here is the error:
error Traceback (most recent call last)
in extractRows(folderName)
in mergelines(wit)
in addName(line)
c:\users\user\documents\code3\venv\lib\re.py in compile(pattern, flags)
c:\users\user\documents\code3\venv\lib\re.py in _compile(pattern, flags)
c:\users\user\documents\code3\venv\lib\sre_compile.py in compile(p, flags)
c:\users\user\documents\code3\venv\lib\sre_parse.py in parse(str, flags, pattern)
c:\users\user\documents\code3\venv\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
c:\users\user\documents\code3\venv\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
error: missing ), unterminated subpattern at position 10 |
Fixed the "error: missing ), unterminated subpattern at position 10" issue by using an escape "\(" on the open paren in my regex pattern. But the regex pattern is not compatible with the rest of the script, apparently because I am not using grouping (?:). So far I am not able to duplicate the regex I came up with using grouping and am hesitant to try and rewrite everything else to resolve the error. Research continues. |
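The escaping fix described above can be reproduced outside the notebook. This is a small sketch, not code from the notebook: the `try_compile` helper and the sample line are hypothetical, and the pattern is the one posted earlier in the thread without its `/` delimiters.

```python
import re

# Pattern as first posted: the trailing "(" opens a group that is never closed.
UNESCAPED = r"\w*.\w+,.+("
# Same pattern with the paren escaped so it matches a literal "(".
ESCAPED = r"\w*.\w+,.+\("


def try_compile(pattern):
    """Return (compiled_pattern, None) on success or (None, error) on failure."""
    try:
        return re.compile(pattern), None
    except re.error as err:
        return None, err


_, error = try_compile(UNESCAPED)
print("unescaped pattern fails at position", error.pos)

compiled, _ = try_compile(ESCAPED)
sample_line = "Smith, John (Self)"  # hypothetical witness-list line
match = compiled.search(sample_line)
print(repr(match.group()))
```

Position 10 in the error message is the index of the unescaped `(` within the pattern string, which is how the traceback points at the paren.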
@jpolache I can't quite tell which regular expression is triggering the error for you. Maybe the issue is that I was running a different version of Python or a library. I added lines to the notebook (https://github.com/open-austin/influence-texas/blob/master/notebooks/witness-lists.ipynb) that show what I was running. Here's what they show:
Python 3.6.1 :: Continuum Analytics, Inc.
bs4 4.6.0
pandas 0.20.3 |
Thanks Matt. I'll check it out. |
Matt,
Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)]
beautifulsoup4==4.6.0
bs4==0.0.1
pandas==0.21.0
freeze.txt <https://github.com/open-austin/influence-texas/files/1561598/freeze.txt> |
I will be at the meetup on Monday evening. Will Matt or John be there as well? If so, we should be able to easily resolve this.
-Michael |
@lazarus1331 Yeah, I'll be there. There isn't usually a bunch of time to do project work at the library, but I'll bring my computer. |
Sorry I was unable to attend the meetup. I now have the code running in my
environment and have done some analysis of the output.
Let me know if you would like to discuss next steps.
Jonathan
|
Witness lists are the lists of citizens and organizations who represent themselves or others by speaking for or against a bill during its hearing by a legislative committee. Witness lists could be scraped from the Texas Legislature Online and added to INFLUENCE TX. NOTE: Witnesses should be categorized as stakeholders and not listed as pro or con speakers, as the version of the bill on which they commented may have changed via various amendments before being voted on or passed into law.