Scraping Witness Lists #39

Open
AmyMcCullough opened this issue Oct 7, 2017 · 17 comments

Comments

@AmyMcCullough
Collaborator

AmyMcCullough commented Oct 7, 2017

Witness lists are the lists of citizens and organizations who represent themselves or others by speaking for or against a bill during its hearing by a legislative committee. Witness lists could be scraped from the Texas Legislature Online and added to INFLUENCE TX. NOTE: Witnesses should be categorized as stakeholders and not listed as pro or con speakers, because the version of the bill on which they commented may have been changed by amendments before being voted on or passed into law.

@mscarey
Collaborator

mscarey commented Oct 26, 2017

An update for people interested in this issue: I didn't have much success trying to parse the witness lists with regexes (see my notebook and the output of that process). When you download the bills, be sure to get them from the ftp directory (ftp://ftp.legis.state.tx.us/bills/85R/witlistbill/). I believe the house and senate lists follow slightly different formats, so you might want to write separate parsers. Even within the same chamber of the legislature, there will be inconsistencies in formatting. I think the listings are not always consistent about what kind of information goes in the parentheses, and some fields are omitted on some lines. The data is also simply messy: some witnesses filled out the sign-up form wrong, and their mistakes end up in the documents. There are Python libraries that do named entity recognition, so I suggest looking at those.

@mscarey
Collaborator

mscarey commented Nov 1, 2017

Updated. The notebook should be easier to read and I think the output files are about 95% correct. The notebook still just uses regexes, which will never be perfect where names are involved.

@jpolache

RE: Witness List issue. I was able to download the files from the SOT web site and run @mscarey’s script. That got me a HouseWitness.csv with 722 entries. I ran a script against that file to match the names in the FullText field against the FirstName and LastName fields, which gave me a list of 108 mismatches. I ran some of those against the regex /\w*.\w+,.+(/ and seemed to get good results. I'd like to try it in the script, but I'm not sure how to insert it.
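The mismatch check described above could look roughly like the sketch below. It is not the script that was actually run: the function name `find_name_mismatches`, the simple substring comparison, and the sample rows are all assumptions made for illustration. Only the column names FullText, FirstName, and LastName come from the thread.

```python
import csv
import io

def find_name_mismatches(rows):
    """Return rows whose FirstName or LastName does not appear in FullText."""
    mismatches = []
    for row in rows:
        full = row['FullText'].lower()
        first = row['FirstName'].strip().lower()
        last = row['LastName'].strip().lower()
        # An empty parsed name counts as a mismatch too.
        if not first or not last or first not in full or last not in full:
            mismatches.append(row)
    return mismatches

# Hypothetical sample data standing in for HouseWitness.csv (not real witnesses).
sample_csv = io.StringIO(
    'FullText,LastName,FirstName\n'
    '"Doe, Jane (Self)",Doe,Jane\n'
    '"Roe, Richard (Self)",Roe,\n'
    '"(Texas Widget Association), Austin, TX",Roe,Richard\n'
)

mismatches = find_name_mismatches(csv.DictReader(sample_csv))
for row in mismatches:
    print(row['FullText'])
```

A substring test is deliberately crude. It flags empty or missing names and names that never occur in the line, but it will not notice a first and last name that were swapped.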

@mscarey
Collaborator

mscarey commented Nov 30, 2017

@jpolache Are you saying you're having trouble adding new lines to the Jupyter notebook and rerunning it? I'm not clear on where you got stuck, but maybe you want the Jupyter documentation?

@jpolache

jpolache commented Dec 1, 2017

Not sure where in the code to insert my regex. Also, not sure how your regexes work. I usually figure out other people's code by stepping through it in a debugger (not that your code has bugs :) ). I'm not good enough with Python's pdb to figure it out.

@mscarey
Collaborator

mscarey commented Dec 5, 2017

Fair enough, the code is hard to understand. I just committed a new version of the notebook that hopefully will be a little clearer. I added docstrings for each function. I also changed a few of the functions to use strings rather than lists as inputs and outputs, so I hope that doesn't break any code you've already written.

@jpolache

jpolache commented Dec 6, 2017

Thanks Matt. I'm working through the new code version now.

@jpolache

jpolache commented Dec 6, 2017

@mscarey Getting an error :(

import csv

def export(dir, witList):
    with open(dir,'w') as f:
        writer = csv.writer(f)
        writer.writerow(['FullText', 'Position', 'Bill', 'LastName', 'FirstName', 'Role', 'Organization', 'City', 'State'])
        writer.writerows(witList) # better just to include the FullText field.
    return None

houseDir = 'C:/Users/user/Documents/code3/venv/witness/HouseWitness1205.csv'
#houseDir = '../data/witness-lists/HouseWitness.csv'
export(houseDir, houseRows)


NameError Traceback (most recent call last)
in ()
10 houseDir = 'C:/Users/user/Documents/code3/venv/witness/HouseWitness1205.csv'
11 #houseDir = '../data/witness-lists/HouseWitness.csv'
---> 12 export(houseDir, houseRows)

NameError: name 'houseRows' is not defined

@mscarey
Collaborator

mscarey commented Dec 6, 2017

@jpolache is it possible you only ran the code in that block, but you didn't run the code earlier in the notebook where a list is assigned to the variable "houseRows"?
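For anyone following along, here is a minimal sketch of the situation described above: `export` can only run after the cell that builds `houseRows` has run. The sample row and the temporary output path are made up for illustration. The `newline=''` argument and the renamed parameters are additions that follow the csv module's documented recommendation and are not taken from the notebook.

```python
import csv
import os
import tempfile

HEADER = ['FullText', 'Position', 'Bill', 'LastName', 'FirstName',
          'Role', 'Organization', 'City', 'State']

def export(path, wit_list):
    """Write the header plus one row per witness to a CSV file."""
    # newline='' keeps the csv module from writing blank lines on Windows.
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(wit_list)

# In the notebook this list is built by an earlier cell. If that cell has not
# been run, export(houseDir, houseRows) raises NameError.
# Hypothetical sample row, not real witness data.
houseRows = [['Doe, Jane (Self)', 'For', 'HB 400', 'Doe', 'Jane',
              '', 'Self', 'Austin', 'TX']]

# Hypothetical output location; the thread uses a local Windows path instead.
houseDir = os.path.join(tempfile.gettempdir(), 'HouseWitness_example.csv')
export(houseDir, houseRows)
```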

@jpolache

jpolache commented Dec 6, 2017

This time, I went to the menu and selected Kernel->Restart and Run All. Here is the error:


error Traceback (most recent call last)
in ()
22 folderName = 'C:/Users/user/Documents/code3/venv/witness/house_bills/HB00400_HB00499/'
23 #folderName = 'bills/85R/witlistbill/html/house_bills/'
---> 24 houseWit = extractRows(folderName)

in extractRows(folderName)
15 wit = HBWitness(source)
16 # trying to rejoin entries split across lines
---> 17 new = mergelines(wit)
18 new = mergelines(new) # Will doing it twice catch 3-line entries?
19 houseWit.extend(new)

in mergelines(wit)
28 changed += 1
29
---> 30 elif row[0].count(')') != row[0].count('(') and wit[lineIndex + 1][0].count(')') != wit[lineIndex + 1][0].count('(') and addName(row[0]) != [None, None] and not re.search(endWithState, row[0]):
31 newList.append([row[0] + " " + wit[lineIndex + 1][0], row[1], row[2]])
32 badList.append(wit[lineIndex + 1][0:3])

in addName(line)
17 for f in flags:
18 for r in regexes:
---> 19 nameRe = re.compile(r, f)
20 match = re.search(nameRe, line)
21 if match:

c:\users\user\documents\code3\venv\lib\re.py in compile(pattern, flags)
231 def compile(pattern, flags=0):
232 "Compile a regular expression pattern, returning a pattern object."
--> 233 return _compile(pattern, flags)
234
235 def purge():

c:\users\user\documents\code3\venv\lib\re.py in _compile(pattern, flags)
299 if not sre_compile.isstring(pattern):
300 raise TypeError("first argument must be string or compiled pattern")
--> 301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
303 if len(_cache) >= _MAXCACHE:

c:\users\user\documents\code3\venv\lib\sre_compile.py in compile(p, flags)
560 if isstring(p):
561 pattern = p
--> 562 p = sre_parse.parse(p, flags)
563 else:
564 pattern = None

c:\users\user\documents\code3\venv\lib\sre_parse.py in parse(str, flags, pattern)
853
854 try:
--> 855 p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
856 except Verbose:
857 # the VERBOSE flag was switched on inside the pattern. to be

c:\users\user\documents\code3\venv\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
414 while True:
415 itemsappend(_parse(source, state, verbose, nested + 1,
--> 416 not nested and not items))
417 if not sourcematch("|"):
418 break

c:\users\user\documents\code3\venv\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
766 if not source.match(")"):
767 raise source.error("missing ), unterminated subpattern",
--> 768 source.tell() - start)
769 if group is not None:
770 state.closegroup(group, p)

error: missing ), unterminated subpattern at position 10

@jpolache

jpolache commented Dec 9, 2017

Fixed the "error: missing ), unterminated subpattern at position 10" issue by escaping the open paren in my regex pattern with "\(". But the regex pattern is still not compatible with the rest of the script, apparently because I am not using grouping (?:). So far I have not been able to reproduce my regex using grouping, and I am hesitant to rewrite everything else to resolve the error. Research continues.
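The compile error and the escape fix can be reproduced outside the notebook with a few lines. This is a sketch under stated assumptions: the pattern is the one posted earlier in the thread with the surrounding slashes dropped, the sample line is made up, and the grouped variant at the end is a hypothetical illustration of capture groups, not a pattern the notebook uses.

```python
import re

raw_pattern = r'\w*.\w+,.+('       # pattern from the thread, unescaped paren
escaped_pattern = r'\w*.\w+,.+\('  # same pattern with the paren escaped

def try_compile(pattern):
    """Return the compiled pattern, or the re.error raised while compiling."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        return exc

bad = try_compile(raw_pattern)       # an re.error; bad.pos is the offset of '('
good = try_compile(escaped_pattern)  # a compiled pattern

# Hypothetical variant with capture groups for the last and first name.
grouped = re.compile(r'(\w*.\w+),\s*(.+?)\s*\(')

sample = 'Smith-Jones, Ann (Self)'   # made-up witness line
match = grouped.search(sample)
print(type(bad).__name__, bad.pos)
print(good.search(sample).group(0))
print(match.group(1), '|', match.group(2))
```

The unescaped `(` sits at offset 10 of the pattern, which is consistent with the position reported in the traceback.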

@mscarey
Collaborator

mscarey commented Dec 13, 2017

@jpolache I can't quite tell which regular expression is triggering the error for you. Maybe the issue is that I was running a different version of Python or a library. I added lines to the notebook that show what I was running. Here's what they show:

Python 3.6.1 :: Continuum Analytics, Inc.
bs4 4.6.0
pandas 0.20.3
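A version report like the one above can be produced with the standard library alone. This is only a sketch and not the lines that were added to the notebook: the helper names are hypothetical, and it assumes Python 3.8 or later, where `importlib.metadata` is available (the interpreters mentioned in this thread are older).

```python
import sys
from importlib import metadata

def installed_version(package):
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return 'not installed'

def version_report(packages):
    """Build one line for the interpreter and one line per package."""
    lines = ['Python ' + sys.version.split()[0]]
    for name in packages:
        lines.append(f'{name} {installed_version(name)}')
    return lines

# Distribution names as they appear on PyPI.
report = version_report(['beautifulsoup4', 'pandas'])
print('\n'.join(report))
```

Comparing such reports between two machines is a quick way to rule version differences in or out before digging into the code itself.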

@jpolache

jpolache commented Dec 13, 2017 via email

@jpolache

Matt,

Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)]

beautifulsoup4==4.6.0
bs4==0.0.1
pandas==0.21.0

freeze.txt

@lazarus1331
Collaborator

lazarus1331 commented Dec 17, 2017 via email

@mscarey
Collaborator

mscarey commented Dec 18, 2017

@lazarus1331 Yeah, I'll be there. There isn't usually a bunch of time to do project work at the library, but I'll bring my computer.

@jpolache

jpolache commented Dec 20, 2017 via email
