Scraping Witness Lists #39

AmyMcCullough · 2017-10-07T05:33:22Z

Witness lists are the lists of citizens and organizations who represent themselves or others by speaking for or against a bill during its hearing by a legislative committee. Witness lists could be scraped from Texas Legislature Online and added to INFLUENCE TX. NOTE: Witnesses should be categorized as stakeholders and not listed as pro or con speakers, because the version of the bill on which they commented may have been amended before being voted on or passed into law.

mscarey · 2017-10-26T01:11:28Z

An update for people interested in this issue: I didn't have much success trying to parse the witness lists with regexes (see my notebook and the output of that process). When you download the bills, be sure to get them from the ftp directory (ftp://ftp.legis.state.tx.us/bills/85R/witlistbill/). I believe the house and senate lists follow slightly different formats, so you might want to write separate parsers. Even within the same chamber of the legislature, there will be inconsistencies in formatting. The listings are not always consistent about what kind of information goes in the parentheses, and some fields are omitted on some lines. The data is simply messy: some witnesses filled out the sign-up form incorrectly, and their mistakes end up in the documents. There are Python libraries that do named entity recognition, so I suggest looking at those.
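To make the "some fields are omitted on some lines" problem concrete, here is a minimal sketch of a tolerant line parser. The line layout (`Last, First (Organization), City, ST`) and the sample lines are assumptions for illustration, not taken from the actual files, and `parse_witness_line` is a hypothetical helper rather than anything in the notebook.

```python
import re

# Assumed layout: Last, First (Organization), City, ST
# Every part after the name is optional, since real lines omit fields.
WITNESS_RE = re.compile(
    r"""
    ^\s*
    (?P<last>[^,()]+?)\s*,\s*        # last name, up to the first comma
    (?P<first>[^,()]+?)\s*           # first name (may include an initial)
    (?:\((?P<org>[^()]*)\)\s*)?      # optional parenthesised organization
    (?:,?\s*(?P<city>[^,()]+?)\s*,\s*(?P<state>[A-Z]{2}))?  # optional city and state
    \s*$
    """,
    re.VERBOSE,
)


def parse_witness_line(line):
    """Return a dict of fields, or None if the line does not fit the assumed layout."""
    match = WITNESS_RE.match(line)
    if match is None:
        return None
    return {key: (value.strip() if value else None)
            for key, value in match.groupdict().items()}


# Hypothetical sample lines.
for sample in ["Smith, John (Texas Widget Association), Austin, TX",
               "Doe, Jane Q. (Self)",
               "HB 401"]:
    print(parse_witness_line(sample))
```

Lines that return `None` can be collected for manual review instead of being forced into the wrong columns, and the house and senate files could each get their own pattern.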

mscarey · 2017-11-01T22:00:35Z

Updated. The notebook should be easier to read and I think the output files are about 95% correct. The notebook still just uses regexes, which will never be perfect where names are involved.

jpolache · 2017-11-29T14:08:22Z

RE: Witness List issue. I was able to download the files from the SOT web site and run @mscarey’s script. That got me a HouseWitness.csv with 722 entries. I ran a script against that to match the names in the FullText field against the FirstName and LastName fields. That gave me a list of 108 mismatches. I ran some of those against the regex /\w*.\w+,.+(/ and seemed to get good results. I'd like to try it in the script, but I'm not sure how to insert it.
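The matching script itself is not shown in the thread. As a rough sketch of one way such a check could work, the code below flags rows whose FirstName or LastName does not appear in FullText. The column names come from the CSV header used in the notebook; the sample rows and the `find_name_mismatches` helper are hypothetical.

```python
import csv
import io


def find_name_mismatches(rows):
    """Yield rows whose FirstName or LastName does not appear in FullText."""
    for row in rows:
        full_text = row["FullText"].lower()
        first = (row.get("FirstName") or "").strip().lower()
        last = (row.get("LastName") or "").strip().lower()
        # An empty name counts as a mismatch: nothing was extracted for it.
        if not first or not last or first not in full_text or last not in full_text:
            yield row


# Hypothetical sample data standing in for HouseWitness.csv.
SAMPLE_CSV = """FullText,LastName,FirstName
"Smith, John (Texas Widget Association), Austin, TX",Smith,John
"Garza-Lopez, Ana Maria (Self), Laredo, TX",Garza,Lopez Ana
"O'Neil, Pat (Self)",,
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
mismatches = list(find_name_mismatches(sample_rows))
print(len(mismatches), "of", len(sample_rows), "rows need a closer look")
```

A substring test like this only catches names that are missing or mangled. It will not notice a first and last name that were swapped, so it gives a lower bound on the number of bad rows.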

mscarey · 2017-11-30T04:19:34Z

@jpolache Are you saying you're having trouble adding new lines to the jupyter notebook and rerunning it? I'm not clear on where you got stuck, but maybe you want the Jupyter documentation?

jpolache · 2017-12-01T04:52:01Z

Not sure where in the code to insert my regex. Also, not sure how your regexes work. I usually figure out other people's code by stepping through it in a debugger (not that your code has bugs :) ). I'm not good enough with Python's pdb to figure it out.
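As an alternative to stepping through with pdb, a small tracing helper can show which pattern matches a given line and what it captures. This is only a sketch: `explain_matches`, the patterns, and the sample line are all hypothetical and are not the regexes used in the notebook.

```python
import re


def explain_matches(line, patterns, flags=0):
    """Try each pattern against line and report what, if anything, it captured."""
    report = []
    for pattern in patterns:
        match = re.search(pattern, line, flags)
        if match is None:
            report.append((pattern, None))
        else:
            # Record the full matched text plus any capture groups.
            report.append((pattern, {"matched": match.group(0),
                                     "groups": match.groups()}))
    return report


# Hypothetical patterns and input line, for illustration only.
patterns = [
    r"^(\w+), (\w+)",   # "Last, First" at the start of the line
    r"\((.*?)\)",       # text inside the first pair of parentheses
    r", ([A-Z]{2})$",   # trailing two-letter state code
]
for pattern, result in explain_matches("Smith, John (Self), Austin, TX", patterns):
    print(f"{pattern!r:25} -> {result}")
```

Running this in a notebook cell against a handful of problem lines gives a quick picture of where each regex succeeds or fails. In Jupyter, the `%debug` magic is another option for inspecting state after an exception.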

mscarey · 2017-12-05T11:47:15Z

Fair enough, the code is hard to understand. I just committed a new version of the notebook that hopefully will be a little clearer. I added docstrings for each function. I also changed a few of the functions to use strings rather than lists as inputs and outputs, so I hope that doesn't break any code you've already written.

jpolache · 2017-12-06T01:48:46Z

Thanks Matt. I'm working through the new code version now.

jpolache · 2017-12-06T02:47:56Z

@mscarey Getting an error :(

import csv

def export(dir, witList):
    with open(dir,'w') as f:
        writer = csv.writer(f)
        writer.writerow(['FullText', 'Position', 'Bill', 'LastName', 'FirstName', 'Role', 'Organization', 'City', 'State'])
        writer.writerows(witList) # better just to include the FullText field.
    return None

houseDir = 'C:/Users/user/Documents/code3/venv/witness/HouseWitness1205.csv'
#houseDir = '../data/witness-lists/HouseWitness.csv'
export(houseDir, houseRows)

NameError Traceback (most recent call last)
in ()
10 houseDir = 'C:/Users/user/Documents/code3/venv/witness/HouseWitness1205.csv'
11 #houseDir = '../data/witness-lists/HouseWitness.csv'
---> 12 export(houseDir, houseRows)

NameError: name 'houseRows' is not defined

mscarey · 2017-12-06T06:28:49Z

@jpolache is it possible you only ran the code in that block, but you didn't run the code earlier in the notebook where a list is assigned to the variable "houseRows"?

jpolache · 2017-12-06T14:20:01Z

This time, I went to the menu and selected Kernel->Restart and Run All. Here is the error:

error Traceback (most recent call last)
in ()
22 folderName = 'C:/Users/user/Documents/code3/venv/witness/house_bills/HB00400_HB00499/'
23 #folderName = 'bills/85R/witlistbill/html/house_bills/'
---> 24 houseWit = extractRows(folderName)

in extractRows(folderName)
15 wit = HBWitness(source)
16 # trying to rejoin entries split across lines
---> 17 new = mergelines(wit)
18 new = mergelines(new) # Will doing it twice catch 3-line entries?
19 houseWit.extend(new)

in mergelines(wit)
28 changed += 1
29
---> 30 elif row[0].count(')') != row[0].count('(') and wit[lineIndex + 1][0].count(')') != wit[lineIndex + 1][0].count('(') and addName(row[0]) != [None, None] and not re.search(endWithState, row[0]):
31 newList.append([row[0] + " " + wit[lineIndex + 1][0], row[1], row[2]])
32 badList.append(wit[lineIndex + 1][0:3])

in addName(line)
17 for f in flags:
18 for r in regexes:
---> 19 nameRe = re.compile(r, f)
20 match = re.search(nameRe, line)
21 if match:

c:\users\user\documents\code3\venv\lib\re.py in compile(pattern, flags)
231 def compile(pattern, flags=0):
232 "Compile a regular expression pattern, returning a pattern object."
--> 233 return _compile(pattern, flags)
234
235 def purge():

c:\users\user\documents\code3\venv\lib\re.py in _compile(pattern, flags)
299 if not sre_compile.isstring(pattern):
300 raise TypeError("first argument must be string or compiled pattern")
--> 301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
303 if len(_cache) >= _MAXCACHE:

c:\users\user\documents\code3\venv\lib\sre_compile.py in compile(p, flags)
560 if isstring(p):
561 pattern = p
--> 562 p = sre_parse.parse(p, flags)
563 else:
564 pattern = None

c:\users\user\documents\code3\venv\lib\sre_parse.py in parse(str, flags, pattern)
853
854 try:
--> 855 p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
856 except Verbose:
857 # the VERBOSE flag was switched on inside the pattern. to be

c:\users\user\documents\code3\venv\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
414 while True:
415 itemsappend(_parse(source, state, verbose, nested + 1,
--> 416 not nested and not items))
417 if not sourcematch("|"):
418 break

c:\users\user\documents\code3\venv\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
766 if not source.match(")"):
767 raise source.error("missing ), unterminated subpattern",
--> 768 source.tell() - start)
769 if group is not None:
770 state.closegroup(group, p)

error: missing ), unterminated subpattern at position 10

jpolache · 2017-12-09T21:37:26Z

Fixed the "error: missing ), unterminated subpattern at position 10" issue by using an escape "\(" on the open paren in my regex pattern. But the regex pattern is not compatible with the rest of the script, apparently because I am not using grouping (?:). So far I am not able to duplicate the regex I came up with using grouping and am hesitant to try and rewrite everything else to resolve the error. Research continues.
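The escape fix can be reproduced in isolation. The sketch below compiles the pattern from earlier in the thread with and without the escaped parenthesis, then shows a non-capturing group. The sample lines and the `WITH_GROUPS` pattern are hypothetical, and whether the notebook reads capture groups by position is an assumption.

```python
import re

# The pattern from earlier in the thread, with its trailing "(" left unescaped.
BROKEN = r"\w*.\w+,.+("
# Escaping the parenthesis makes it match a literal "(" instead of opening a group.
FIXED = r"\w*.\w+,.+\("


def try_compile(pattern):
    """Return None if the pattern compiles, otherwise the re.error that was raised."""
    try:
        re.compile(pattern)
    except re.error as err:
        return err
    return None


print("broken:", try_compile(BROKEN))
print("fixed: ", try_compile(FIXED))

# Hypothetical sample line; the real witness-list layout may differ.
line = "Smith, John (Texas Widget Association)"
print(re.search(FIXED, line).group(0))

# A non-capturing group (?:...) groups part of a pattern without adding an
# entry to match.groups(), so code that indexes groups by position is unaffected.
WITH_GROUPS = r"(\w+),\s*(?:Dr\.\s+)?(\w+)\s*\("
print(re.search(WITH_GROUPS, "Smith, Dr. John (Self)").groups())
```

If the surrounding script expects a fixed number of capture groups, wrapping the name parts in `(...)` and everything else in `(?:...)` is one way to keep a new pattern compatible with it.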

mscarey · 2017-12-13T05:12:18Z

@jpolache I can't quite tell which regular expression is triggering the error for you. Maybe the issue is that I was running a different version of Python or a library. I added lines to the notebook (https://github.com/open-austin/influence-texas/blob/master/notebooks/witness-lists.ipynb) that show what I was running. Here's what they show:

Python 3.6.1 :: Continuum Analytics, Inc.
bs4 4.6.0
pandas 0.20.3
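The notebook lines that print these versions are not reproduced in the thread. One way to collect similar information in a single call is sketched below, assuming Python 3.8 or later for `importlib.metadata` (newer than the 3.6 used here). `report_versions` is a hypothetical helper.

```python
import sys
from importlib import metadata


def report_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {"python": sys.version.split()[0]}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as pip knows them: Beautiful Soup installs as
# "beautifulsoup4" even though it is imported as "bs4".
for name, version in report_versions(["beautifulsoup4", "pandas"]).items():
    print(name, version or "not installed")
```

Because the lookup uses distribution names rather than import names, the results line up with what `pip freeze` reports, which makes two environments easier to compare.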

jpolache · 2017-12-13T12:43:01Z

Thanks Matt. I'll check it out.

jpolache · 2017-12-15T04:32:24Z

Matt,

Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)]

beautifulsoup4==4.6.0
bs4==0.0.1
pandas==0.21.0

freeze.txt (https://github.com/open-austin/influence-texas/files/1561598/freeze.txt)

lazarus1331 · 2017-12-17T01:48:44Z

I will be at the meetup on Monday evening. Will Matt or John be there as well? If so, we should be able to easily resolve this.

-Michael

mscarey · 2017-12-18T22:59:29Z

@lazarus1331 Yeah, I'll be there. There isn't usually a bunch of time to do project work at the library, but I'll bring my computer.

jpolache · 2017-12-20T02:56:59Z

Sorry I was unable to attend the meetup. I now have the code running in my environment and have done some analysis of the output. Let me know if you would like to discuss next steps. Jonathan 512 659 6919

AmyMcCullough added Coding Python labels Oct 7, 2017

lazarus1331 added Data Science Data Wrangling feedback wanted labels Mar 20, 2019

lazarus1331 added this to the Industry Classification milestone Mar 22, 2019
