Refactor for scraped #7

ondenman · 2017-04-25T17:08:45Z

This is the first step in moving the scraper to scraped.

The output of this PR matches the output in master.

Further changes to be made after this PR:

Fix the image field so the correct image is captured for all members (everypolitician/everypolitician-data#33716, "Hong Kong: Sixth Legislative Council -- member has PDF download button for image").
Split multiple phone numbers and emails (everypolitician/everypolitician-data#33726, "Hong Kong: Sixth Legislative Council -- multiple emails and phone numbers need splitting").

tmtmtmtm

This ended up being much more of a rewrite than I was expecting. In general you've done a pretty good job of that, but there are a few too many places where you're doing things in a non-standard way and it's difficult to know if that's deliberate/needed.

tmtmtmtm · 2017-04-26T08:41:41Z

scraper.rb

+require 'nokogiri'
+require 'scraped_page_archive/open-uri'
+require 'date'
+require 'scraped'


I'm assuming you've reintroduced these by mistake?

tmtmtmtm · 2017-04-26T08:42:38Z

scraper.rb

-
+list_url = 'http://www.legco.gov.hk/general/english/members/yr16-20/biographies.htm'
+(scrape list_url => MembersPage).member_urls.each do |url|
+  data = (scrape url => MemberPage).to_h.merge(term: 6)
  ScraperWiki.save_sqlite([:id], data)


Our normal practice here is to do this save in one shot after building up all the data, rather than within the loop.

tmtmtmtm · 2017-04-26T08:43:03Z

scraper.rb

 end
-
-ScraperWiki.sqliteexecute('DROP TABLE data') rescue nil


why have you removed the DROP TABLE?

tmtmtmtm · 2017-04-26T08:46:28Z

test/data/suk-yee.yml

+  :faction: New People's Party
+  :email: [email protected]
+  :website: http://www.reginaip.hk
+  :phone: 2537 3267 / 2537 3265


I think you should add an explicit TODO note here that this isn't what we want

tmtmtmtm · 2017-04-26T08:48:52Z

lib/member_page.rb

+# frozen_string_literal: true
+
+require 'scraped'
+require 'pry'


I don't think you want/need this to be here.

tmtmtmtm · 2017-04-26T09:16:07Z

lib/member_page.rb

+
+  field :faction do
+    f = bio.xpath('//p[contains(.,"Political affiliation")]/'\
+                  'following-sibling::ul[not(position() > 1)]/li/text()')


That's worth factoring out to a separate method, rather than doing both the finding and the processing here.

tmtmtmtm · 2017-04-26T09:18:16Z

lib/member_page.rb

+      'Kowloon West New Dynamic',
+      'New Territories Association of Societies',
+      'April Fifth Action',
+    ]


It's a fairly small list, so it's not a massive issue here, but it's worth getting into the habit of using a Set instead of an Array when a list exists solely to be scanned, so that lookup is O(1) instead of O(n).
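As a rough sketch of that suggestion, assuming the list in the diff is the one used for the non-party check: the constant and method names below are hypothetical, and only the three group names visible in the diff are included.

```ruby
require 'set'

# Hypothetical constant name; the entries are the three names visible in the diff.
NON_PARTY_GROUP_NAMES = Set.new(
  [
    'Kowloon West New Dynamic',
    'New Territories Association of Societies',
    'April Fifth Action',
  ]
).freeze

# Hypothetical helper: Set#include? is a hash lookup rather than a linear scan.
def non_party_group?(name)
  NON_PARTY_GROUP_NAMES.include?(name)
end
```

The call site reads the same as it would with an Array; only the container changes.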

tmtmtmtm · 2017-04-26T09:20:49Z

lib/member_page.rb

+    # Some member pages list more than one group affiliation for that member
+    # Here, we remove affiliations with known non-party groups
+    f.map(&:to_s).map(&:tidy).find do |party|
+      !non_party_groups.to_s.include? party


A find with a negated condition like this is a little awkward, and it also slightly obscures what happens when more than one of the affiliations is a non-party group.

I think this might be a little bit clearer and more explicit as .reject { … }.first
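A minimal sketch of the two spellings side by side, on made-up data: the method names and the 'Example Party' entry are hypothetical. Both return the same value; the second states the filtering step explicitly.

```ruby
# Hypothetical helpers comparing the two forms discussed above.
def first_party_via_find(affiliations, non_party_groups)
  affiliations.find { |party| !non_party_groups.include?(party) }
end

def first_party_via_reject(affiliations, non_party_groups)
  # Drop every non-party group first, then take whatever is left at the front.
  affiliations.reject { |party| non_party_groups.include?(party) }.first
end

groups = ['Kowloon West New Dynamic', 'April Fifth Action']
# Two non-party groups listed before the (made-up) party.
listed = ['April Fifth Action', 'Kowloon West New Dynamic', 'Example Party']

first_party_via_find(listed, groups)
first_party_via_reject(listed, groups)
```

When every listed affiliation is a non-party group, both forms return nil.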

tmtmtmtm · 2017-04-26T09:23:25Z

lib/member_page.rb

+
+  field :area do
+    # splitting here by en-dash (not hyphen)
+    area_parts.last.split('–').last.tidy


Changing this to a hyphen doesn't cause a test failure, so either it's unnecessary or you want an extra test case…

BTW: rather than having to draw attention to the specific character with a comment, perhaps it might be clearer to use a unicode string explicitly: .split("\u{2013}")?
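For illustration, a small sketch of the escaped form. The sample string and the constant name are assumptions, not taken from the real page, and String#strip stands in for the tidy helper used in the scraper.

```ruby
# Hypothetical constant naming the separator, so no comment about the glyph is needed.
EN_DASH = "\u{2013}"

# Made-up sample text containing an en-dash separator.
sample = "Geographical Constituency \u{2013} Hong Kong Island"

with_en_dash = sample.split(EN_DASH).last.strip
# Splitting on an ASCII hyphen finds no separator, so the whole string comes back.
with_hyphen = sample.split('-').last.strip
```

A test fixture whose area text contains the en-dash would make the difference between the two splits visible.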

tmtmtmtm · 2017-04-26T09:33:31Z

lib/members_page.rb

+  decorator Scraped::Response::Decorator::CleanUrls
+
+  field :member_urls do
+    noko.css('.bio-member-detail-1 a/@href').map(&:to_s)


We usually use .text rather than .to_s there, so unless there's a reason why that doesn't work here, it's better to be consistent.

tmtmtmtm · 2017-05-02T14:58:25Z

lib/member_page.rb

+  end
+
+  field :faction do
+    return 'Independent' if (affiliation = political_affiliation).empty?


Creating an extra affiliation variable here is a little clumsy, and doesn't really buy us anything beyond saving one extra call to political_affiliation. If that method were slow we could memoise it, but as it's just an XPath lookup I don't think there's any need for that either. I'd just leave this as if political_affiliation.empty?

tmtmtmtm · 2017-05-02T14:59:17Z

lib/member_page.rb

+    # Some member pages list more than one group affiliation for that member
+    # Here, we remove affiliations with known non-party groups
+    affiliation.map(&:to_s).map(&:tidy).reject do |party|
+      non_party_groups.to_s.include? party


Why are you casting non_party_groups to a string here? The idea of it being a Set is that you have an O(1) lookup directly into it…
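A sketch of why the cast matters beyond speed: String#include? is a substring test, so a fragment of a group name matches. The exact text that Set#to_s produces varies between Ruby versions, so this example joins the names itself as a stand-in for it; the variable names and the 'Action' probe are hypothetical.

```ruby
require 'set'

groups = Set.new(['New Territories Association of Societies', 'April Fifth Action'])

# Stand-in for a stringified collection (an assumption; see the note above).
groups_as_text = groups.to_a.join(', ')

fragment = 'Action' # made-up affiliation that is only part of a group name

substring_hit = groups_as_text.include?(fragment) # substring search over the text
member_hit = groups.include?(fragment)            # exact membership in the Set
```

Here the string check reports a match for a name that is not in the list, while the Set lookup does not.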

tmtmtmtm · 2017-05-02T15:01:08Z

lib/member_page.rb

+  end
+
+  field :name do
+    name_parts.first.to_s.gsub(Regexp.union(titles << '.'), '').tidy


My previous concerns here still apply…

tmtmtmtm · 2017-05-11T08:22:01Z

lib/member_page.rb

-    affiliation.map(&:to_s).map(&:tidy).reject do |party|
-      non_party_groups.to_s.include? party
+    political_affiliation.map(&:to_s).map(&:tidy).reject do |party|
+      non_party_groups.include? party
    end.first


Ditto the above comment for list1 - list2 vs list1.reject { |e| list2.include? e }

NB: this removal/subtraction appears to be untested. If I drop the reject part of this, and make this simply political_affiliation.map(&:to_s).map(&:tidy).first, the tests still pass.

tmtmtmtm · 2017-05-11T08:27:36Z

lib/member_page.rb

@@ -15,7 +15,7 @@ class MemberPage < Scraped::HTML
  end

  field :name do
-    name_parts.first.to_s.gsub(Regexp.union(titles << '.'), '').tidy
+    name_parts.first.split.reject { |a| titles.include? a }.map(&:tidy).join(' ')


The logic here is quite odd.

Why are you calling tidy after the reject? If the parts aren't already suitably tidied, then they won't be correctly filtered out. If they are, then why tidy them again? What's that protecting against?

name_parts.first.split is a little hard to follow, possibly as name_parts doesn't really convey what the data is you're dealing with (and worse makes it sound like it's already the parts of the name, in which case why are we splitting them up again). I suspect it would be better to factor this out as another method, and make sure both have suitably descriptive names.

Iterating over a list to remove, one by one, the members of another list, is quite long-winded, confusing, and inefficient. list1 - list2 is much simpler and clearer.
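A minimal sketch of the subtraction form, on assumed data: the title list, the sample name and the method name are all hypothetical. Array#- expects an Array on the right-hand side, so a Set of titles would need to_a first.

```ruby
# Hypothetical helper: remove title words from a name using Array difference.
def name_without_titles(full_name, titles)
  (full_name.split - titles).join(' ')
end

titles = %w[Hon Dr Prof Mrs]            # assumed list of titles
sample = 'Hon Mrs Regina IP LAU Suk-yee' # made-up input in the page's style

plain_name = name_without_titles(sample, titles)

# The longer spelling the comment above refers to; it yields the same words.
same_words = sample.split.reject { |word| titles.include?(word) }
```

String#split with no argument already discards surrounding whitespace, so the words need no further cleaning before the subtraction in this sketch.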

tmtmtmtm · 2017-05-12T10:37:21Z

@ondenman this doesn't autosquash cleanly. Can you squash it down to a series of clean commits?

ondenman · 2017-05-12T11:41:14Z

I've squashed the commits down and have opened a separate PR (#8) to set up the test framework. I will move the commits to a new branch so Travis runs the regression tests.

ondenman · 2017-05-18T10:55:18Z

Closing in favour of: https://github.com/everypolitician-scrapers/hong_kong_legislative_council_members/compare/use-scraped?expand=1

ondenman mentioned this pull request Apr 25, 2017

WIP: Use scraped #3

Closed

ondenman requested a review from tmtmtmtm April 25, 2017 17:12

ondenman assigned tmtmtmtm Apr 25, 2017

tmtmtmtm suggested changes Apr 26, 2017

tmtmtmtm assigned ondenman and unassigned tmtmtmtm Apr 26, 2017

ondenman force-pushed the used-scraped--refactor branch from c859362 to a3ef6ad Compare May 2, 2017 11:34

ondenman requested a review from tmtmtmtm May 2, 2017 11:40

ondenman assigned tmtmtmtm May 2, 2017

tmtmtmtm suggested changes May 2, 2017

tmtmtmtm removed their assignment May 2, 2017

ondenman requested a review from tmtmtmtm May 9, 2017 11:23

ondenman assigned tmtmtmtm May 9, 2017

ondenman removed the request for review from tmtmtmtm May 9, 2017 11:23

ondenman unassigned tmtmtmtm May 9, 2017

ondenman requested a review from tmtmtmtm May 9, 2017 11:24

ondenman assigned tmtmtmtm May 9, 2017

tmtmtmtm suggested changes May 11, 2017

tmtmtmtm removed their assignment May 11, 2017

ondenman requested a review from tmtmtmtm May 11, 2017 11:26

ondenman assigned tmtmtmtm and unassigned tmtmtmtm May 11, 2017

ondenman requested review from tmtmtmtm and removed request for tmtmtmtm May 11, 2017 11:28

ondenman assigned tmtmtmtm May 11, 2017

tmtmtmtm removed their assignment May 12, 2017

Oliver Denman added 3 commits May 12, 2017 12:15

Extract MembersPage

2ce4f90

This class represents a document listing members of the legislature.

Extract MemberPage

cad259f

Add scraper test Rake task

b3c9c22

Oliver Denman added 7 commits May 12, 2017 12:15

Configure travis to run tests

32169fa

Clear out rubocop todo

2b02c4f

The scraper now conforms to rubocop.

Test member without gender data

31b8fba

Test member with gender data

a2de617

Construct :id from term and basename parts of the source URL

bfb1dc0

This commit extracts the term from the URL in a separate method. It is then used when constructing the member id.

Test member with non-party group

cb7a667

Test Independent member

a5cc143

ondenman force-pushed the used-scraped--refactor branch from 7cfe436 to a5cc143 Compare May 12, 2017 11:18

ondenman closed this May 18, 2017
