-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor for scraped #7
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ended up being much more of a rewrite than I was expecting. In general you've done a pretty good job of that, but there are a few too many places where you're doing things in a non-standard way and it's difficult to know if that's deliberate/needed.
scraper.rb
Outdated
require 'nokogiri' | ||
require 'scraped_page_archive/open-uri' | ||
require 'date' | ||
require 'scraped' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming you've reintroduced these by mistake?
scraper.rb
Outdated
|
||
list_url = 'http://www.legco.gov.hk/general/english/members/yr16-20/biographies.htm' | ||
(scrape list_url => MembersPage).member_urls.each do |url| | ||
data = (scrape url => MemberPage).to_h.merge(term: 6) | ||
ScraperWiki.save_sqlite([:id], data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our normal practice here is to do this save
in one shot after building up all the data, rather than within the loop.
scraper.rb
Outdated
end | ||
|
||
ScraperWiki.sqliteexecute('DROP TABLE data') rescue nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why have you removed the DROP TABLE
?
test/data/suk-yee.yml
Outdated
:faction: New People's Party | ||
:email: [email protected] | ||
:website: http://www.reginaip.hk | ||
:phone: 2537 3267 / 2537 3265 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should add an explicit TODO
note here that this isn't what we want
lib/member_page.rb
Outdated
# frozen_string_literal: true | ||
|
||
require 'scraped' | ||
require 'pry' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you want/need this to be here.
lib/member_page.rb
Outdated
|
||
field :faction do | ||
f = bio.xpath('//p[contains(.,"Political affiliation")]/'\ | ||
'following-sibling::ul[not(position() > 1)]/li/text()') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's worth factoring out to a separate method, rather than doing both the finding and the processing here.
lib/member_page.rb
Outdated
'Kowloon West New Dynamic', | ||
'New Territories Association of Societies', | ||
'April Fifth Action', | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a fairly small list, so it's not a massive issue here, but it's worth getting into the habit of using a Set instead of an Array when a list exists solely to be scanned, so that lookup is O(1) instead of O(n)
lib/member_page.rb
Outdated
# Some member pages list more than one group affiliation for that member | ||
# Here, we remove affiliations with known non-party groups | ||
f.map(&:to_s).map(&:tidy).find do |party| | ||
!non_party_groups.to_s.include? party |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a find
with a not
like this is a little bit awkward, and also disguises slightly the case of what's happening if there's more than one non-party-group party.
I think this might be a little bit clearer and more explicit as .reject { … }.first
lib/member_page.rb
Outdated
|
||
field :area do | ||
# splitting here by en-dash (not hyphen) | ||
area_parts.last.split('–').last.tidy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this to a hyphen doesn't cause a test failure, so either it's unnecessary or you want an extra test case…
BTW: rather than having to draw attention to the specific character with a comment, perhaps it might be clearer to use a unicode string explicitly: .split("\u{2013}")
?
lib/members_page.rb
Outdated
decorator Scraped::Response::Decorator::CleanUrls | ||
|
||
field :member_urls do | ||
noko.css('.bio-member-detail-1 a/@href').map(&:to_s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually use .text
rather than .to_s
there, so unless there's a reason why that doesn't work here, it's better to be consistent.
c859362
to
a3ef6ad
Compare
lib/member_page.rb
Outdated
end | ||
|
||
field :faction do | ||
return 'Independent' if (affiliation = political_affiliation).empty? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating an extra affiliation
variable here is a little clumsy, and doesn't really buy us anything, other than the cost of an extra method call to political_affiliation
(if that method in turn were slow then we could memoise it, but as it's just an XPath lookup, then I don't think there's any need for that either). I'd just leave this as if political_affiliation.empty?
lib/member_page.rb
Outdated
# Some member pages list more than one group affiliation for that member | ||
# Here, we remove affiliations with known non-party groups | ||
affiliation.map(&:to_s).map(&:tidy).reject do |party| | ||
non_party_groups.to_s.include? party |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you casting non_party_groups
to a string here? The idea of it being a Set
is that you have an O(1) lookup directly into it…
lib/member_page.rb
Outdated
end | ||
|
||
field :name do | ||
name_parts.first.to_s.gsub(Regexp.union(titles << '.'), '').tidy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous concerns here still apply…
lib/member_page.rb
Outdated
affiliation.map(&:to_s).map(&:tidy).reject do |party| | ||
non_party_groups.to_s.include? party | ||
political_affiliation.map(&:to_s).map(&:tidy).reject do |party| | ||
non_party_groups.include? party | ||
end.first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto the above comment for list1 - list2
vs list1.reject { |e| list2.include? e }
NB: this removal/subtraction appears to be untested. If I drop the reject
part of this, and make this simply political_affiliation.map(&:to_s).map(&:tidy).first
, the tests still pass.
lib/member_page.rb
Outdated
@@ -15,7 +15,7 @@ class MemberPage < Scraped::HTML | |||
end | |||
|
|||
field :name do | |||
name_parts.first.to_s.gsub(Regexp.union(titles << '.'), '').tidy | |||
name_parts.first.split.reject { |a| titles.include? a }.map(&:tidy).join(' ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic here is quite odd.
-
Why are you calling
tidy
after thereject
? If the parts aren't already suitable tidied, then they won't be correctly filtered out. If they are, then why re-tidy them again? What's that protecting against? -
name_parts.first.split
is a little hard to follow, possibly asname_parts
doesn't really convey what the data is you're dealing with (and worse makes it sound like it's already the parts of the name, in which case why are we splitting them up again). I suspect it would be better to factor this out as another method, and make sure both have suitably descriptive names. -
Iterating over a list to remove, one by one, the members of another list, is quite long-winded, confusing, and inefficient.
list1 - list2
is much simpler and clearer.
@ondenman this doesn't autosquash cleanly. Can you squash it down to a series of clean commits? |
This class represents a document listing members of the legislature.
The scraper now conforms to rubocop.
This commit extracts the term from the URL in a separate method. It is then used when constructing the member id.
7cfe436
to
a5cc143
Compare
I've squashed the commits down and have opened a separate PR (#8) to setup the test framework. I will move the commits to a new branch so Travis runs the regression tests. |
This is the first step in moving the scraper to scraped.
The output of this PR matches the output in master.
Further changes to be made after this PR: