Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor for scraped #7

Closed
wants to merge 10 commits into from
Closed

Refactor for scraped #7

wants to merge 10 commits into from

Conversation

ondenman
Copy link
Contributor

@ondenman ondenman commented Apr 25, 2017

This is the first step in moving the scraper to scraped.

The output of this PR matches the output in master.

Further changes to be made after this PR:

Copy link
Contributor

@tmtmtmtm tmtmtmtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ended up being much more of a rewrite than I was expecting. In general you've done a pretty good job of that, but there are a few too many places where you're doing things in a non-standard way and it's difficult to know if that's deliberate/needed.

scraper.rb Outdated
require 'nokogiri'
require 'scraped_page_archive/open-uri'
require 'date'
require 'scraped'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming you've reintroduced these by mistake?

scraper.rb Outdated

list_url = 'http://www.legco.gov.hk/general/english/members/yr16-20/biographies.htm'
(scrape list_url => MembersPage).member_urls.each do |url|
data = (scrape url => MemberPage).to_h.merge(term: 6)
ScraperWiki.save_sqlite([:id], data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our normal practice here is to do this save in one shot after building up all the data, rather than within the loop.

scraper.rb Outdated
end

ScraperWiki.sqliteexecute('DROP TABLE data') rescue nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have you removed the DROP TABLE?

:faction: New People's Party
:email: [email protected]
:website: http://www.reginaip.hk
:phone: 2537 3267 / 2537 3265
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should add an explicit TODO note here that this isn't what we want

# frozen_string_literal: true

require 'scraped'
require 'pry'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you want/need this to be here.


field :faction do
f = bio.xpath('//p[contains(.,"Political affiliation")]/'\
'following-sibling::ul[not(position() > 1)]/li/text()')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's worth factoring out to a separate method, rather than doing both the finding and the processing here.

'Kowloon West New Dynamic',
'New Territories Association of Societies',
'April Fifth Action',
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a fairly small list, so it's not a massive issue here, but it's worth getting into the habit of using a Set instead of an Array when a list exists solely to be scanned, so that lookup is O(1) instead of O(n)

# Some member pages list more than one group affiliation for that member
# Here, we remove affiliations with known non-party groups
f.map(&:to_s).map(&:tidy).find do |party|
!non_party_groups.to_s.include? party
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a find with a not like this is a little bit awkward, and also disguises slightly the case of what's happening if there's more than one non-party-group party.

I think this might be a little bit clearer and more explicit as .reject { … }.first


field :area do
# splitting here by en-dash (not hyphen)
area_parts.last.split('–').last.tidy
Copy link
Contributor

@tmtmtmtm tmtmtmtm Apr 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this to a hyphen doesn't cause a test failure, so either it's unnecessary or you want an extra test case…

BTW: rather than having to draw attention to the specific character with a comment, perhaps it might be clearer to use a unicode string explicitly: .split("\u{2013}")?

decorator Scraped::Response::Decorator::CleanUrls

field :member_urls do
noko.css('.bio-member-detail-1 a/@href').map(&:to_s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use .text rather than .to_s there, so unless there's a reason why that doesn't work here, it's better to be consistent.

end

field :faction do
return 'Independent' if (affiliation = political_affiliation).empty?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating an extra affiliation variable here is a little clumsy, and doesn't really buy us anything, other than the cost of an extra method call to political_affiliation (if that method in turn were slow then we could memoise it, but as it's just an XPath lookup, then I don't think there's any need for that either). I'd just leave this as if political_affiliation.empty?

# Some member pages list more than one group affiliation for that member
# Here, we remove affiliations with known non-party groups
affiliation.map(&:to_s).map(&:tidy).reject do |party|
non_party_groups.to_s.include? party
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you casting non_party_groups to a string here? The idea of it being a Set is that you have an O(1) lookup directly into it…

end

field :name do
name_parts.first.to_s.gsub(Regexp.union(titles << '.'), '').tidy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous concerns here still apply…

@tmtmtmtm tmtmtmtm removed their assignment May 2, 2017
@ondenman ondenman requested a review from tmtmtmtm May 9, 2017 11:23
@ondenman ondenman removed the request for review from tmtmtmtm May 9, 2017 11:23
@ondenman ondenman requested a review from tmtmtmtm May 9, 2017 11:24
affiliation.map(&:to_s).map(&:tidy).reject do |party|
non_party_groups.to_s.include? party
political_affiliation.map(&:to_s).map(&:tidy).reject do |party|
non_party_groups.include? party
end.first
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto the above comment for list1 - list2 vs list1.reject { |e| list2.include? e }

NB: this removal/subtraction appears to be untested. If I drop the reject part of this, and make this simply political_affiliation.map(&:to_s).map(&:tidy).first, the tests still pass.

@@ -15,7 +15,7 @@ class MemberPage < Scraped::HTML
end

field :name do
name_parts.first.to_s.gsub(Regexp.union(titles << '.'), '').tidy
name_parts.first.split.reject { |a| titles.include? a }.map(&:tidy).join(' ')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic here is quite odd.

  1. Why are you calling tidy after the reject? If the parts aren't already suitable tidied, then they won't be correctly filtered out. If they are, then why re-tidy them again? What's that protecting against?

  2. name_parts.first.split is a little hard to follow, possibly as name_parts doesn't really convey what the data is you're dealing with (and worse makes it sound like it's already the parts of the name, in which case why are we splitting them up again). I suspect it would be better to factor this out as another method, and make sure both have suitably descriptive names.

  3. Iterating over a list to remove, one by one, the members of another list, is quite long-winded, confusing, and inefficient. list1 - list2 is much simpler and clearer.

@tmtmtmtm tmtmtmtm removed their assignment May 11, 2017
@ondenman ondenman requested a review from tmtmtmtm May 11, 2017 11:26
@ondenman ondenman assigned tmtmtmtm and unassigned tmtmtmtm May 11, 2017
@ondenman ondenman requested review from tmtmtmtm and removed request for tmtmtmtm May 11, 2017 11:28
@tmtmtmtm
Copy link
Contributor

@ondenman this doesn't autosquash cleanly. Can you squash it down to a series of clean commits?

@tmtmtmtm tmtmtmtm removed their assignment May 12, 2017
Oliver Denman added 3 commits May 12, 2017 12:15
This class represents a document listing members of the legislature.
Oliver Denman added 7 commits May 12, 2017 12:15
@ondenman
Copy link
Contributor Author

I've squashed the commits down and have opened a separate PR (#8) to setup the test framework. I will move the commits to a new branch so Travis runs the regression tests.

@ondenman
Copy link
Contributor Author

@ondenman ondenman closed this May 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants