Hpricot Basics
(Part of An Hpricot Showcase.)
You have probably got the gem, right? To load Hpricot:
require 'rubygems'
require 'hpricot'
If you’ve installed the plain source distribution, go ahead and just:
require 'hpricot'
The Hpricot() method takes a string or any IO object and loads the contents into a document object.
doc = Hpricot("<p>A simple <b>test</b> string.</p>")
To load from a file, just get the stream open:
doc = open("index.html") { |f| Hpricot(f) }
To load from a web URL, use open-uri, which comes with Ruby (on Ruby 3.0 and later, call URI.open instead of open):
require 'open-uri'
doc = Hpricot(open("http://qwantz.com/"))
To find elements, use Doc.search:
doc.search("//p[@class='posted']") #=> #<Hpricot::Elements[{p ...}, {p ...}]>
Doc.search can take an XPath or CSS expression. The example above grabs every paragraph <p> element whose class attribute is "posted".
A shortcut is to use the division operator (/):
(doc/"p.posted") #=> #<Hpricot::Elements[{p ...}, {p ...}]>
If you’re looking for a single element, the at method will return the first element matched by the expression. In this case, you’ll get back the element itself rather than the Hpricot::Elements array.
doc.at("body")['onload']
The above code will find the body tag and give you back the onload attribute. This is the most common reason to use the element directly: when reading and writing HTML attributes.
Just as with browser scripting, the inner_html property can be used to get the inner contents of an element.
(doc/"#elementID").inner_html #=> "..<b>contents</b>.."
If your expression matches more than one element, you’ll get back the contents of all the matched elements, joined together. So you may want to use first to be sure you get back only one.
(doc/"#elementID").first.inner_html #=> "..<b>contents</b>.."
If you want the HTML for the whole element (not just the contents), use to_html:
(doc/"#elementID").to_html #=> "<div id='elementID'>...</div>"
All searches return a set of Elements. Go ahead and loop through them like you would an array.
(doc/"p/a/img").each do |img|
  puts img.attributes['class']
end
Searches can be continued from a collection of elements, in order to search deeper.
# find all paragraphs.
elements = doc.search("/html/body//p")
# continue the search by finding any images within those paragraphs.
(elements/"img") #=> #<Hpricot::Elements[{img ...}, {img ...}]>
Searches can also be continued by searching within container elements.
require 'pp' # pp must be loaded explicitly on older Rubies
# find all images within paragraphs.
doc.search("/html/body//p").each do |para|
  puts "== Found a paragraph =="
  pp para
  imgs = para.search("img")
  if imgs.any?
    puts "== Found #{imgs.length} images inside =="
  end
end
Of course, the most succinct ways to do the above are using CSS or XPath.
# the xpath version
(doc/"/html/body//p//img")
# the css version
(doc/"html > body > p img")
# ..or symbols work, too!
(doc/:html/:body/:p/:img)
You may certainly edit objects from within your search loops. Then, when you spit out the HTML, the altered elements will show.
(doc/"span.entryPermalink").each do |span|
  span.set_attribute :class, 'newLinks'
end
puts doc
This changes all span.entryPermalink elements to span.newLinks. Keep in mind that there are often more convenient ways of doing this, such as the set method:
(doc/"span.entryPermalink").set(:class => 'newLinks')
Every element can tell you its unique path (either XPath or CSS) to get to the element from the root tag.
The css_path method:
doc.at("div > div:nth(1)").css_path #=> "div > div:nth(1)"
doc.at("#header").css_path #=> "#header"
Or, the xpath method:
doc.at("div > div:nth(1)").xpath #=> "/div/div:eq(1)"
doc.at("#header").xpath #=> "//div[@id='header']"
Return to An Hpricot Showcase.