Hpricot Challenge
Trying to parse out some tricky HTML? Add your question to this page and we’ll see if we can track down a simpler path to it.
Q: How do I find all images in an HTML email?
I’m testing HTML emails to make sure all the image paths are properly formed and the images exist. I currently know how to find images if they are <TD BACKGROUND=...> or <IMG SRC=...>. But if my email template maker guy adds an image elsewhere my test code will phail. Fetching all images will solve my problem without me having to talk to a human. Thank you!
A: Simple! Use the img CSS selector: (doc/"img").
So, to throw an error when an image is found:
    unless (doc/"img").empty?
      raise Exception, "no images allowed"
    end
Q: I am new to Ruby, Rails and Hpricot, and don’t understand most of the XPath or CSS stuff! I have managed to get Hpricot to scrape through a document to find the section that I want, but now I am stuck with a table which is in this form:
<table> <tr> <td>...stuff I don't want...</td> </tr> <tr> <td> <table> ------------rows i want <tr> <td> <table> <tr> <td>Field 1</td> <td>Field 2</td> </tr> </table> </td> <td>Field 3</td> <td>Field 4, Field 5</td> </tr> ------------end of rows i want </table> </td> </tr> </table>
…and I really need to be able to have these in the form [“Field 1”, “Field 2”, “Field 3”, “Field 4”, “Field 5”] for each row [there will be many rows]. I tried telling it to remove the first child to get rid of the first contents, however it seems to go through all the code and also removes the Field 4 . Anybody able to help me do that please?
A: This might not be optimal, but it seems to get the job done for what you want:
    (doc/"table//table//td").collect do |k|
      k.inner_html.split(',') unless k.inner_html =~ /</
    end.flatten.compact
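The post-processing half of that one-liner can be tried without Hpricot. A minimal sketch, assuming the td search yields inner_html strings like the stand-in `cells` array below (the array contents and the extra `strip` call are assumptions, not part of the original answer):

```ruby
# Stand-in data: the inner_html strings the td search might yield for one row.
# (Assumed values; no Hpricot is needed to run this sketch.)
cells = [
  '<table> <tr> <td>Field 1</td> <td>Field 2</td> </tr> </table>',
  'Field 1',
  'Field 2',
  'Field 3',
  'Field 4, Field 5'
]

fields = cells.collect do |html|
  # Skip wrapper cells that still contain markup; split the rest on commas
  # and trim the whitespace left behind by ", ".
  html.split(',').map(&:strip) unless html =~ /</
end.flatten.compact

puts fields.inspect
```

The skipped wrapper cell becomes nil in the collected array, which is why `compact` is needed at the end.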
Q: I have an XML product feed that has some nodes that are always named the same, and some that can be different for different products. I know how to parse nodes when I know their names. How do I parse nodes when I don’t know in advance what they will be called? These “dynamic” nodes are always the children of a given node — how do I parse dynamically just for one node?
A: Looking for a solution to this same problem, I came up with traversing the document using #containers: “Return all children of this node which can contain other nodes. This is a good way to get all HTML elements which aren’t text, comment, doctype or processing instruction nodes.”
    doc.at(:parent_of_dynamic_nodes).containers.each do |node|
      # process node
    end
Q: How do I strip all HTML tags from a page?
A: Try the to_plain_text or inner_text methods.
    doc = Hpricot('<a href="http://www.math.com" title="1 > 2">1 is > 2')
    # Broken
    doc.to_s.gsub(/<\/?[^>]*>/, "") # => 2">1 is > 2
    # Better
    doc.to_plain_text # => 1 is > 2 [http://www.math.com]
    doc.inner_text # => 1 is > 2
Q: So, I’ve got an Hpricot::Elem, whose HTML looks like:
<ul> <ul> <li>A</li> </ul> <li>B</li> <li>C</li> </ul>
How does one find only its immediate children li’s (i.e. B and C, but not A)? For example, e.search("li") problematically gives me all of e’s descendants, not just immediate children. I want something like e.search("./li"), but that totally doesn’t work.
A: There are two possible selectors which may be used. The XPath selector would be /li. The CSS selector would be >li. Neither selector should have spaces in it. Spaces will trip up Hpricot 0.5.
When you continue a search from an element, that element is treated as a root node.
Q: Assume you have this HTML:
<body> <div class="test">one</div> <div class="test">two</div> <div class="test">three</div> </body>
I know I can select the first div element with the expression div.test:first-child, but how do I select the other two elements? I’d like to remove any divs of the test class which aren’t first children.
A: This is a perfect place to use the :not operator. This operator is listed among the Supported CSS Selectors.
E:not(s)
an E element that does not match simple selector s
We can negate the :first-child selector to select everything but the first child. Like this: div.test:not(:first-child).
Your removal code will end up like so:
    (doc/"div.test:not(:first-child)").remove
Q: Assume you have this HTML:
<a href="http://www.somewebsite.com">Click Me!</a>
I know how to search for an element based on its attributes, but is it possible to do a search using the tag’s inner_html? For example selecting all links that contain the text “Click.”
A: In Hpricot 0.5, you can use the text() selector just like any attribute.
click_links = doc.search("a[text()*='Click']")
Alternatively, in older Hpricots, you can simply scan the inner_text of selected elements. This is also handy if you want to search for a regular expression.
click_links = doc.search("a").select { |ele| ele.inner_text =~ /Click/ }
This approach should be no faster or slower than the first search. They both must scan each node individually.
Q: Given:
doc = Hpricot.parse(%{<div class='outer'><div class='inner'>text</div></div>})
How can I write a matches? method such that:
    doc.at('.outer').matches?('.inner') # => false
    doc.at('.inner').matches?('.inner') # => true
A: On further investigation, it appears that
    ! doc.at('.outer').search('../.inner').empty? # => false
    ! doc.at('.inner').search('../.outer').empty? # => true
Which is easy enough to wrap up in a method, once I’ve worked out where to put the method.
Q: Can I perform a single search and get all of the elements with “href” or “action” attributes? Something like this:
doc.search("[@href]|[@action]")
Similarly, is it possible to get all elements with both attributes present?
A: In recent Hpricots (after 2006 Mar 17), you can go ahead and use the search from the question: doc.search("[@href]|[@action]").
In earlier Hpricots, you’ll need to do two searches:
    ele = doc.search("[@href]")
    ele.push *doc.search("[@action]")
As for doing a search which finds elements with both attributes, you can go ahead and stack the search in newer Hpricots:
    doc.search("[@href][@action]")
Q: Can I search for elements where the attribute has a specific value and ignore case? I guess I would like to use something similar to XPath string functions to normalize text:
doc.search("span[lower-case(@title)='yes']")
A: I’ve come up with a half-assed solution that isn’t exactly valid XPath, but works. It also involves editing hpricot source :( !
I’ve added the following into elements.rb around line 473
    filter :contains_lowercase do |arg, ignore|
      html.include? arg.downcase
    end
    filter :contains_uppercase do |arg, ignore|
      html.include? arg.upcase
    end
You can then do the following:
    irb(main):010:0> doc/"strong:contains('one')"
    => #<Hpricot::Elements[{elem <strong> "this is strong one" </strong>}]>
    irb(main):011:0> doc/"strong:contains('ONE')"
    => #<Hpricot::Elements[]>
    irb(main):012:0> doc/"strong:contains_lowercase('ONE')"
    => #<Hpricot::Elements[{elem <strong> "this is strong one" </strong>}]>
Alternatively you can just alter the :contains filter, but you will lose the ability to do case-sensitive searches:
    filter :contains do |arg, ignore|
      html.include?(arg) || html.include?(arg.upcase) || html.include?(arg.downcase)
    end
You can then do the following:
    irb(main):008:0> (doc/"strong:contains('ONE')")
    => #<Hpricot::Elements[{elem <strong> "this is strong one" </strong>}, {elem <strong class="hard"> "this is strong one" </strong>}]>
    irb(main):009:0> (doc/"strong[@class='hard']:contains('ONE')")
    => #<Hpricot::Elements[{elem <strong class="hard"> "this is strong one" </strong>}]>
The same approach could be taken with gsub to implement something along the lines of a translate() function
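Filters that only try the argument as given, upper-cased, or lower-cased would presumably still miss mixed-case text such as "One". A minimal sketch of a comparison that normalizes both sides instead, shown on plain strings; the helper names are hypothetical and not part of Hpricot:

```ruby
# Hypothetical helpers, not part of Hpricot: compare after normalizing case.

# True when +needle+ occurs anywhere in +haystack+, ignoring case.
def contains_ignoring_case?(haystack, needle)
  haystack.downcase.include?(needle.downcase)
end

# True when an attribute value equals +expected+, ignoring case.
# A missing attribute (nil) never matches.
def equals_ignoring_case?(value, expected)
  !value.nil? && value.casecmp?(expected)
end

puts contains_ignoring_case?('this is strong One', 'ONE') # mixed case still matches
puts equals_ignoring_case?('YES', 'yes')
```

With Hpricot, predicates like these could probably be applied as a post-filter, in the same way an earlier answer on this page uses select on inner_text.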
Q: So, I’ve got an Hpricot::Elem, whose HTML looks like:
<div> <A>...</A> ... <B>...</B> <a name='articlestart'/> <C>...</C> ... <D>...</D> </div>
How do I find C to D? I suppose I somehow have to use preceding-sibling, but I can’t seem to figure out how…
A: An easy way is to use #following_siblings. (The opposite is #preceding_siblings.)
c_and_d = doc.at('//a[@name="articlestart"]').following_siblings
Q: How do you solve the preceding question when the tags are interspersed with text nodes? For example,
<C>...</C> Some text <tag> </tag> More text Even more text <D>...</D>
A: Use the next method instead of next_sibling in the previous code snippet. That will get text nodes as well as container nodes. Note that this will also include comments. next_node is an alias for next.
Here is an implementation of this along with an example web page.
Q: Seth wants to know, “How can i get a list of all non-text elements?”
A: Evan suggests perhaps:
doc.search("*").grep(Hpricot::Elem)
Marcos suggests:
doc.search("*").select{ |e| e.elem? }
(See also: finding text-only elements, the leaves on the html tree.)
Q: Say I needed to get the value of the href attribute in an <A> tag, how would I do it?
A: Use .first and then Hash syntax to get at the attributes.
doc.search('a').first[:href]
or if you have an element
(doc/:a).first[:href]
The confusing thing can be if you have some XML that only has one item in it. You still need to call .first so you’re working with a single element and not an array.
    doc = Hpricot.XML(open('http://feeds.feedburner.com/rubyonrailspodcast'))
    item = (doc/:item).first
    type = (item/:enclosure).first[:type] # => 'audio/mpeg'
Also, if you only want to get the first element, you can use % or at instead of / or search.
doc.at('a')[:href]
or
(doc % :a)[:href]
Q: What if I needed to get the value of all href attributes on a page?
A:
    doc.search('a[@href]').map { |x| x['href'] }
If anyone knows a more concise way please post.
Q: How would I use Hpricot through a proxy? Where would I setup the url and port before calling?
Hpricot(open('http://myurl'))
A: You need to tell open-uri about the proxy, not Hpricot. This works:
Hpricot(open('http://myurl', :proxy => 'http://myproxy:8080'))
Q: How would I go about locating and removing part of a string if the contents are all different or generated dynamically? Here’s the example:
<a href="out.php?id=1112&url=www.website.com"></a> <a href="out.php?id=2232&url=www.website.com"></a> <a href="out.php?id=3346&url=www.website.com"></a>
I would like to remove the part between php?id= and &url=.
A: Your question isn’t clear. If you want to obtain the values (as strings) 1112, 2232, 3346 then
    require 'cgi'
    doc.search('a[@href]').map { |x| CGI.parse(x['href'][/\?.+/][1..-1])['id'].first }
should do.
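Both readings of the question can be sketched on the href strings alone, without Hpricot. This version swaps in URI.decode_www_form from the standard library in place of CGI.parse; the helper names id_param and strip_id are hypothetical:

```ruby
require 'uri'

# Hypothetical helper: return the "id" query parameter of an href, or nil.
def id_param(href)
  query = href[/\?(.+)/, 1]              # text after the first "?"
  return nil unless query
  URI.decode_www_form(query).to_h['id']  # pairs like ["id", "1112"] -> hash lookup
end

# Hypothetical helper: blank out the value between "?id=" and "&url=".
def strip_id(href)
  href.sub(/(\?id=)[^&]*(&url=)/, '\1\2')
end

hrefs = [
  'out.php?id=1112&url=www.website.com',
  'out.php?id=2232&url=www.website.com',
  'out.php?id=3346&url=www.website.com'
]

ids      = hrefs.map { |href| id_param(href) }
stripped = hrefs.map { |href| strip_id(href) }

puts ids.inspect
puts stripped.inspect
```

In a real document the hrefs would come from something like the map over x['href'] shown in the answer above.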
Q: How would I get the text ‘sample text’ from the example below? inner_text returns texts from all tags and not from the actual node only.
<div id="myid"> <h4>title</h4> sample text </div>
A: The following returns an array of all text nodes that are children of the “myid” div. If you want a single string just join the array.
doc.at('#myid').children.select{|e| e.text?} # => ["\n ", "\n sample text\n"]
Q: Using just an XPath query, is it possible to return the value of an element’s attribute, such that:
<div id="foo"> Fnord! </div>
doc.someFuncIdontKnowYet("//div/@id") => "foo"
A:
(doc/"div[@id]").get_attribute("id")
Q: How can I retrieve the elements that have no attributes associated?
<div class="foo"></div> <div></div> <div class="bar"></div>
I would like to select the middle div with something like:
doc.search("//div[@=empty()]")
A: …
Q: How can I add embedded script, like PHP or Erb to an attribute without it being escaped?
    a = Hpricot("<a href=\"http://w.w.w/\">Hello</a>")
    => #<Hpricot::Doc {elem <a href="http://w.w.w/"> "Hello" </a>}>
    (a/"a").first[:href] = "<? echo 'boo' ?>"
    => "<? echo 'boo' ?>"
    a.to_html
    => "<a href=\"<? echo 'boo' ?>\">Hello</a>"
A: Use the element’s raw_attributes hash – it doesn’t escape anything.
Q: is there a way to use the wildcard character in doc/"" or doc.search for attributes?
More specifically I have a page where:
<a href="blah.com" id="p-1">Some text</a> <a href="blah.com" id="p-2">Some text</a> <a href="blah.com" id="p-3">Some text</a>
I am attempting to use pure xpath if at all possible, however I am willing to hear other suggestions even though I may have to rethink my design a little. There is enough other code on the page to make it difficult. You can see here:
doc/"a[@id=p-*]"
Would be the ideal statement.
A: Try the following:
doc/"a[@id^=p-]"
This operator matches the beginning of the string.
Q: What’s the best way to combine several HTML fragments into one tree, without just concatenating the strings?
Suppose I have fragment 1:
<p>This is a paragraph la la la.</p>
and fragment 2:
<ul><li>This is a test.</li><li>This is only a test.</li></ul>
What’s the best way to combine them into an Hpricot doc that contains:
<html> <p>This is a paragraph la la la.</p> <ul><li>This is a test.</li><li>This is only a test</li></ul> </html>
… without flattening to strings, concatenating them, and reparsing?
I’d like to stay in the Hpricot domain if possible. It seems to me that it’s much faster to just join the trees than to round-trip through the emitter and parser, and I’m also concerned about what would happen if some of the input is bogus and produces invalid nesting.
A: …
Q: What’s the best way to parse invalid HTML?
I am trying to extract the body of an HTML page, http://www.c2.com/cgi/wiki?AtsUserStories, with doc_content = doc.search('html/body'). The problem is that the page doesn’t have the <html> and </html> tags. That kind of problem happens to me a lot: pages that don’t have </body>, or where <head> comes before <html>. I thought Hpricot already dealt with that kind of problem, but that isn’t happening here.
So, how can I deal with that kind of problem? Thanks!
A: On pages that I’ve been using this on, I’ve just tried to make it as valid as possible with regular expressions before throwing it into Hpricot. For instance, if it’s missing the <html></html> tags like on that page, insert them first before bringing it in. You could insert a </body> the same way by just searching for a </html>, or insert a </body></html> if both are missing.
    doc = open(weburl).read
    doc = doc.sub(/\A/, "<html>")   # prepend to the whole document, not just the first line
    doc = doc.sub(/\z/, "</html>")  # append at the very end
    doc = Hpricot(doc)
If you can find where the body is supposed to be, I’d try to insert that with a regexp. I’ve even removed some tags that were mostly useless to the output rendering (some unmatched tags) because I couldn’t get to the source. Ended up with a happy Hpricot.
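A minimal sketch of that clean-up step as a reusable function, which adds the wrapper tags only when they are actually missing. The helper name ensure_html_wrapper and the sample page string are assumptions for illustration, and a page with a doctype would need extra care since the opening tag is simply prepended:

```ruby
# Hypothetical helper: add <html> / </html> only when they are missing.
def ensure_html_wrapper(source)
  fixed = source.dup
  fixed = "<html>#{fixed}" unless fixed =~ /<html[\s>]/i
  fixed = "#{fixed}</html>" unless fixed =~ %r{</html\s*>}i
  fixed
end

# Sample input (made up): a page with no <html> element at all.
page    = "<head><title>Example</title></head><body><p>text</p></body>"
wrapped = ensure_html_wrapper(page)

puts wrapped
```

The result can then be handed to Hpricot as in the answer above. Running the helper a second time leaves the string unchanged, so it is safe to apply to pages that are already well formed.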
[This question wasn’t formatted correctly, I’ve taken a guess at the intention in transferring over the wiki page]
Q: Hpricot seems to output XHTML instead of HTML by default. Is there a way to force HTML?
For example:
Hpricot('<br>').to_s
returns <br /> and not <br> like I wanted.
A: …
Q: Hpricot (or perhaps it’s Ruby in general?) seems to struggle with character encoding. When using Hpricot with documents that contain “funny” characters such as `
, the results are wonky. Does anyone have any advice on how to deal with this?
Q: Given two elements A and B, how do I tell which comes before the other? If I could use something like the start position of the element in the HTML document I could compare the positions and figure it out, but there’s no such feature, or is there?
Q: Hpricot is throwing a warning, but doing what I want when I use :last-child. What gives?
Here is my code:
    require 'hpricot'
    foo = "<div class=\"blah\"> <p>test</p> <p>go away</p> </div>"
    doc = Hpricot(foo)
    (doc/"p:last-child").remove
It properly removes the last p element, but it gives the following warning:
c:/HIDDEN/gems/hpricot-0.6-x86-mswin32/lib/hpricot/elements.rb:429: warning: multiple values for a block parameter (2 for 1)
A: I had the same problem. When I changed to last-of-type instead of last-child the warning went away. Of course that won’t work if you really need the last child of any type and not the last p.
Q: My code should output a snippet of HTML for inclusion in the middle of a document.
Does Hpricot have a role? Can I use it to build my snippet in the abstract
(using code that knows nothing about HTML, but enough about Hpricot),
and then can I call Hpricot to output the HTML of the snippet?
Q: I’m dealing with data on HORRIBLY designed HTML pages, using tables and presentational elements for everything. What’s the best method to grab data from after a known element?
For example:
<tr><td class="oreinfoleft"><b class=ul>Veldspar</b><br> <b>Units per batch:</b> 333<br> <b>Volume:</b> 0.1<br> <b>Cargo per batch:</b> 33.33 <td class="oreinforight"><b>Minerals:</b> Tritanium 100%<br> <b>Variations:</b> Concentrated Veldspar (+5%), Dense Veldspar (+10%)<br> <b>Found in:</b> 1.0<br> <font class=comment>Veldspar has the best cargo/mineral rate for tritanium</font> </tr>
How can I get ahold of, say, the ‘volume’ of 0.1? It’s outside any element except the TD itself – I guess I’m looking for a pseudo-selector of some sort for :test combined with a selector for :after, so I can grab whatever text is directly after a given <b> but before any later element.
A: I would pull out the text of the td with Hpricot, then just make a regular expression to try to get the volume, since it’s not labelled any other particular way.
    tdvalue = doc.search("//td[@class='oreinfoleft']").inner_html
    volume = tdvalue.match(/Volume:<\/b>(.*)<br/)
    puts volume[1]
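The regular-expression half of this answer can be sketched on the cell's HTML as a plain string. The helper name labelled_value is hypothetical, and the sample string assumes inner_html returns the markup roughly as written in the question:

```ruby
# Hypothetical helper: return the text that follows "<b>Label:</b>" up to the
# next tag, or nil when the label is not present.
def labelled_value(html, label)
  match = html.match(/<b>#{Regexp.escape(label)}:<\/b>\s*([^<]*)/i)
  match && match[1].strip
end

# Assumed inner_html of the "oreinfoleft" cell from the question.
td_html = '<b class=ul>Veldspar</b><br> <b>Units per batch:</b> 333<br> ' \
          '<b>Volume:</b> 0.1<br> <b>Cargo per batch:</b> 33.33'

puts labelled_value(td_html, 'Volume')
puts labelled_value(td_html, 'Units per batch')
```

Capturing up to the next "<" rather than up to a literal "<br" also covers the last value in the cell, which has no trailing tag.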
Q: If I have something like:
<form> <input name="1" /> <input name="2" /> <input name="3" /> </form>
And I want the second input, named 2.
Will search always return the inputs in the same order: <input name="1" /><input name="2" /><input name="3" />?
So would:
x = doc.search("form input")[1]
… always assign <input name="2" /> to x?
If not, how would I do so?
I tried:
doc.at("form input:nth-child(1)")
But I get nil.
Q: Hpricot seems like a very cool tool, but I just ran into it and don’t grok Hpricot (yet).
I’m trying to extract data from a text field populated by tinymce so I can then translate it for pdf output. I have snippets of code that will let me extract sets of data like the table header row values (column headers), the table row row values (the data), items in an ordered or unordered list, etc.
What I’ve done is walk the html stored in the db using doc.each_child
with a recursive method that recognizes “key” elements using elem.stag.name to direct actions in the method. It does pay to look behind the doc and scrounge around in the actual Hpricot code itself (I’m curious why some sections in code are marked nodoc by various authors – that’s usually where I find my questions answered).
Q: I have an XML (gnucash data, to be precise) which has one tag called <act:parent>. I want to extract the content of this tag, but if I use the XPath "//act:parent" I find nothing.
I assume that this is a conflict with the parent axis. I don’t know if this is a bug of hpricot, or if hpricot is behaving right but I should “escape” somehow these “reserved words”. I’m pretty new to XPath and XML.
Q: I am downloading a series of WSDLs and I need to download all imported files such as additional WSDLs and schemas. The import element appears in both the WSDL and schema namespaces so it can be in at least two namespaces. Add to that the arbitrary namespace prefixes and the elements can appear under any imaginable namespace prefix (x, xs, xsd, auntdahliasankle, etc.). I need to be able to select all of the import elements in an XML file regardless of the namespace prefix.
<xsd:import schemalocation="bertie.xsd"> <wsdl:import location="jeeves.wsdl"> <x:import schemalocation="gussie.xsd"> <xs:import schemalocation="bingo.xsd"> <jam:import schemalocation="cyrus.xsd">
All these should come back in the search.
Workaround:
I did a search on the attributes location and schemalocation and pulled back all elements that had those attributes. It worked but lacked umph. Namespaces are eeeeeeeeeeeevil.
Q: I want to get a namespaced element (e.g. <evil:Tag>FAIL</evil:Tag>).
A: You can use the %() method: doc.%('evil:tag'). Beware that Hpricot seems to downcase tag names. Credit: Garrick van Buren.
Q: I’m trying to access a page that requires a POST variable, e.g. some URL http://www.someserver.com/somefile.html and I want to pass a variable named foo of value “bar” using an HTTP POST request (I already tried substituting an HTTP GET request; it didn’t work).
A: Use Net::HTTP:
    require 'net/http'
    require 'hpricot'
    uri = URI.parse('http://www.someserver.com/somefile.html')
    doc = Hpricot(Net::HTTP.post_form(uri, { :foo => "bar" }).body)
Q: Could be guessed, but just in case…
    (@doc/"item | entry").each do |stuff|
      # what you'll choose to parse…
    end
Why do some FeedBurner feeds use the first and others the second?
Q: How can I find the inner_html of all text elements (e.g h1, h2, …, p, label, …)
It’s easy to get all the elements when the html elements aren’t nested, but as soon as you nest elements (what you surely will do) it gets tough.
    (@doc/"html/body/").each do |e|
      puts e.inner_html
    end
A: Basically the same as Retrieving non-text elements above:
doc.search("*").grep(Hpricot::Text)
Q: I have a list of XML tags with height attributes such as the following:
<image height="300">x</image> <image height="200">y</image>
Is it possible to select the image with the largest attribute, i.e. the one with height=300 in this case?