Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace DOM2DTM with DOM2DTMExt the patched version of DOM2DTM (fixes 1749) #1792

Merged
merged 1 commit into from
Dec 2, 2018

Conversation

jvshahid
Copy link
Member

@jvshahid jvshahid commented Sep 5, 2018

Looks like we replaced DOM2DTM with DOM2DTMExt when we fixed
#1320 but forgot to replace it in
the DOM2DTM manager

fixes #1749

@jeremyhaile
Copy link

@jvshahid thank you so much for this fix. I can confirm it fixes the memory issues with nested XPATH searching. However, there is still a pretty large and possibly exponential performance penalty for nested XPATH searches.

Here is the test script I am using:

require 'nokogiri'

items = %Q(
<div class="test">
  <a href="something">derp</a>
</div>
) * ARGV[0].to_i

html = %Q(
<html>
  <body>
    #{items}
  </body>
</html>
)
doc = Nokogiri::HTML html
puts "pid is #{$$}"

start = Time.now
links = doc.css('html body .test a')
puts "Without nesting got #{links.length} links in #{Time.now - start}s" # This will return immediately

start = Time.now
links = doc.css('html body .test').css('a')
puts "With nesting got #{links.length} links in #{Time.now - start}s" # This will take a LONG time to finish

Here is the output with 5000 divs:

pid is 49017
Without nesting got 5000 links in 0.225391s
With nesting got 5000 links in 2.666568s

And here is the output with 50,000 divs:

pid is 49113
Without nesting got 50000 links in 0.7900900000000001s
With nesting got 50000 links in 296.062414s

So in both cases the nested version is several orders of magnitude slower than the unnested version. And it seems to get exponentially worse for larger numbers of matched elements.

This should definitely be merged in to fix the memory issue, as crashing the JVM is much worse than being slow. However there is still something wrong with nokogiri java's handling of nested XPATH queries.

Any idea why the performance might be so bad? Thanks again!

@jvshahid
Copy link
Member Author

jvshahid commented Sep 6, 2018

I think this is a result of the NodeSet implementation in JRuby which uses RubyArray under the hood. This causes the exponential behavior you are seeing as it tries to calculate the union in .css('a') for each node in the result of .css('html body .test'). As a workaround you can try using some variation of the following instead:

links = doc.css('html body .test').map do |node|
  node.css('a')
end
puts "finished"
links = Nokogiri::XML::NodeSet.new doc, links.flatten

I will probably open a new issue to track the exponential behavior of NodeSet and merge this PR to close #1749

@jeremyhaile
Copy link

@jvshahid thanks for the explanation and workaround. Unfortunately for us the badly performant code is in a third-party library that uses nokogiri, although we could fork it and modify.

Please let me know when you open the new issue and I'll subscribe there too. Thanks again!

@jvshahid
Copy link
Member Author

Just a quick update. I looked at libxml source code and it looks like internally it is using an array similar to the current implementation in JRuby. I am still not sure why the JRuby implementation is that slow.

@jvshahid
Copy link
Member Author

created #1795 to track the performance of XMLNodeSet. This PR is ready to merge AFAIK.

@flavorjones
Copy link
Member

Going to rebase this on master to get tests to pass.

@flavorjones flavorjones changed the title Replace DOM2DTM with DOM2DTMExt the patched version of DOM2DTM Replace DOM2DTM with DOM2DTMExt the patched version of DOM2DTM (1749) Dec 1, 2018
@flavorjones flavorjones changed the title Replace DOM2DTM with DOM2DTMExt the patched version of DOM2DTM (1749) Replace DOM2DTM with DOM2DTMExt the patched version of DOM2DTM (fixes 1749) Dec 1, 2018
Looks like we replaced DOM2DTM with DOM2DTMExt when we fixed
#1320 but forgot to replace it in
the DOM2DTM manager

fixes #1749
@WoolenWang
Copy link

why this old issue has merge and have the same issue #1873

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JRuby XPATH Memory Usage
4 participants