-
-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xpath() around 10x slower in JRuby #741
Comments
Can you send an example document and code sample that does xpath queries similar to yours. |
Here is an example repo that exposes the problem https://github.com/TV4/nokogiri-xpath-issue-741 The repo includes an example xml file and code that exposes the problem. |
Using the example code above, I found that this is a regression introduced with nokogiri 1.5.0. It runs much faster with nokogiri 1.4.7 |
thanks, I'll take a look. |
This isn't really a regression, Nokogiri used FFI prior to 1.5.0. Newer versions of Nokogiri have a pure java implementation which looks to be awfully slower than the C implementation. @yokolet this looks like a performance issue with Java's JAXP XPath implementation, see this bug report. What do you think ? |
hmmm, I agree with Java's XPath implementation is very slow. I did some research. There might be a couple of ways to improve performance. Also, we can cache compiled XPathExpression. Nokogiri's xpath method calls XmlXpathContext.evaluate method, which compiles a given xpath. The example code repeatedly calls xpath method with the same xpath string. Our implementation compiles the same xpath every time outer each gives a next element. We can avoid this. |
did you mean XmlXpathContext.compile ? if that's the case then i don't |
I know the real bottle neck is xpath evaluation, which is on line 143 in XmlXpathContext.java. Do you think we can improve xpath evaluation itself? I think that's very hard. So, I thought of other options we might improve performance. Slight possibility is Xerces/JAXP options such as om.sun.org.apache.xml.internal.dtm.DTMManage might help performance. |
i looked further into this problem, one of the problems is that we're creating a new XPathContext every time we evaluate the xpath expression which is very expensive. I looked at a couple of libraries and Xalan seems to be the best option compatible with DOM (i.e. doesn't use an internal implementation). One solution is to use CachedXpathAPI which has the side effect of not exposing setter functions for function and variable resolvers. The other option is to use the internal api (may be subclass CachedXpathAPI). The only drawback is that Nokogiri have to maintain the XPathContext and invalidates it when the document changes, otherwise xpath expressions may return invalid results. I tested the CachedXpathContext on my machine and ran the script in 23 seconds on JRuby as opposed to 0.7 seconds on MRI. This is better than the current state of the java implementation of Nokogiri which runs the script in 130 seconds on my machine. what do you guys think ? |
I've seen this in other platforms/languages, and providing the ability to pre-compile and use cached xpath expressions can greatly speed up processing time. I'm currently working on a project where we are running 10-20 xpath expressions per document, over 1M documents. The penalty for compiling the xpath expression string every time is sever (20M invocations). I'd propose exposing an interface to allow xpath compilation, and then updating the element.xpath method to take either strings or pre-compiled xpath expressions. |
The much slower performance of jruby nokogiri is a problem for cross-platform use. Moving from FFI to native java, one would have thought, would make nokogiri work more reliably and higher performance on jruby -- but it's had the opposite effect, making me not want to use nokogiri on jruby due to performance, and making nokogiri even less suitable for a cross-platform solution. Any hope for any work on this? |
Hi Jonathan, Are you trolling? I honestly can't tell. The FFI port was deprecated a few The state of the world is, that there's one feature, admittedly a -mike On Tue, Jul 9, 2013 at 1:20 PM, Jonathan Rochkind
|
Leaving FFI for native Java did make nokogiri unusable for us, that's why I posted this bug in the first place. These days we no longer need to parse XML documents, otherwise we'd still be stuck at 1.4. |
I made some major modifications to the JRuby xpath code. The code on the
Currently the entire test suite passes. We had good test coverage on most of the operations that modify the dom, but I'm worried I missed a couple of them. @yokolet can you help eyeball the changes. By the way, the test time went down from 80 seconds to 6 seconds. It's still slower than MRI (which takes less than a second), but I think it's good enough for the first pass. Cheers, |
For future reference, would just using css instead of xpath be faster? How does that leave jruby vs MRI? |
On Fri, Nov 15, 2013 at 1:42 PM, Ryan Taylor notifications@github.comwrote:
No. Nokogiri compiles CSS queries to XPath before execution, as underlying
|
Was there any regression or is it just because Travis has become slow in performance? |
I believe this is Travis running on slow machines. The test suite takes
|
OK, let's see if extending the time limit to 20 on Travis CI works... (b094730) |
We should probably disable the test. I'm not really happy with the current way of measuring performance, it's very brittle and will break on slower machines. Is there a better way to do this ? |
@zenspider wrote http://www.zenspider.com/projects/minitest.html#benchmarks http://docs.seattlerb.org/minitest/Minitest/Benchmark.html Would that be appropriate? On Wed, Apr 9, 2014 at 11:57 AM, John Shahid notifications@github.comwrote:
|
I actually prefer this over timing the operation. The test will have to modified slightly to generate a xml document that's proportional to n. I'll try to get this done over the weekend and go through the backlog of the JRuby stories. |
Sounds great. Obviously, an absolute time limit wouldn't be the way to go. I tweaked the value specially for Travis CI for the moment, so unless this is a regression we can move forward to the upcoming release. |
@jvshahid this test is failing in the concourse pipelines, because of noisy neighbor problems when all the tests are running at the same time. Is there a real worry about regression here? If not, maybe we just delete it rather than make it more complicated? I'm sensitive to the amount of work we'd need to put in to make this not-brittle. |
I'm not happy with that test either, but I don't know if there's a better way to test performance regression. I'm okay with deleting it. That code doesn't change frequently to justify having the brittle tests. |
Roger, gonna skip it for all platforms. |
See 2d2893a |
For some reason the xpath() method is a lot slower in nokogiri-java 1.5.5 than the MRI version.
I have code that processes around 1.5 MB of XML, using a lot of nested xpath loops. It takes only 1.7 seconds in MRI but around 150 seconds in JRuby 1.6.7.2 on Java 7. Running with VisualVM shows that almost all time is spent in XmlXpathContext.tryGetNodeSet().
I think this is a regression, because I don't recall this being so slow with older versions of Nokogiri.
The text was updated successfully, but these errors were encountered: