-
-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nokogiri.parse accepts Pathname object as argument, but only work with small files then. #1821
Conversation
…ds to read, but reads allways from beginning.
Thank you for submitting this! I will take a look as soon as I can. |
@phokz Thanks again for submitting this. I'm looking at the implementation for The fix you're suggesting in this PR, though, has one drawback that I think can be avoided: it reads the entire document into memory, which for large docs can be a performance penalty (memory usage). Instead what I'd like to do is open a I'll submit my changes as a separate PR and then we can chat about the advantages. Either way, I'd like to credit you in the CHANGELOG and use your tests, so you'll have a commit in git history as well. |
There are also a few other places in the code where we're calling |
Hmm. Thinking about this a bit more, and this feels like we're adding a new feature, which is to allow @tenderlove I'm curious what you think about this. I ❤️ ducktyping, but here Pathname is not obeying the If we do that, then |
Hi, thank you for diving into this issue. My solution was a quick hack. I quite like checking kind_of?(IO) way. Best regards |
@phokz Thanks, I think I will schedule a fix for v1.11.0. |
I think we may want to check for |
Interestingly, StringIO does not implement |
See additional context from user help request at https://gitter.im/sparklemotion/nokogiri?at=5f5a485bf969413294d84734 hello = Pathname("hello.xml")
puts hello.read
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
</project>
# => nil
require 'nokogiri'
# => true
doc = Nokogiri::XML(hello)
puts doc
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
</project>
<?xml version="1.0" encoding="UTF-8"?>
# => nil |
Crashing through the wall from #2110; if I was gonna do this I'd explicitly check |
- no need to resolve Pathname in Nokogiri.parse - added test coverage for XML::Document and HTML::Document#parse - moved resolution of Pathname to after `url` is set in HTML::Document#parse - open files with implict "r" because "b" assumes ASCII-8BIT encoding Part of sparklemotion#2110, sparklemotion#1821
- no need to resolve Pathname in Nokogiri.parse - added test coverage for XML::Document and HTML::Document#parse - moved resolution of Pathname to after `url` is set in HTML::Document#parse - open files with implict "r" because "b" assumes ASCII-8BIT encoding Part of sparklemotion#2110, sparklemotion#1821
I hit this yesterday wit code like
doc = Nokogiri.parse(Rails.root.join('config','file.xml'))
when the file.xml was smaller than about 4k bytes, it worked as a charm.
With larger files, weird behavior was happening. E.g. resulting parsed document
was containing duplicate records.
Then I found this:
p = Pathname.new('/etc/passwd')
[p.read(10), p.read(10)]
=> ["root:x:0:0", "root:x:0:0"]
Nokogiri accepts either string or whatever responds to 'read' method.
Pathname objects respond to read, but they do not keep state or file opened.
Nokogiri reads input in chunks, calling read(SOME_CHUNK_SIZE) possibly several times, based on file size. Pathname#read allways reads from the beginning of file.
With correct alignment of last element before SOME_CHUNK_SIZE boundary, this could be possibly
used to DoS attack that causes OOM (demo attached in commit d90bf40 in my fork).
In PR there are failing tests and proposed fix. It might not be the best approach to fix this.
Maybe other solution would be just to raise an exception when object in string_or_io has none of well known ancestors like String, File, StringIO ...