[bug] Empty HTML4 DocumentFragment serialization doesn't respect encoding #2649
Description
Please describe the bug
Parsing and serializing a HTML4::DocumentFragment
will produce a UTF-8
string by default. However, when the input string is empty, it produces a US-ASCII
encoded string instead, regardless of the passed encoding option.
Let me know if this is actually expected behaviour and something I should be working around!
Help us reproduce what you're seeing
This script produces a failing test:
#! /usr/bin/env ruby
require 'nokogiri'
require 'minitest/autorun'
class Test < MiniTest::Spec
describe "HTML4::DocumentFragment" do
it "should respect UTF-8 encoding for empty strings" do
doc = Nokogiri::HTML::DocumentFragment.parse("", "UTF-8")
assert_equal "UTF-8", doc.to_html.encoding
end
end
end
Output:
# Running:
F
Finished in 0.001808s, 553.0974 runs/s, 553.0974 assertions/s.
1) Failure:
HTML4::DocumentFragment#test_0001_should respect UTF-8 encoding for empty strings [test.rb:11]:
Expected: "UTF-8"
Actual: #<Encoding:US-ASCII>
Environment
# Nokogiri (1.12.5)
---
warnings: []
nokogiri:
version: 1.12.5
cppflags:
- "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
- "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include"
- "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include/libxml2"
ldflags: []
ruby:
version: 2.6.3
platform: universal.x86_64-darwin20
gem_platform: universal-darwin-20
description: ruby 2.6.3p62 (2019-04-16 revision 67580) [universal.x86_64-darwin20]
engine: ruby
libxml:
source: packaged
precompiled: true
patches:
- 0001-Remove-script-macro-support.patch
- 0002-Update-entities-to-remove-handling-of-ssi.patch
- 0003-libxml2.la-is-in-top_builddir.patch
- 0004-use-glibc-strlen.patch
- 0005-avoid-isnan-isinf.patch
- 0006-update-automake-files-for-arm64.patch
- 0007-Fix-XPath-recursion-limit.patch
libxml2_path: "/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
memory_management: ruby
iconv_enabled: true
compiled: 2.9.12
loaded: 2.9.12
libxslt:
source: packaged
precompiled: true
patches:
- 0001-update-automake-files-for-arm64.patch
- 0002-Fix-xml2-config-check-in-configure-script.patch
datetime_enabled: true
compiled: 1.1.34
loaded: 1.1.34
other_libraries:
zlib: 1.2.11
libiconv: '1.15'
libgumbo: 1.0.0-nokogiri
Additional information
From what I can tell, the issue comes from here: https://github.com/sparklemotion/nokogiri/blob/main/lib/nokogiri/xml/node_set.rb#L283
When there's a single node (which there always is for proper documents), that node sets the encoding properly. But when there are no nodes, we're effectively returning [].join
, which is US-ASCII
encoding.