Skip to content

[bug] Empty HTML4 DocumentFragment serialization doesn't respect encoding #2649

Closed
@sgoedecke

Description

Please describe the bug

Parsing and serializing a HTML4::DocumentFragment will produce a UTF-8 string by default. However, when the input string is empty, it produces a US-ASCII encoded string instead, regardless of the passed encoding option.

Let me know if this is actually expected behaviour and something I should be working around!

Help us reproduce what you're seeing

This script produces a failing test:

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

class Test < MiniTest::Spec
  describe "HTML4::DocumentFragment" do
    it "should respect UTF-8 encoding for empty strings" do
      doc = Nokogiri::HTML::DocumentFragment.parse("", "UTF-8")
      
      assert_equal "UTF-8", doc.to_html.encoding
    end
  end
end

Output:

# Running:

F

Finished in 0.001808s, 553.0974 runs/s, 553.0974 assertions/s.

  1) Failure:
HTML4::DocumentFragment#test_0001_should respect UTF-8 encoding for empty strings [test.rb:11]:
Expected: "UTF-8"
  Actual: #<Encoding:US-ASCII>


Environment

# Nokogiri (1.12.5)
    ---
    warnings: []
    nokogiri:
      version: 1.12.5
      cppflags:
      - "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
      - "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include"
      - "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.6.3
      platform: universal.x86_64-darwin20
      gem_platform: universal-darwin-20
      description: ruby 2.6.3p62 (2019-04-16 revision 67580) [universal.x86_64-darwin20]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - 0007-Fix-XPath-recursion-limit.patch
      libxml2_path: "/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.12
      loaded: 2.9.12
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      - 0002-Fix-xml2-config-check-in-configure-script.patch
      datetime_enabled: true
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11
      libiconv: '1.15'
      libgumbo: 1.0.0-nokogiri

Additional information

From what I can tell, the issue comes from here: https://github.com/sparklemotion/nokogiri/blob/main/lib/nokogiri/xml/node_set.rb#L283

When there's a single node (which there always is for proper documents), that node sets the encoding properly. But when there are no nodes, we're effectively returning [].join, which is US-ASCII encoding.

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions