How to handle undefined conversions? #12

Open
@krepflap

Description

I was wondering how to replace characters that have no mapping in the destination encoding with a substitute character, e.g. when converting the euro sign (€) to Shift JIS.

In Ruby, we can do this:

"xx€xx".encode('SHIFT_JIS', 'UTF-8', undef: :replace)
=> "xx?xx"

The €, which cannot be converted, is replaced by a "?" character. This is important when doing text comparison; see https://unicode.org/reports/tr36/#Text_Comparison

When converting charsets, never simply omit characters that cannot be converted; at least substitute U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes) to reduce security problems.
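
For reference, Ruby's `String#encode` also accepts a `replace:` option, so the 0x1A substitute recommended above can be used instead of the default "?". A minimal sketch, assuming a UTF-8 source file; the helper name `to_shift_jis` and the constant `SUB` are made up for illustration:

```ruby
# ASCII SUBSTITUTE (0x1A), the byte recommended above for
# conversions from Unicode to a byte encoding.
SUB = "\x1A"

# Hypothetical helper: converts a UTF-8 string to Shift JIS and
# substitutes `replacement` for characters with no Shift JIS mapping.
def to_shift_jis(text, replacement = SUB)
  text.encode("Shift_JIS", "UTF-8", undef: :replace, replace: replacement)
end

converted = to_shift_jis("xx€xx")
p converted.bytes
```

Here the € should show up as the single byte 26 (0x1A) rather than being dropped or turned into "?".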

Can we do this with the iconv library in Elixir/Erlang? Currently the undefined character is simply omitted. I guess I could convert character by character and check whether the result is an empty string, but I was hoping there is something more elegant.

Activity

krepflap commented on Jun 28, 2019

If anyone stumbles upon this: I'm using the following to handle the case above, though it does call :iconv.convert once for every character.

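  # Converts a UTF-8 string to Shift JIS, substituting "?" for any
  # character that has no Shift JIS mapping.
  defp to_shift_jis(input) do
    convert = fn x ->
      # An unconvertible code point is omitted, so the result is "".
      case :iconv.convert("utf-8", "shift-jis", <<x::utf8>>) do
        "" -> "?"
        c -> c
      end
    end

    # Convert one code point at a time and collect the results into a binary.
    for <<c::utf8 <- input>>, do: convert.(c), into: ""
  end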

      How to handle undefined conversions? · Issue #12 · processone/iconv