Struct.UTFString.get() fails for UTF-16 #30

Open
@blschatz

Description

This fails because of the underlying call to IO.getZeroTerminatedByteArray, which stops at the first single null byte. For wide charsets such as UTF-16 it should be looking for a double-null terminator instead.

Activity

headius (Member) commented on Apr 23, 2015

This should probably be using Java's charset logic to decode. Will investigate.

headius (Member) commented on Apr 23, 2015

Ahh I see, it's just looking for the nulls to peel them off. Will see what I can do.

headius (Member) commented on Apr 23, 2015

Ok, I understand now.

getZeroTerminatedByteArray is used to return the bytes of a string sans the null terminator. It does this by taking the given string address and calling strlen on it. strlen only looks for \0, and then that length is used to allocate and populate a Java byte array.

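This is a problem whenever there are embedded null bytes, which is always the case for UTF-16 text in the ASCII range: each such character is encoded as one non-zero byte plus one zero byte, so strlen stops at or before the end of the first character.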
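
To illustrate the point, here is a minimal sketch in plain Java with no jnr-ffi involved. The class and the `strlenLike` method are hypothetical stand-ins for the native strlen call, not part of the library:

```java
import java.nio.charset.StandardCharsets;

public class StrlenOnUtf16 {
    // Mimics C strlen on a bounded buffer: counts the bytes that come
    // before the first single zero byte. (Hypothetical helper.)
    public static int strlenLike(byte[] memory) {
        int n = 0;
        while (n < memory.length && memory[n] != 0) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // The same null-terminated text in three encodings.
        byte[] ascii = "Hi\0".getBytes(StandardCharsets.US_ASCII);
        byte[] utf16le = "Hi\0".getBytes(StandardCharsets.UTF_16LE);
        byte[] utf16be = "Hi\0".getBytes(StandardCharsets.UTF_16BE);

        System.out.println(strlenLike(ascii));   // 2: both characters found
        System.out.println(strlenLike(utf16le)); // 1: 'H' is 0x48 0x00
        System.out.println(strlenLike(utf16be)); // 0: 'H' is 0x00 0x48
    }
}
```

With a little-endian encoding the scan gives up after the first byte, and with big-endian it returns an empty result, which matches the failure reported here.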

This is going to be a much more difficult fix, since the actual strlen call happens inside native code. Whenever we change native code, we need to rebuild the native stubs across platforms.

I'm also not sure that just changing strlen is the right fix. These functions have no way of knowing what encoding the bytes are in.

Here's what I think we should do:

  1. As a workaround, you could work with the strings as bytes and deal with the nulls yourself. Not ideal, I know.
  2. Add a second version of this logic that takes either an encoding or an explicit terminator to look for, along the lines of getTerminatedByteArray(addr, [terminator|encoding]).
  3. Finally figure out how to set up VMs for all the platforms we support, so we can more easily update the native bits (ping @tduehr).
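
Options 1 and 2 could look roughly like the following in plain Java. This is only a sketch under stated assumptions: it works on a byte array that has already been copied out of native memory with a known maximum size, the class and both helpers are hypothetical, and the terminator width is guessed from the charset name rather than taken from any jnr-ffi API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TerminatedBytes {
    // Hypothetical: guess the terminator width in bytes from the charset name.
    public static int terminatorWidth(Charset cs) {
        String name = cs.name();
        if (name.startsWith("UTF-16")) return 2;
        if (name.startsWith("UTF-32")) return 4;
        return 1; // single-byte and UTF-8 style terminator
    }

    // Hypothetical: returns the bytes before the first run of `width` zero
    // bytes that starts on a multiple of `width`. If no terminator is found,
    // returns the buffer truncated to a whole number of code units.
    public static byte[] getTerminatedByteArray(byte[] memory, int width) {
        for (int i = 0; i + width <= memory.length; i += width) {
            boolean allZero = true;
            for (int j = 0; j < width; j++) {
                if (memory[i + j] != 0) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                return Arrays.copyOf(memory, i);
            }
        }
        return Arrays.copyOf(memory, memory.length - memory.length % width);
    }

    public static String decode(byte[] memory, Charset cs) {
        return new String(getTerminatedByteArray(memory, terminatorWidth(cs)), cs);
    }

    public static void main(String[] args) {
        // A 16-byte zero-padded field holding "a" followed by U+0100.
        // In UTF-16LE that is 61 00 00 01, so bytes 1-2 are both zero
        // but do not start on a 2-byte boundary.
        byte[] field = Arrays.copyOf("a\u0100".getBytes(StandardCharsets.UTF_16LE), 16);
        System.out.println(decode(field, StandardCharsets.UTF_16LE).length()); // 2
    }
}
```

Scanning in steps of the terminator width matters: a byte-by-byte search for two consecutive zeros would cut the example string short after the first character.
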
blschatz (Author) commented on Apr 27, 2015

My fix was as follows:

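import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Meant to be declared as an inner class of a jnr.ffi.Struct subclass,
// so that String resolves to Struct.String rather than java.lang.String.
public class UTF16String extends String {

    // length is the size of the field in bytes; the super constructor
    // appears to take size and alignment in bits, hence length * 8.
    public UTF16String(int length, Charset cs) {
        super(length * 8, 8, length, cs);
    }

    protected jnr.ffi.Pointer getStringMemory() {
        return getMemory().slice(offset(), length());
    }

    public final void set(java.lang.String value) {
        getStringMemory().putString(0, value, length, charset);
    }

    public final java.lang.String get() {
        jnr.ffi.Pointer memory = getStringMemory();
        byte[] bytes = new byte[length];
        memory.get(0, bytes, 0, length);

        // Find the two-byte null terminator, scanning on 16-bit boundaries.
        // The i + 1 bound keeps an odd length from reading past the array,
        // and the default drops a trailing half code unit if none is found.
        int nullPos = bytes.length - (bytes.length % 2);
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            if (bytes[i] == 0 && bytes[i + 1] == 0) {
                nullPos = i;
                break;
            }
        }
        CharBuffer res = charset.decode(ByteBuffer.wrap(bytes, 0, nullPos));
        return res.toString();
    }
}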

headius (Member) commented on Sep 26, 2016

@blschatz Possible for you to turn that into a pull request we can integrate? I'm not sure how you're using that within jnr-ffi and your own code (i.e. I'd like to see some examples and ideally tests in a PR).

added a commit that references this issue on Jul 6, 2021
a88b8a6
linked a pull request that will close this issue on Jul 6, 2021
      Struct.UTFString.get() fails for UTF-16 · Issue #30 · jnr/jnr-ffi