-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle the case where a split is about to happen in the middle of a UTF-16 surrogate pair #399
base: main
Are you sure you want to change the base?
Conversation
…tf 16 surrogate pair
for reference: #386 |
I don't think that's the root issue, but nonetheless the fact that you've found a repeatable case is very helpful. I'll take a look at it during Thu-Fri. The fix will arrive as part of v0.18.1 release by the end of this week. |
@xyzhangfred wouldn't you mind attaching the test case to recreate the issue? What I did was using test case from original issue: let doc = Doc::with_options(Options {
// use UTF-16, since this is the offset kind that Yjs uses (because of UTF16 string encoding in JavaScript)
offset_kind: OffsetKind::Utf16,
..Default::default()
});
let txt = doc.get_or_insert_text("quill");
let mut txn = doc.transact_mut();
let str = "🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩";
println!("length: {}", str.len());
txt.insert(&mut txn, 0, str);
txt.remove_range(&mut txn, 0, 5);
// txt: 🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩 However this test doesn't bring any errors. I've also checked this in Yjs: const doc = new Y.Doc()
const txt = doc.getText('quill')
const str ="🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩"
txt.insert(0, str)
txt.delete(0, 5)
console.log(txt.toString()) // �🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩 Since the split point was defined in a middle of multi-code character it was replaced by surrogate. |
Maybe it only happens in the C FFI? I used the following code to reproduce a crash (replacing main.cpp in y-crdt/tests-ffi, compile and run). `#include <stdio.h> extern "C" { int main(){
}` |
After checking out this code more thoroughly, I see that your example couldn't work in the first place when using using the default OffsetKind (which targets UTF8 byte lenghts per character). Deleting byte-range of 0..5, when affected characters are 4 bytes long, means splitting a character in half which is not allowed operation in Rust. Changing OffsetKind to Utf16 (so the same lengths as Yjs/JavaScript uses) fixes the issue. Example: TEST_CASE("Unicode support") {
YOptions o = yoptions();
o.encoding = Y_OFFSET_UTF16;
YDoc* doc = ydoc_new_with_options(o);
Branch* txt = ytext(doc, "quill");
YTransaction* txn = ydoc_write_transaction(doc, 0, NULL);
ytext_insert(txt, txn, 0, u8"🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩", NULL);
ytext_remove_range(txt, txn, 0, 5);
char* actual = ytext_string(txt, txn);
REQUIRE(!strcmp(actual, u8"🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩"));
ystring_destroy(actual);
ytransaction_commit(txn);
ydoc_destroy(doc);
} |
I've identified an issue in the split_str function where splitting a string at an offset that divides a surrogate pair results in incorrect behavior. Specifically, when split_str is called with an offset that bisects a surrogate pair, the entire character is retained in the left part of the split, leaving the right part empty. This can lead to inaccuracies in length calculations ( there are places in the code where the "offset" is taken to be the correct length of the left splice, after the split) and potential empty strings where they're not expected.
Example:
Input: split_str("🌉", 1, OffsetKind::Utf16)
Current Output: left = "🌉", right = ""
Expected Behavior: Ideally, the function should either somehow handle surrogate pairs gracefully (although I am not sure if that's possible with rust),or handle it in a way that does not lead to incorrect string lengths/ block splice lengths.
Temporary Workaround:
I've implemented a temporary workaround that replaces problematic splits with empty strings. This approach prevents crashes and highlights the issue, though it's far from an ideal solution.
This workaround is intended as a stopgap measure. I welcome suggestions for a more elegant and robust solution to this problem. Please feel free to discuss this further or propose alternatives.