The json filter escapes the non ASCII characters by default. #737

Closed
JC-coder-lab opened this issue Dec 26, 2024 · 2 comments · Fixed by #739

Comments

JC-coder-lab commented Dec 26, 2024

By default, the json filter escapes non-ASCII characters, converting them into Unicode escape sequences (`\uXXXX`) when it serializes an object to JSON. In our use case, we use Fluid to construct prompts that are passed to a large language model, and the escaped sequences are difficult for the model to process. We would therefore prefer that the json filter leave non-ASCII characters unescaped during serialization.

For example, 你好,这是一条短信 ("Hello, this is a text message") is converted to \u4F60\u597D\uFF0C\u8FD9\u662F\u4E00\u6761\u77ED\u4FE1. The escaped form is harder for the model to understand, because models are pretrained mostly on plain text rather than on Unicode escape sequences.

It would be better if the json filter in the Fluid library allowed us to configure the serialization options for non-ASCII characters.
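
To make the difference concrete, here is a minimal sketch of the two serialization behaviors. It uses Python's standard `json` module purely as a stand-in, not Fluid's json filter or any .NET API; the `message` dictionary and its `text` key are made up for illustration. Python's `ensure_ascii` flag plays the role of the option being requested here, and Python happens to emit lowercase hex digits in its escapes.

```python
import json

# Hypothetical payload; the key name "text" is an assumption for illustration.
message = {"text": "你好,这是一条短信"}

# Default behavior: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(message)

# With ensure_ascii=False the characters are written out as-is.
readable = json.dumps(message, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data; only the textual form differs.
assert json.loads(escaped) == json.loads(readable)
```

Both outputs are valid JSON that decode to the same value, so the choice only affects how the text looks to whatever reads the raw string, which in this case is the model.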

@hishamco (Collaborator)

@sebastienros I think this is something that needs to be supported by default, unless you have a reason for not doing it.

I can submit a PR if you don't mind.

@hishamco (Collaborator)

I was waiting for your reply before submitting a PR, but it seems you did it instead :)
