Handling '\u0000' in Parquet Data Causes Error: Unsupported Unicode Escape Sequence #188
Open
Description
What happens?
Some specific data may cause errors(ERROR: unsupported Unicode escape sequence). The data contains '\u0000'.
pg_analytics=# select * from tulu_3_sft_mixture1 ;
ERROR: unsupported Unicode escape sequence
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: ...产生一种隐约的紧张感。因此,\u0000...
To Reproduce
After #187 merged. And load dataset on Huggingface.
OS:
x86
ParadeDB Version:
v0.2.4
Are you using ParadeDB Docker, Helm, or the extension(s) standalone?
ParadeDB pg_analytics Extension
Full Name:
kysshsy
Affiliation:
NA
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include the code required to reproduce the issue?
- Yes, I have
Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?
- Yes, I have