Handling '\u0000' in Parquet Data Causes Error: Unsupported Unicode Escape Sequence #188

Open
@kysshsy

Description

What happens?

Some data causes queries to fail with the error "unsupported Unicode escape sequence". The affected data contains '\u0000'.

pg_analytics=# select * from tulu_3_sft_mixture1 ;
ERROR:  unsupported Unicode escape sequence
DETAIL:  \u0000 cannot be converted to text.
CONTEXT:  JSON data, line 1: ...产生一种隐约的紧张感。因此,\u0000...
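
PostgreSQL's text type cannot store the NUL character (U+0000), which is why the conversion from JSON fails here. As a possible workaround until the extension handles this case, the data could be cleaned before it reaches PostgreSQL. The following is a minimal sketch in Python, assuming the values are available as JSON text; the helper names `strip_nul` and `sanitize_json_text` are hypothetical and are not part of pg_analytics.

```python
import json
from typing import Any

NUL = "\x00"  # the character that JSON writes as the escape \u0000


def strip_nul(value: Any) -> Any:
    """Recursively remove U+0000 from every string in a decoded JSON value."""
    if isinstance(value, str):
        return value.replace(NUL, "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings too, so they are cleaned as well.
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


def sanitize_json_text(raw: str) -> str:
    """Decode JSON text, drop U+0000 from its strings, and re-encode it."""
    return json.dumps(strip_nul(json.loads(raw)), ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical sample payload; the real rows come from the Parquet file.
    raw = '{"text": "before\\u0000after"}'
    print(sanitize_json_text(raw))
```

Decoding the JSON first, rather than deleting the six-character sequence \u0000 from the raw text, avoids corrupting values that contain an escaped backslash followed by "u0000". Note that this drops the character entirely, so the stored text no longer matches the source byte for byte.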

To Reproduce

With #187 merged, load the dataset from Hugging Face and run the query shown above.

OS:

x86

ParadeDB Version:

v0.2.4

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB pg_analytics Extension

Full Name:

kysshsy

Affiliation:

NA

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Labels

  • bug (Something isn't working)
  • good first issue (Good for newcomers)
  • priority-medium (Medium priority issue)
  • user-request (This issue was directly requested by a user)