paradedb.exists missing rows when querying text fields. #1994

Closed
@Moosieus

Description

What happens?

I run this query to get entries that have been transcribed:

SELECT
    c0."id", c0."filename", c0."call_length", c0."transcript"
FROM "calls" AS c0
WHERE
    (c0."id" @@@ paradedb.exists('transcript'::paradedb.fieldname))
    AND (c0."system_id" @@@ 'moco_md_ps')
ORDER BY c0."id" DESC;

For context, there's a brief window of time where a call has been uploaded but has yet to be transcribed. It's desirable not to display these.

ID 25062 is missing in the results, even though its transcript is present:

-[ RECORD 50 ]----------------------------------------------------------------------------------------------------------------------------------
id          | 25063
filename    | 6000-1733067200_852912500.1-call_36.wav
call_length | 2
transcript  | █████████████████████
-[ RECORD 51 ]----------------------------------------------------------------------------------------------------------------------------------
id          | 25061
filename    | 6020-1733067184_852912500.1-call_34.wav
call_length | 8
transcript  | ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████

I've redacted the transcripts for privacy's sake.

When I remove the paradedb.exists clause:

SELECT
    c0."id", c0."filename", c0."call_length", c0."transcript"
FROM "calls" AS c0
WHERE
    (c0."system_id" @@@ 'moco_md_ps')
ORDER BY c0."id" DESC;

ID 25062 appears as expected:

-[ RECORD 52 ]-----------------------------------------------------------------
id          | 25063
filename    | 6000-1733067200_852912500.1-call_36.wav
call_length | 2
transcript  | █████████████████████
-[ RECORD 53 ]-----------------------------------------------------------------
id          | 25062
filename    | 6000-1733067166_851337500.1-call_29.wav
call_length | 27
transcript  | ██████████████████████████████████████████████████████████████...
-[ RECORD 54 ]-----------------------------------------------------------------
id          | 25061
filename    | 6020-1733067184_852912500.1-call_34.wav
call_length | 8
transcript  | ██████████████████████████████████████████████████████████████...

The problem occurs regardless of whether the (c0."system_id" @@@ 'moco_md_ps') clause is present.

To Reproduce

I haven't figured out what exactly causes this to happen. It seems to affect longer passages of text, but perhaps that's just bias because their absence stands out more.
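One way to narrow this down, sketched here against the same "calls" table and reusing only the paradedb.exists call from the query above, is to list the rows that have a transcript in the table but are not returned by paradedb.exists, together with their transcript lengths. The column aliases are illustrative, and the sketch assumes "id" is never NULL.

```sql
-- Rows whose transcript is present in the heap but which
-- paradedb.exists does not report for the 'transcript' field.
SELECT
    c0."id",
    length(c0."transcript")       AS transcript_chars,  -- length in characters
    octet_length(c0."transcript") AS transcript_bytes   -- length in bytes (UTF-8)
FROM "calls" AS c0
WHERE
    c0."transcript" IS NOT NULL
    AND c0."id" NOT IN (
        SELECT c1."id"
        FROM "calls" AS c1
        WHERE c1."id" @@@ paradedb.exists('transcript'::paradedb.fieldname)
    )
ORDER BY transcript_chars;
```

If the shortest length in the result sits at a consistent threshold, that would point to a length-based cutoff rather than something in the content of the text.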

OS:

Ubuntu 24.04.1 LTS, x64, Ryzen 7900X

ParadeDB Version:

v0.13.0

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Docker Image

Full Name:

Cameron Duley

Affiliation:

My own behalf

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

Activity

neilyio (Contributor) commented on Dec 11, 2024

As discussed in our community Slack, this happens because the fast field normalizers use the same remove_long behavior as tokenizers, which by default removes strings that are 255 characters or longer.

While tokenizers offer a way to configure this, normalizers do not.
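To illustrate the threshold described above, here is a minimal sketch using plain PostgreSQL only. The calls_demo table and the likely_dropped_by_exists alias are hypothetical stand-ins, and the 255 cutoff is taken from the comment above rather than verified against the extension. Both character and byte lengths are shown because the underlying limit may be measured in bytes rather than characters.

```sql
-- Hypothetical stand-in for the real "calls" table.
CREATE TEMP TABLE calls_demo (
    id         int PRIMARY KEY,
    transcript text
);

INSERT INTO calls_demo (id, transcript) VALUES
    (1, NULL),              -- not yet transcribed
    (2, repeat('a', 100)),  -- short transcript
    (3, repeat('a', 255)),  -- exactly at the assumed threshold
    (4, repeat('a', 300));  -- longer transcript

-- Flag transcripts at or above the assumed 255 threshold; these are the
-- rows a length-limited normalizer would be expected to drop.
SELECT
    id,
    length(transcript)        AS chars,
    octet_length(transcript)  AS bytes,
    length(transcript) >= 255 AS likely_dropped_by_exists
FROM calls_demo
WHERE transcript IS NOT NULL
ORDER BY id;
```

Running the same SELECT against the real table should show whether the missing rows, such as ID 25062, are the ones at or above the threshold. As a possible workaround until normalizers become configurable, an ordinary `transcript IS NOT NULL` predicate does not depend on the fast field and so should not be subject to this limit.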

eeeebbbbrrrr (Collaborator) commented on Jan 24, 2025

I believe @neilyio's answer resolves this issue.

Metadata

Assignees

No one assigned

Labels

bug, priority-medium, user-request
