Description
What happens?
I run this query to get entries that have been transcribed:
SELECT
c0."id", c0."filename", c0."call_length", c0."transcript"
FROM "calls" AS c0
WHERE
(c0."id" @@@ paradedb.exists('transcript'::paradedb.fieldname))
AND (c0."system_id" @@@ 'moco_md_ps')
ORDER BY c0."id" DESC;
For context, there's a brief window of time where a call has been uploaded but has yet to be transcribed. It's desirable not to display these.
ID 25062 is missing from the results, even though its transcript is present:
-[ RECORD 50 ]----------------------------------------------------------------------------------------------------------------------------------
id | 25063
filename | 6000-1733067200_852912500.1-call_36.wav
call_length | 2
transcript | █████████████████████
-[ RECORD 51 ]----------------------------------------------------------------------------------------------------------------------------------
id | 25061
filename | 6020-1733067184_852912500.1-call_34.wav
call_length | 8
transcript | ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
I've redacted the transcripts for privacy's sake.
When I remove the paradedb.exists clause:
SELECT
c0."id", c0."filename", c0."call_length", c0."transcript"
FROM "calls" AS c0
WHERE
(c0."system_id" @@@ 'moco_md_ps')
ORDER BY c0."id" DESC;
ID 25062 appears as expected:
-[ RECORD 52 ]-----------------------------------------------------------------
id | 25063
filename | 6000-1733067200_852912500.1-call_36.wav
call_length | 2
transcript | █████████████████████
-[ RECORD 53 ]-----------------------------------------------------------------
id | 25062
filename | 6000-1733067166_851337500.1-call_29.wav
call_length | 27
transcript | ██████████████████████████████████████████████████████████████...
-[ RECORD 54 ]-----------------------------------------------------------------
id | 25061
filename | 6020-1733067184_852912500.1-call_34.wav
call_length | 8
transcript | ██████████████████████████████████████████████████████████████...
The problem occurs regardless of whether the (c0."system_id" @@@ 'moco_md_ps') clause is present.
To Reproduce
I haven't figured out what exactly causes this to happen. It seems to affect longer passages of text, but perhaps that's just bias because their absence stands out more.
OS:
Ubuntu 24.04.1 LTS, x64, Ryzen 7900X
ParadeDB Version:
v0.13.0
Are you using ParadeDB Docker, Helm, or the extension(s) standalone?
ParadeDB Docker Image
Full Name:
Cameron Duley
Affiliation:
My own behalf
Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential
Did you include the code required to reproduce the issue?
- Yes, I have
Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?
- Yes, I have
Activity
neilyio commented on Dec 11, 2024
As discussed in our community Slack, this happens because the fast field normalizers use the same remove_long behavior as tokenizers, which by default removes strings that are 255 characters or longer. While tokenizers offer a way to configure this, normalizers do not.
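One way to check this explanation against the data is to list the rows that have a transcript but are not matched by paradedb.exists, together with their transcript lengths. The query below is only a sketch: it reuses the table and column names from the report and assumes "id" is never NULL. Whether the 255 limit is measured in characters or bytes is not confirmed in this thread, so both are reported.

```sql
-- Rows with a transcript that paradedb.exists does not return,
-- with the transcript length in characters and in bytes.
-- Assumption: "id" is a non-null key, so NOT IN behaves as expected.
SELECT
  c."id",
  length(c."transcript")       AS transcript_chars,
  octet_length(c."transcript") AS transcript_bytes
FROM "calls" AS c
WHERE c."transcript" IS NOT NULL
  AND c."id" NOT IN (
    SELECT e."id"
    FROM "calls" AS e
    WHERE e."id" @@@ paradedb.exists('transcript'::paradedb.fieldname)
  )
ORDER BY transcript_chars;
```

If the explanation holds, every row returned should sit at or above the 255 threshold. As a possible stopgap, filtering with a plain c0."transcript" IS NOT NULL predicate instead of paradedb.exists should avoid the normalizer entirely, though that predicate may not be served by the BM25 index.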
eeeebbbbrrrr commented on Jan 24, 2025
I believe @neilyio's answer resolves this issue.