Update default postgresql.conf values in Dockerfile #2111

Open
@CatMe0w

Description

What happens?

Search performance in v0.14.0 appears to have significantly degraded compared to v0.13.2, with combined planning and execution time over 200 times slower (roughly 24 ms vs. 5 s for the query below).

To Reproduce

A public dataset is used to demonstrate the issue. The dataset was imported 10 times to create a table with ~12 million rows, amplifying the effect. Any sufficiently large dataset should show similar behavior.

Table schema:

CREATE TABLE legitimate_account
(
    id serial primary key,
    content TEXT
);

The content column was imported from legitimate_account.csv and repeated 10 times.

dataset=# SELECT COUNT(*) FROM legitimate_account;
  count   
----------
 12296170
(1 row)
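For reference, the import can be reproduced along these lines. This is a sketch: the report only names legitimate_account.csv, so the CSV layout (a single content column) and the staging-table approach are assumptions.

```sql
-- Load the CSV once into a staging table, then insert it 10 times
-- via a cross join with generate_series to reach ~12M rows.
CREATE TEMP TABLE staging (content TEXT);
\copy staging (content) FROM 'legitimate_account.csv' WITH (FORMAT csv)

INSERT INTO legitimate_account (content)
SELECT s.content
FROM staging s, generate_series(1, 10);
```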

Index creation:

CREATE INDEX search_idx ON legitimate_account
USING bm25 (id, content)
WITH (key_field = 'id');

Even though the dataset is in Chinese, the default tokenizer was used, eliminating potential tokenizer-related interference. I later tested with the chinese_lindera tokenizer, and the results were similar.

Query used:

EXPLAIN ANALYZE SELECT * FROM legitimate_account WHERE "content" @@@ '新闻' LIMIT 1000;

Performance comparison
v0.13.2:

Limit  (cost=10.00..2010.00 rows=1000 width=169) (actual time=1.057..6.852 rows=1000 loops=1)
  ->  Custom Scan (ParadeDB Scan) on legitimate_account  (cost=10.00..2010.00 rows=1000 width=169) (actual time=1.056..6.749 rows=1000 loops=1)
        Table: legitimate_account
        Index: search_idx
        Heap Fetches: 1000
        Exec Method: NormalScanExecState
        Scores: false
        Tantivy Query: {"with_index":{"oid":86688,"query":{"parse_with_field":{"field":"content","query_string":"新闻","lenient":null,"conjunction_mode":null}}}}
Planning Time: 10.886 ms
Execution Time: 13.211 ms

v0.14.0:

Limit  (cost=10.00..2010.00 rows=1000 width=169) (actual time=22.786..41.805 rows=1000 loops=1)
  ->  Custom Scan (ParadeDB Scan) on legitimate_account  (cost=10.00..2010.00 rows=1000 width=169) (actual time=22.784..41.735 rows=1000 loops=1)
        Table: legitimate_account
        Index: search_idx
        Heap Fetches: 1000
        Exec Method: TopNScanExecState
        Scores: false
        Top N Limit: 1000
        Tantivy Query: {"with_index":{"oid":78357,"query":{"parse_with_field":{"field":"content","query_string":"新闻","lenient":null,"conjunction_mode":null}}}}
Planning Time: 2702.707 ms
Execution Time: 2304.358 ms

Both versions were tested in newly created containers with no persistent storage. Each container started from scratch, and the dataset was re-imported in both cases.
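To rule out configuration drift between the two containers, the effective server settings can be compared directly in each one. This uses only the standard PostgreSQL pg_settings catalog view, nothing ParadeDB-specific:

```sql
-- List every setting that differs from its built-in default,
-- for a side-by-side diff between the v0.13.2 and v0.14.0 containers.
SELECT name, setting, unit, source
FROM pg_settings
WHERE source <> 'default'
ORDER BY name;
```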

Docker image used: 17-v0.14.0 and 17-v0.13.2
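Since this issue was ultimately triaged as a postgresql.conf change in the Dockerfile, one quick way to test whether conservative defaults explain the gap is to override settings at container start. The official postgres image entrypoint forwards `-c` arguments to the server, and the ParadeDB image is assumed to behave the same way; the specific values below are illustrative assumptions, not the defaults ParadeDB ships.

```shell
# Start the v0.14.0 image with selected postgresql.conf values
# overridden on the command line (values are illustrative only).
docker run -d --name paradedb \
  -e POSTGRES_PASSWORD=postgres \
  paradedb/paradedb:17-v0.14.0 \
  -c shared_buffers=2GB \
  -c maintenance_work_mem=1GB \
  -c max_parallel_workers_per_gather=4
```

If the slowdown disappears with tuned settings, that points at the image's default configuration rather than a regression in the scan executor itself.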

OS:

Linux

ParadeDB Version:

v0.14.0

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Docker Image

Full Name:

(prefer not to say)

Affiliation:

N/A

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Labels

  • docker: Pull requests that update Docker code
  • feature: New feature or request
  • priority-medium: Medium priority issue
  • user-request: This issue was directly requested by a user
