Update default postgresql.conf values in Dockerfile #2111
Description
What happens?
Search performance in v0.14.0 has degraded significantly compared to v0.13.2. In the measurements below, combined planning and execution time is over 200 times higher (planning time roughly 248x, execution time roughly 174x).
To Reproduce
A public dataset is used to demonstrate the issue. The dataset was imported 10 times to create a table with ~12 million rows, amplifying the effect. Any sufficiently large dataset should show similar behavior.
Table schema:
CREATE TABLE legitimate_account
(
id serial primary key,
content TEXT
);
The content column was imported from legitimate_account.csv and repeated 10 times.
dataset=# SELECT COUNT(*) FROM legitimate_account;
count
----------
12296170
(1 row)
Index creation:
CREATE INDEX search_idx ON legitimate_account
USING bm25 (id, content)
WITH (key_field = 'id');
Although the dataset is in Chinese, the default tokenizer was used to rule out tokenizer-specific effects. I later tested with the chinese_lindera tokenizer, and the results were similar.
Query used:
EXPLAIN ANALYZE SELECT * FROM legitimate_account WHERE "content" @@@ '新闻' LIMIT 1000;
Performance comparison
v0.13.2:
Limit (cost=10.00..2010.00 rows=1000 width=169) (actual time=1.057..6.852 rows=1000 loops=1)
-> Custom Scan (ParadeDB Scan) on legitimate_account (cost=10.00..2010.00 rows=1000 width=169) (actual time=1.056..6.749 rows=1000 loops=1)
Table: legitimate_account
Index: search_idx
Heap Fetches: 1000
Exec Method: NormalScanExecState
Scores: false
Tantivy Query: {"with_index":{"oid":86688,"query":{"parse_with_field":{"field":"content","query_string":"新闻","lenient":null,"conjunction_mode":null}}}}
Planning Time: 10.886 ms
Execution Time: 13.211 ms
v0.14.0:
Limit (cost=10.00..2010.00 rows=1000 width=169) (actual time=22.786..41.805 rows=1000 loops=1)
-> Custom Scan (ParadeDB Scan) on legitimate_account (cost=10.00..2010.00 rows=1000 width=169) (actual time=22.784..41.735 rows=1000 loops=1)
Table: legitimate_account
Index: search_idx
Heap Fetches: 1000
Exec Method: TopNScanExecState
Scores: false
Top N Limit: 1000
Tantivy Query: {"with_index":{"oid":78357,"query":{"parse_with_field":{"field":"content","query_string":"新闻","lenient":null,"conjunction_mode":null}}}}
Planning Time: 2702.707 ms
Execution Time: 2304.358 ms
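To make the size of the regression explicit, here is a minimal Python sketch that extracts the timing lines from the two plans above and computes the slowdown per phase. The helper names (parse_plan, slowdown) are illustrative and not part of ParadeDB; the only inputs are the values copied from the EXPLAIN ANALYZE output in this report.

```python
import re

# Lines copied verbatim from the two EXPLAIN ANALYZE outputs in this report.
PLAN_V0_13_2 = """\
Exec Method: NormalScanExecState
Planning Time: 10.886 ms
Execution Time: 13.211 ms
"""

PLAN_V0_14_0 = """\
Exec Method: TopNScanExecState
Planning Time: 2702.707 ms
Execution Time: 2304.358 ms
"""


def parse_plan(text):
    """Extract planning/execution time (ms) and the exec method from plan text."""
    planning = float(re.search(r"Planning Time:\s*([\d.]+) ms", text).group(1))
    execution = float(re.search(r"Execution Time:\s*([\d.]+) ms", text).group(1))
    method = re.search(r"Exec Method:\s*(\S+)", text).group(1)
    return {"planning_ms": planning, "execution_ms": execution, "exec_method": method}


def slowdown(old, new):
    """Return new/old ratios for planning, execution, and their sum."""
    old_total = old["planning_ms"] + old["execution_ms"]
    new_total = new["planning_ms"] + new["execution_ms"]
    return {
        "planning": new["planning_ms"] / old["planning_ms"],
        "execution": new["execution_ms"] / old["execution_ms"],
        "total": new_total / old_total,
    }


old = parse_plan(PLAN_V0_13_2)
new = parse_plan(PLAN_V0_14_0)
ratios = slowdown(old, new)

print(f"exec method: {old['exec_method']} -> {new['exec_method']}")
for phase, ratio in ratios.items():
    print(f"{phase}: {ratio:.0f}x slower")
```

With the numbers above, this reports roughly 248x for planning, 174x for execution, and 208x for the two combined. It also surfaces the one structural difference between the plans: the exec method changed from NormalScanExecState to TopNScanExecState.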
Both versions were tested in newly created containers with no persistent storage. Each container started from scratch, and the dataset was re-imported in both cases.
Docker images used: 17-v0.14.0 and 17-v0.13.2
OS:
Linux
ParadeDB Version:
v0.14.0
Are you using ParadeDB Docker, Helm, or the extension(s) standalone?
ParadeDB Docker Image
Full Name:
(prefer not to say)
Affiliation:
N/A
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include the code required to reproduce the issue?
- Yes, I have
Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?
- Yes, I have