Skip to content

Latest commit

 

History

History
 
 

pg_search

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 

pg_search

Test pg_search

Overview

pg_search is a Postgres extension that enables full text search over heap tables using the BM25 algorithm. It is built on top of Tantivy, the Rust-based alternative to Apache Lucene, using pgrx.

pg_search is supported on all versions supported by the PostgreSQL Global Development Group, which includes PostgreSQL 12+.

Check out the pg_search benchmarks here.

pg_search uses Tantivy version 0.22.0.

Roadmap

  • Custom tokenizers and multi-language support
  • BM25 scoring
  • Highlighting
  • Filtering
  • Autocomplete
  • Fuzzy search
  • Hybrid search
  • JSON fields
  • Datetime fields
  • Aggregations/facets
  • Distributed search

Installation

From ParadeDB

The easiest way to use the extension is to run the ParadeDB Dockerfile:

docker run \
  --name paradedb \
  -e POSTGRESQL_USERNAME=<user> \
  -e POSTGRESQL_PASSWORD=<password> \
  -e POSTGRESQL_DATABASE=<dbname> \
  -e POSTGRESQL_POSTGRES_PASSWORD=<superuser_password> \
  -v paradedb_data:/bitnami/postgresql \
  -p 5432:5432 \
  -d \
  paradedb/paradedb:latest

This will spin up a Postgres instance with pg_search preinstalled.

From Self-Hosted PostgreSQL

If you are self-hosting Postgres and would like to use the extension within your existing Postgres, follow the steps below.

It's very important to make the following change to your postgresql.conf configuration file. pg_search must be in the list of shared_preload_libraries:

shared_preload_libraries = 'pg_search'

This enables the extension to spawn a background worker process that performs writes to the index. If this background process is not started because of an incorrect postgresql.conf configuration, your database connection will crash or hang when you attempt to create a pg_search index.

Debian/Ubuntu

We provide prebuilt binaries for Debian-based Linux for Postgres 16, 15 and 14. You can download the latest version for your architecture from the releases page.

Our prebuilt binaries come with the ICU tokenizer enabled, which requires the libicu library. If you don't have it installed, you can do so with:

# Ubuntu 20.04 or 22.04
sudo apt-get install -y libicu70

# Ubuntu 24.04
sudo apt-get install -y libicu74

Or, you can compile the extension from source without --features icu to build without the ICU tokenizer.

ParadeDB collects anonymous telemetry to help us understand how many people are using the project. You can opt out of telemetry by setting export PARADEDB_TELEMETRY=false (or unsetting the variable) in your shell or in your ~/.bashrc file before running the extension.

macOS

We don't suggest running production workloads on macOS. As a result, we don't provide prebuilt binaries for macOS. If you are running Postgres on macOS and want to install pg_search, please follow the development instructions, but do cargo pgrx install --release instead of cargo pgrx run. This will build the extension from source and install it in your Postgres instance.

You can then create the extension in your database by running:

CREATE EXTENSION pg_search;

Note: If you are using a managed Postgres service like Amazon RDS, you will not be able to install pg_search until the Postgres service explicitly supports it.

Windows

Windows is not supported. This restriction is inherited from pgrx not supporting Windows.

Usage

Indexing

pg_search comes with a helper function that creates a test table that you can use for quick experimentation.

CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

To index the table, use the following SQL command:

CALL paradedb.create_bm25(
        index_name => 'search_idx',
        schema_name => 'public',
        table_name => 'mock_items',
        key_field => 'id',
        text_fields => '{description: {tokenizer: {type: "en_stem"}}, category: {}}',
        numeric_fields => '{rating: {}}'
);

Note the mandatory key_field option in the WITH code. Every bm25 index needs a key_field, which should be the name of a column that will function as a row's unique identifier within the index. Usually, the key_field can just be the name of your table's primary key column.

Once the indexing is complete, you can run various search functions on it.

Basic Search

Execute a search query on your indexed table:

SELECT description, rating, category
FROM search_idx.search(
  '(description:keyboard OR category:electronics) AND rating:>2',
  limit_rows => 5
);

This will return:

         description         | rating |  category
-----------------------------+--------+-------------
 Plastic Keyboard            |      4 | Electronics
 Ergonomic metal keyboard    |      4 | Electronics
 Innovative wireless earbuds |      5 | Electronics
 Fast charging power bank    |      4 | Electronics
 Bluetooth-enabled speaker   |      3 | Electronics
(5 rows)

Note the usage of limit_rows instead of the SQL LIMIT clause. For optimal performance, we recommend always using limit_rows and offset_rows instead of LIMIT and OFFSET.

Similarly, the rating:>2 filter was used instead of the SQL WHERE clause for efficient filtering.

Advanced features like BM25 scoring, highlighting, custom tokenizers, fuzzy search, and more are supported. Please refer to the documentation and quickstart for a more thorough overview of pg_search's query support.

Development

Prerequisites

To develop the extension, first install stable Rust using rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install stable

Note: While it is possible to install Rust via your package manager, we recommend using rustup as we've observed inconcistencies with Homebrew's Rust installation on macOS.

Then, install the PostgreSQL version of your choice using your system package manager. Here we provide the commands for the default PostgreSQL version used by this project:

# macOS
brew install postgresql@16

# Ubuntu
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt-get update && sudo apt-get install -y postgresql-16 postgresql-server-dev-16

If you are using Postgres.app to manage your macOS PostgreSQL, you'll need to add the pg_config binary to your path before continuing:

export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"

Then, install and initialize pgrx:

# Note: Replace --pg16 with your version of Postgres, if different (i.e. --pg15, --pg14, etc.)
cargo install --locked cargo-pgrx --version 0.11.3

# macOS arm64
cargo pgrx init --pg16=/opt/homebrew/opt/postgresql@16/bin/pg_config

# macOS amd64
cargo pgrx init --pg16=/usr/local/opt/postgresql@16/bin/pg_config

# Ubuntu
cargo pgrx init --pg16=/usr/lib/postgresql/16/bin/pg_config

If you prefer to use a different version of Postgres, update the --pg flag accordingly.

Note: While it is possible to develop using pgrx's own Postgres installation(s), via cargo pgrx init without specifying a pg_config path, we recommend using your system package manager's Postgres as we've observed inconsistent behaviours when using pgrx's.

ICU Tokenizer

pg_search comes with multiple tokenizers for different languages. The ICU tokenizer, which enables tokenization for Arabic, Amharic, and Greek, is not enabled by default in development due to the additional dependencies it requires. To develop with the ICU tokenizer enabled, first:

Ensure that the libicu library is installed. It should come preinstalled on most distros, but you can install it with your system package manager if it isn't:

# macOS
brew install icu4c

# Ubuntu 20.04 or 22.04
sudo apt-get install -y libicu70

# Ubuntu 24.04
sudo apt-get install -y libicu74

Additionally, on macOS you'll need to add the icu-config binary to your path before continuing:

# ARM macOS
export PKG_CONFIG_PATH="/opt/homebrew/opt/icu4c/lib/pkgconfig"

# Intel macOS
export PKG_CONFIG_PATH="/usr/local/opt/icu4c/lib/pkgconfig"

Finally, to enable the ICU tokenizer in development, pass --features icu to the cargo pgrx run and cargo pgrx test commands.

Running the Extension

First, start pgrx:

cargo pgrx run

This will launch an interactive connection to Postgres. Inside Postgres, create the extension by running:

CREATE EXTENSION pg_search;

Now, you have access to all the extension functions.

Modifying the Extension

If you make changes to the extension code, follow these steps to update it:

  1. Recompile the extension:
cargo pgrx run
  1. Recreate the extension to load the latest changes:
DROP EXTENSION pg_search;
CREATE EXTENSION pg_search;

Testing

We use cargo test as our runner for pg_lakehouse tests.

License

pg_search is licensed under the GNU Affero General Public License v3.0 and as commercial software. For commercial licensing, please contact us at sales@paradedb.com.