lakebase_text

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

The lakebase_text extension adds BM25 full-text search to Lakebase via the lakebase_bm25 index type. It is compatible with PostgreSQL's standard tsvector type and query operators.

Install

First, enable Lakebase Search in your project settings. Then install the extension:

CREATE EXTENSION IF NOT EXISTS lakebase_text;

PostgreSQL's built-in full-text search uses GIN indexes and ts_rank for relevance scoring. ts_rank does not use global corpus statistics, so scores degrade as data grows. lakebase_text improves on this in two ways:

  • BM25 ranking accounts for term frequency, document length, and corpus-wide statistics simultaneously, producing more accurate relevance scores than TF-IDF.
  • Top-K pushdown uses Block-Max WAND to return only the K most relevant results from the index, without scoring every match in the result set.

Quick start

Build the lakebase_bm25 index after inserting data. BM25 computes corpus-wide statistics at index build time, not incrementally, so the index must be created on a populated table.

-- Create a table with a generated tsvector column
CREATE TABLE documents (
  id      SERIAL PRIMARY KEY,
  passage TEXT,
  vector  TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', passage)) STORED
);

-- Insert data before building the BM25 index
INSERT INTO documents (passage) VALUES
  ('Postgres is a powerful open-source relational database.'),
  ('Vector search finds semantically similar results.'),
  ('BM25 ranking improves full-text search relevance scores.');

-- Create the BM25 index on the populated table
CREATE INDEX documents_passage_bm25 ON documents USING lakebase_bm25 (vector);

-- Query: lower score means more relevant
SELECT id, passage,
  vector <@> to_bm25query(to_tsvector('english', 'database'), 'documents_passage_bm25') AS score
FROM documents
ORDER BY score
LIMIT 5;

The <@> operator returns a negative BM25 score. Ordering by ascending score returns the most relevant results first.

Keep the index accurate

BM25 statistics are computed at index build time and updated by VACUUM. For most workloads, regular VACUUM keeps scores accurate. After bulk-loading a large amount of new data, run VACUUM manually:

VACUUM documents;

Session-level GUCs

Parameter Type Default Description
lakebase_bm25.default_limit integer 1000 Maximum number of results returned from the index.
lakebase_bm25.prefilter boolean false When true, evaluates WHERE conditions before computing BM25 scores. Use when filters eliminate many rows and are cheap to evaluate.
lakebase_bm25.enable_scan boolean true Set to false to force a sequential scan, bypassing the index. Useful for testing.
SET lakebase_bm25.default_limit TO 20;
SET lakebase_bm25.prefilter = on;

GUCs take precedence over index storage parameters when both are set.

Index storage parameters

Set these options at index creation time or with ALTER INDEX:

Parameter Type Default Range Description
k1 real 1.2 1.2 to 2.0 Term frequency saturation. Higher values give more weight to repeated terms.
b real 0.75 0.0 to 1.0 Document length normalization. 0.0 disables length normalization; 1.0 applies full normalization.
default_limit integer 1000 1 to 65535 Fallback limit when the session GUC is not set.
prefilter boolean false N/A Fallback prefilter setting when the session GUC is not set.
-- Set parameters at index creation
CREATE INDEX documents_passage_bm25 ON documents USING lakebase_bm25 (vector)
  WITH (default_limit = 20, k1 = 1.5);

-- Update parameters on an existing index
ALTER INDEX documents_passage_bm25 SET (default_limit = 50);

API reference

Types

bm25query_tsvector: combines a query tsvector with the target index identifier. Used as the right operand of <@>.

Operators

Operator Signature Returns Description
<@> tsvector <@> bm25query_tsvector double precision Returns a negative BM25 score. Order ascending to get the most relevant results first.

Functions

Function Returns Description
to_bm25query(query tsvector, index regclass) bm25query_tsvector Constructs a BM25 query object from a tsvector and the index's object identifier.

Operator classes

Class Default for Description
tsvector_bm25_ops tsvector Maps tsvector columns to the <@> operator for BM25 scoring. This is the default operator class for tsvector with lakebase_bm25; you do not need to specify it explicitly.

Next steps