Bemærk
Adgang til denne side kræver godkendelse. Du kan prøve at logge på eller ændre mapper.
Adgang til denne side kræver godkendelse. Du kan prøve at ændre mapper.
Important
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.
The lakebase_text extension adds BM25 full-text search to Lakebase via the lakebase_bm25 index type. It is compatible with PostgreSQL's standard tsvector type and query operators.
Install
First, enable Lakebase Search in your project settings. Then install the extension:
CREATE EXTENSION IF NOT EXISTS lakebase_text;
Why lakebase_text instead of standard GIN full-text search
PostgreSQL's built-in full-text search uses GIN indexes and ts_rank for relevance scoring. ts_rank does not use global corpus statistics, so scores degrade as data grows. lakebase_text improves on this in two ways:
- BM25 ranking accounts for term frequency, document length, and corpus-wide statistics simultaneously, producing more accurate relevance scores than TF-IDF.
- Top-K pushdown uses Block-Max WAND to return only the K most relevant results from the index, without scoring every match in the result set.
Quick start
Build the lakebase_bm25 index after inserting data. BM25 computes corpus-wide statistics at index build time, not incrementally, so the index must be created on a populated table.
-- Create a table with a generated tsvector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
passage TEXT,
vector TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', passage)) STORED
);
-- Insert data before building the BM25 index
INSERT INTO documents (passage) VALUES
('Postgres is a powerful open-source relational database.'),
('Vector search finds semantically similar results.'),
('BM25 ranking improves full-text search relevance scores.');
-- Create the BM25 index on the populated table
CREATE INDEX documents_passage_bm25 ON documents USING lakebase_bm25 (vector);
-- Query: lower score means more relevant
SELECT id, passage,
vector <@> to_bm25query(to_tsvector('english', 'database'), 'documents_passage_bm25') AS score
FROM documents
ORDER BY score
LIMIT 5;
The <@> operator returns a negative BM25 score. Ordering by ascending score returns the most relevant results first.
Keep the index accurate
BM25 statistics are computed at index build time and updated by VACUUM. For most workloads, regular VACUUM keeps scores accurate. After bulk-loading a large amount of new data, run VACUUM manually:
VACUUM documents;
Tune search
Session-level GUCs
| Parameter | Type | Default | Description |
|---|---|---|---|
lakebase_bm25.default_limit |
integer | 1000 |
Maximum number of results returned from the index. |
lakebase_bm25.prefilter |
boolean | false |
When true, evaluates WHERE conditions before computing BM25 scores. Use when filters eliminate many rows and are cheap to evaluate. |
lakebase_bm25.enable_scan |
boolean | true |
Set to false to force a sequential scan, bypassing the index. Useful for testing. |
SET lakebase_bm25.default_limit TO 20;
SET lakebase_bm25.prefilter = on;
GUCs take precedence over index storage parameters when both are set.
Index storage parameters
Set these options at index creation time or with ALTER INDEX:
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
k1 |
real | 1.2 |
1.2 to 2.0 | Term frequency saturation. Higher values give more weight to repeated terms. |
b |
real | 0.75 |
0.0 to 1.0 | Document length normalization. 0.0 disables length normalization; 1.0 applies full normalization. |
default_limit |
integer | 1000 |
1 to 65535 | Fallback limit when the session GUC is not set. |
prefilter |
boolean | false |
N/A | Fallback prefilter setting when the session GUC is not set. |
-- Set parameters at index creation
CREATE INDEX documents_passage_bm25 ON documents USING lakebase_bm25 (vector)
WITH (default_limit = 20, k1 = 1.5);
-- Update parameters on an existing index
ALTER INDEX documents_passage_bm25 SET (default_limit = 50);
API reference
Types
bm25query_tsvector: combines a query tsvector with the target index identifier. Used as the right operand of <@>.
Operators
| Operator | Signature | Returns | Description |
|---|---|---|---|
<@> |
tsvector <@> bm25query_tsvector |
double precision |
Returns a negative BM25 score. Order ascending to get the most relevant results first. |
Functions
| Function | Returns | Description |
|---|---|---|
to_bm25query(query tsvector, index regclass) |
bm25query_tsvector |
Constructs a BM25 query object from a tsvector and the index's object identifier. |
Operator classes
| Class | Default for | Description |
|---|---|---|
tsvector_bm25_ops |
tsvector |
Maps tsvector columns to the <@> operator for BM25 scoring. This is the default operator class for tsvector with lakebase_bm25; you do not need to specify it explicitly. |