Benchmark Methodology

Overview

pg_textsearch benchmarks measure full-text search performance using real-world datasets. Benchmarks run nightly on the main branch and on-demand for feature branches via GitHub Actions.

Datasets

MS MARCO (Primary)

The MS MARCO passage ranking dataset is our primary benchmark. It contains 8.8 million passages from web documents with real search queries from Bing users.

Metric                    Value
Documents                 8,841,823
Average document length   ~35 tokens
Test queries              800 (100 per token bucket)
Query token buckets       1, 2, 3, 4, 5, 6, 7, 8+

MS MARCO v2

The MS MARCO v2 passage ranking dataset scales the primary benchmark to 138 million passages for large-corpus testing.

Metric                    Value
Documents                 138,364,198
Test queries              691 (sampled from dev set)
Query token buckets       1, 2, 3, 4, 5, 6, 7, 8+

Wikipedia

Wikipedia article abstracts provide longer documents for testing scalability. Available in 10K, 100K, 1M, and full (~6M) configurations.

Cranfield

The classic Cranfield collection (1,400 documents) is used for quick regression testing.

Environment

GitHub Actions Runner

Component     Specification
Platform      Ubuntu 24.04 (ubuntu-latest)
CPU           2-core AMD EPYC (GitHub-hosted)
Memory        7 GB RAM
Storage       ~30 GB available after cleanup
Postgres      PostgreSQL 17

Postgres Configuration

shared_buffers = 4GB
maintenance_work_mem = 1GB
work_mem = 256MB
effective_cache_size = 6GB
random_page_cost = 1.1

Metrics Collected

Index Build

Query Latency

Queries are grouped by token count into buckets (1 through 8+). For each bucket, we run 100 queries and report per-bucket latency statistics, including the p50 used by the weighted-average metric.
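
As an illustration, per-bucket statistics can be computed with a small stdlib helper. This is a sketch: p50 is the only statistic the document explicitly ties to reporting (via the weighted-average metric); the other percentiles shown here are assumptions.

```python
import statistics

def bucket_stats(latencies_ms):
    """Summarize one bucket of per-query latencies (in milliseconds)."""
    # quantiles(n=100) returns 99 cut points; index 94 ~ p95, index 98 ~ p99
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(latencies_ms),
        "p95": qs[94],
        "p99": qs[98],
        "mean": statistics.fmean(latencies_ms),
    }

# Example: 100 simulated latencies for one bucket, 5.0 .. 14.9 ms
lat = [5 + 0.1 * i for i in range(100)]
print(bucket_stats(lat))
```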

Throughput

Total time to execute all test queries sequentially, reported as average milliseconds per query.
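
For example, with illustrative (made-up) numbers:

```python
# Hypothetical wall-clock time for one sequential pass over all test queries
total_seconds = 12.4
n_queries = 800          # MS MARCO test query count from the table above

avg_ms_per_query = total_seconds * 1000 / n_queries
print(round(avg_ms_per_query, 2))
```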

Weighted-Average Latency

A single summary metric that weights per-bucket p50 latencies by the query-length distribution observed in MS MARCO v1 (1,010,905 queries after tokenization with to_tsvector). This reflects realistic workload performance more accurately than an equal-weight average across buckets.

Token bucket   Queries    Weight
1              35,638     3.5%
2              165,033    16.3%
3              304,887    30.2%
4              264,177    26.1%
5              143,765    14.2%
6              59,558     5.9%
7              22,595     2.2%
8+             15,252     1.5%

Formula: weighted_p50 = Σ(bucket_p50 × weight) / Σ(weight)
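
The formula can be computed directly from the table above. In this sketch, only the bucket weights come from the document; the per-bucket p50 latencies are made-up placeholders:

```python
# Bucket weights from the observed MS MARCO v1 query-length distribution
weights = {
    "1": 0.035, "2": 0.163, "3": 0.302, "4": 0.261,
    "5": 0.142, "6": 0.059, "7": 0.022, "8+": 0.015,
}

def weighted_p50(bucket_p50):
    """weighted_p50 = sum(bucket_p50 * weight) / sum(weight)."""
    num = sum(bucket_p50[b] * w for b, w in weights.items())
    den = sum(weights.values())  # ~0.999 due to rounding; division normalizes
    return num / den

# Hypothetical per-bucket p50 latencies in milliseconds (illustrative only)
p50s = {"1": 2.0, "2": 3.1, "3": 4.5, "4": 5.8,
        "5": 7.0, "6": 8.3, "7": 9.1, "8+": 11.0}
print(round(weighted_p50(p50s), 2))
```

Dividing by the sum of the weights keeps the metric correct even though the rounded percentages do not total exactly 100%.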

Benchmark Procedure

  1. Setup: Start fresh Postgres instance, create extension
  2. Load: Bulk load dataset using \copy
  3. Index: Create BM25 index, force spill to segments
  4. Warmup: Run each query once to warm caches
  5. Measure: Run timed queries, collect statistics
  6. Report: Extract metrics, publish to dashboard

Note: All queries use LIMIT 10 to simulate typical search result pages. The Block-Max WAND optimization is enabled by default.

System X Comparison

We run identical benchmarks against System X (a competing Postgres BM25 extension) to provide context for our performance numbers. Both extensions are measured with the same datasets, queries, Postgres configuration, and hardware.

See the detailed comparison for latest results.

Reproducibility

All benchmark code is in the benchmarks/ directory. To run locally:

# MS MARCO benchmark
cd benchmarks/datasets/msmarco
./download.sh full
psql -f load.sql

# Or use the runner script
./benchmarks/runner/run_benchmark.sh msmarco

Limitations