PostgreSQL Full-Text Search as a Starting Point
PostgreSQL Full-Text Search as a Starting Point
The Symptom
A startup builds a documentation platform for small teams. The MVP has 500 documents. The team evaluates OpenSearch, spends two weeks setting up a cluster, configuring mappings, building a sync pipeline, and deploying monitoring. The search quality is excellent. The operational burden is disproportionate: 70% of infrastructure management time goes to a search cluster that serves 50 queries per hour.
The Internals
PostgreSQL’s full-text search uses tsvector (a sorted list of lexemes with positions) and tsquery (a search expression) to provide text search with stemming, ranking, and index support.
Capabilities:
- Stemming via language-specific dictionaries (English, German, etc.)
- Weighted fields (A, B, C, D weights for title, body, etc.)
- GIN index for fast tsvector lookups
ts_rankandts_rank_cdfor relevance scoring- Phrase search with
<->(adjacent) and<N>(N-distance) operators
Limitations:
- No BM25. Ranking uses term frequency and inverse document frequency but with a simpler formula.
- No multi-field analysis pipelines. All fields must be concatenated into a single tsvector.
- No synonyms at query time without application-level expansion.
- No fuzzy matching (edit distance).
- No aggregations/faceted search without additional SQL queries.
- No horizontal scaling beyond PostgreSQL’s replication.
The Implementation
Spring Data JPA Integration
@Entity
@Table(name = "documents")
public class Document {
@Id
private String slug;
private String tenantId;
private String title;
@Column(columnDefinition = "text")
private String body;
private String contentType;
private String version;
@Column(columnDefinition = "tsvector")
private String searchVector;
}
@Repository
public interface DocumentRepository extends JpaRepository<Document, String> {
@Query(value = """
SELECT d.* FROM documents d
WHERE d.tenant_id = :tenantId
AND d.search_vector @@ plainto_tsquery('english', :query)
ORDER BY ts_rank(
setweight(to_tsvector('english', d.title), 'A') ||
setweight(to_tsvector('english', d.body), 'B'),
plainto_tsquery('english', :query)
) DESC
LIMIT :limit
""", nativeQuery = true)
List<Document> search(
@Param("tenantId") String tenantId,
@Param("query") String query,
@Param("limit") int limit
);
@Query(value = """
SELECT d.content_type, COUNT(*) as count
FROM documents d
WHERE d.tenant_id = :tenantId
AND d.search_vector @@ plainto_tsquery('english', :query)
GROUP BY d.content_type
ORDER BY count DESC
""", nativeQuery = true)
List<Object[]> searchWithFacets(
@Param("tenantId") String tenantId,
@Param("query") String query
);
}
Search Vector Maintenance
// Trigger to maintain the search_vector column
@Component
public class SearchVectorMigration implements Flyway callback {
public String migrationSql() {
return """
-- Create the tsvector column
ALTER TABLE documents ADD COLUMN IF NOT EXISTS
search_vector tsvector;
-- GIN index for fast search
CREATE INDEX IF NOT EXISTS idx_documents_search
ON documents USING GIN(search_vector);
-- Trigger to update search_vector on insert/update
CREATE OR REPLACE FUNCTION update_search_vector()
RETURNS trigger AS $$
BEGIN
NEW.search_vector :=
setweight(to_tsvector('english',
coalesce(NEW.title, '')), 'A') ||
setweight(to_tsvector('english',
coalesce(NEW.body, '')), 'B');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE TRIGGER trg_search_vector
BEFORE INSERT OR UPDATE OF title, body ON documents
FOR EACH ROW EXECUTE FUNCTION update_search_vector();
-- Backfill existing documents
UPDATE documents SET search_vector =
setweight(to_tsvector('english',
coalesce(title, '')), 'A') ||
setweight(to_tsvector('english',
coalesce(body, '')), 'B');
""";
}
}
The Measurement
PostgreSQL FTS vs OpenSearch performance at different data scales:
| Metric | 10K docs (PG) | 100K docs (PG) | 1M docs (PG) | 1M docs (OS) |
|---|---|---|---|---|
| p50 latency | 2ms | 5ms | 25ms | 8ms |
| p99 latency | 8ms | 18ms | 95ms | 22ms |
| Relevance (NDCG@5) | 0.68 | 0.68 | 0.68 | 0.77 |
| Faceted search | 4ms (SQL) | 12ms (SQL) | 85ms (SQL) | 8ms |
| Infrastructure cost | $0 | $0 | $0 | $3,500/mo |
PostgreSQL FTS performs well up to 100K documents. At 1M documents, latency degrades and the relevance gap becomes significant (0.68 vs 0.77 NDCG@5) because PostgreSQL lacks BM25, multi-field analysis, and synonym expansion.
The Decision Rule
Start with PostgreSQL FTS when the document count is below 100,000 and the team does not have search engineering expertise. The zero-infrastructure-cost and zero-synchronization-complexity advantages outweigh the relevance quality gap for small datasets.
Migrate to OpenSearch when any of these conditions are met:
- Document count exceeds 100,000 and latency requirements are under 20ms p99
- Search quality requirements demand multi-field analysis, synonyms, or fuzzy matching
- Faceted search with aggregations becomes a core product feature
- The search query volume exceeds what a single PostgreSQL replica can serve
The migration path is well-defined: build a CDC pipeline (Chapter 18), create the OpenSearch index with the desired mapping, verify NDCG improvement, switch the search endpoint.