Log-Based Search Analytics Pipeline
The Symptom
The product manager asks: “What do our users search for most?” The team checks the application logs. Search queries are logged as unstructured text mixed with HTTP access logs. Extracting the top queries requires a custom grep pipeline that misses 30% of queries due to inconsistent log formatting.
The Internals
Search logs are the raw material for understanding user behavior. Every search query, its results, and the user’s subsequent actions form a feedback loop that drives search improvement. Without structured search logs, this feedback loop is broken.
The search analytics pipeline:
- Structured logging. Every search request produces a structured log entry with the query, results, latency, and user context.
- Indexing. Log entries are indexed into a dedicated analytics index with keyword fields for exact aggregation.
- Aggregation. Daily and weekly rollups produce top-queries, zero-result-queries, and slow-queries reports.
- Action. Each report maps to a specific improvement action: add synonyms, create missing content, optimize slow queries.
The Implementation
Search Event Schema
import com.fasterxml.jackson.annotation.JsonProperty;
import java.time.Instant;
import java.util.List;
import java.util.Map;

// One document per search request.
// Note: Jackson needs the JavaTimeModule (jackson-datatype-jsr310) registered to serialize Instant.
public record SearchEvent(
    @JsonProperty("event_id") String eventId,
    @JsonProperty("timestamp") Instant timestamp,
    @JsonProperty("tenant_id") String tenantId,
    @JsonProperty("user_id") String userId,
    @JsonProperty("query") String query,
    @JsonProperty("query_normalized") String queryNormalized,
    @JsonProperty("result_count") int resultCount,
    @JsonProperty("latency_ms") long latencyMs,
    @JsonProperty("page") int page,
    @JsonProperty("results_shown") List<String> resultsShown,
    @JsonProperty("filters_applied") Map<String, String> filtersApplied,
    @JsonProperty("search_type") String searchType // lexical, semantic, hybrid
) {}

// One document per result click; search_event_id links back to the originating SearchEvent.
public record ClickEvent(
    @JsonProperty("event_id") String eventId,
    @JsonProperty("timestamp") Instant timestamp,
    @JsonProperty("search_event_id") String searchEventId,
    @JsonProperty("tenant_id") String tenantId,
    @JsonProperty("user_id") String userId,
    @JsonProperty("clicked_doc_slug") String clickedDocSlug,
    @JsonProperty("click_position") int clickPosition
) {}
Search Event Logger
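import java.io.IOException;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.Refresh;

public class SearchEventLogger {
    // DateTimeFormatter is immutable and thread-safe, so one shared instance is enough.
    private static final DateTimeFormatter MONTH_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM");

    private final OpenSearchClient client;

    public SearchEventLogger(OpenSearchClient client) {
        this.client = client;
    }

    public void logSearch(SearchEvent event) throws IOException {
        client.index(i -> i
            .index("search-events-" + formatMonth(event.timestamp()))
            .document(event)
            // Analytics tolerates refresh lag, so do not force a refresh per event.
            .refresh(Refresh.False)
        );
    }

    public void logClick(ClickEvent event) throws IOException {
        client.index(i -> i
            .index("click-events-" + formatMonth(event.timestamp()))
            .document(event)
            .refresh(Refresh.False)
        );
    }

    // Monthly index suffix in UTC, e.g. 2024-03.
    private String formatMonth(Instant timestamp) {
        return timestamp.atZone(ZoneOffset.UTC).format(MONTH_FORMAT);
    }
}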
Analytics Reports
import java.io.IOException;
import java.util.List;
import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.FieldValue;

public class SearchAnalyticsReporter {
    private final OpenSearchClient client;

    public SearchAnalyticsReporter(OpenSearchClient client) {
        this.client = client;
    }

    public record TopQuery(String query, long count, double avgResultCount,
                           double avgLatencyMs) {}

    // Most frequent normalized queries for one tenant over the last `days` days,
    // with average result count and latency per query.
    public List<TopQuery> topQueries(String tenantId, int days, int topN)
            throws IOException {
        var response = client.search(s -> s
            .index("search-events-*")
            .size(0) // aggregations only, no hits
            .query(q -> q.bool(b -> b
                .filter(f -> f.term(t -> t.field("tenant_id").value(FieldValue.of(tenantId))))
                .filter(f -> f.range(r -> r
                    .field("timestamp")
                    .gte(JsonData.of("now-" + days + "d"))
                ))
            ))
            .aggregations("top_queries", a -> a
                .terms(t -> t
                    .field("query_normalized") // must be mapped as keyword
                    .size(topN)
                )
                .aggregations("avg_result_count", sub -> sub
                    .avg(avg -> avg.field("result_count"))
                )
                .aggregations("avg_latency", sub -> sub
                    .avg(avg -> avg.field("latency_ms"))
                )
            ),
            Void.class
        );
        return response.aggregations().get("top_queries")
            .sterms().buckets().array().stream()
            .map(bucket -> new TopQuery(
                bucket.key().stringValue(),
                bucket.docCount(),
                bucket.aggregations().get("avg_result_count").avg().value(),
                bucket.aggregations().get("avg_latency").avg().value()
            ))
            .toList();
    }

    public record ZeroResultQuery(String query, long count) {}

    // Normalized queries that returned no results at least 3 times in the window.
    public List<ZeroResultQuery> zeroResultQueries(String tenantId, int days)
            throws IOException {
        var response = client.search(s -> s
            .index("search-events-*")
            .size(0)
            .query(q -> q.bool(b -> b
                .filter(f -> f.term(t -> t.field("tenant_id").value(FieldValue.of(tenantId))))
                .filter(f -> f.term(t -> t.field("result_count").value(FieldValue.of(0L))))
                .filter(f -> f.range(r -> r
                    .field("timestamp")
                    .gte(JsonData.of("now-" + days + "d"))
                ))
            ))
            .aggregations("zero_result_queries", a -> a
                .terms(t -> t
                    .field("query_normalized")
                    .size(50)
                    .minDocCount(3) // ignore one-off typos
                )
            ),
            Void.class
        );
        return response.aggregations().get("zero_result_queries")
            .sterms().buckets().array().stream()
            .map(bucket -> new ZeroResultQuery(
                bucket.key().stringValue(),
                bucket.docCount()
            ))
            .toList();
    }
}
The Measurement
Weekly search analytics for the documentation platform (Tenant: Acme Corp):
| Report | Query | Count | Action |
|---|---|---|---|
| Top queries | “authentication” | 2,340 | Verify ranking quality |
| Top queries | “rate limiting” | 1,890 | Verify ranking quality |
| Zero-result | “webhook retry policy” | 45 | Content gap → create doc |
| Zero-result | “graphql subscription” | 38 | Synonym gap → add synonym |
| Slow queries | “how to implement*” (wildcard) | 12 | Query rewrite → remove wildcard |
| Low CTR | “error handling” (CTR: 8%) | 890 | Ranking problem → tune weights |
Each row maps to a concrete improvement action. Zero-result queries expose content gaps and vocabulary mismatches. Low click-through rate queries expose ranking problems. Slow queries expose query optimization opportunities.
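The reporter above covers top and zero-result queries; the low-CTR row needs search events joined to click events through search_event_id. The following is a minimal in-memory sketch of that join, not the platform's implementation. It assumes CTR means "share of searches that received at least one click", and the `ClickThroughReport` class, its trimmed-down `Search` and `Click` records, and the `minSearches` threshold are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: joins searches to clicks and ranks queries by lowest CTR.
final class ClickThroughReport {
    // Trimmed-down stand-ins for SearchEvent and ClickEvent (assumption).
    record Search(String eventId, String queryNormalized) {}
    record Click(String searchEventId) {}
    record QueryCtr(String query, long searches, double ctr) {}

    private ClickThroughReport() {}

    // CTR here = searches with at least one click / total searches (assumed definition).
    static List<QueryCtr> lowestCtr(List<Search> searches, List<Click> clicks,
                                    long minSearches, int limit) {
        // Several clicks on the same search still count as one clicked search.
        Set<String> clickedSearchIds = new HashSet<>();
        for (Click click : clicks) {
            clickedSearchIds.add(click.searchEventId());
        }

        // Per query: index 0 = searches, index 1 = searches with a click.
        Map<String, long[]> totals = new HashMap<>();
        for (Search search : searches) {
            long[] t = totals.computeIfAbsent(search.queryNormalized(), k -> new long[2]);
            t[0]++;
            if (clickedSearchIds.contains(search.eventId())) {
                t[1]++;
            }
        }

        // Drop rare queries, whose CTR is too noisy to act on.
        List<QueryCtr> rows = new ArrayList<>();
        totals.forEach((query, t) -> {
            if (t[0] >= minSearches) {
                rows.add(new QueryCtr(query, t[0], (double) t[1] / t[0]));
            }
        });

        // Lowest CTR first; ties broken alphabetically for stable output.
        rows.sort(Comparator.comparingDouble(QueryCtr::ctr).thenComparing(QueryCtr::query));
        return List.copyOf(rows.subList(0, Math.min(limit, rows.size())));
    }
}
```

At production volume the same join would more likely run inside the analytics store than in application memory; the sketch only pins down the arithmetic behind the CTR column.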
The Decision Rule
Log every search event with a structured schema that includes the normalized query, result count, latency, and tenant. Normalization (lowercase, whitespace trim, stopword removal) ensures that “API Key” and “api key” aggregate into the same bucket.
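A minimal sketch of the normalization step, assuming it runs in the application before the event is logged. The `QueryNormalizer` class and its small stopword list are hypothetical; a real deployment would choose a stopword list that fits its content and language.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper that produces the value stored in query_normalized.
final class QueryNormalizer {
    // Illustrative stopword list (assumption); tune for the actual corpus.
    private static final Set<String> STOPWORDS =
        Set.of("a", "an", "the", "to", "of", "for", "in");

    private QueryNormalizer() {}

    static String normalize(String rawQuery) {
        if (rawQuery == null) {
            return "";
        }
        // Locale.ROOT keeps lowercasing independent of the server's default locale.
        String lowered = rawQuery.toLowerCase(Locale.ROOT).trim();
        if (lowered.isEmpty()) {
            return "";
        }
        // Split on whitespace runs, drop stopwords, rejoin with single spaces.
        String[] tokens = lowered.split("\\s+");
        String kept = Arrays.stream(tokens)
            .filter(token -> !STOPWORDS.contains(token))
            .collect(Collectors.joining(" "));
        // A query made only of stopwords is kept as typed rather than logged as empty.
        return kept.isEmpty() ? String.join(" ", tokens) : kept;
    }
}
```

With this sketch, `QueryNormalizer.normalize("API Key")` and `QueryNormalizer.normalize("  api   key ")` both yield `api key`, so the terms aggregation on query_normalized counts them in one bucket.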
Generate weekly reports for: top 50 queries, top 50 zero-result queries, top 20 slow queries (> p95 latency), and bottom 20 CTR queries. Each report should map to a responsible team and a concrete action category (content creation, synonym addition, query optimization, ranking tuning).
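The "> p95 latency" cutoff for the slow-queries report can be sketched as follows. This is an in-memory illustration under stated assumptions: it uses the nearest-rank percentile definition (the analytics store may interpolate or approximate instead), and `SlowQueryFilter` and its `Timed` record are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: finds search events slower than the p95 latency of the batch.
final class SlowQueryFilter {
    // Trimmed-down stand-in for SearchEvent (assumption).
    record Timed(String queryNormalized, long latencyMs) {}

    private SlowQueryFilter() {}

    // Nearest-rank percentile: the smallest sample with at least p percent
    // of all samples at or below it.
    static long percentile(long[] latencies, double p) {
        if (latencies.length == 0) {
            throw new IllegalArgumentException("no latency samples");
        }
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Events strictly slower than the batch p95, slowest first.
    static List<Timed> slowerThanP95(List<Timed> events) {
        long p95 = percentile(events.stream().mapToLong(Timed::latencyMs).toArray(), 95.0);
        return events.stream()
            .filter(event -> event.latencyMs() > p95)
            .sorted(Comparator.comparingLong(Timed::latencyMs).reversed())
            .toList();
    }
}
```

Grouping the surviving events by query_normalized and keeping the 20 most frequent would then give the weekly slow-queries list.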
Index search events into monthly indices (e.g., search-events-2024-03) with an ISM policy that deletes indices older than 12 months. Search analytics data grows linearly with traffic and provides diminishing value beyond 12 months.
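The month arithmetic behind the retention window can be sketched as below. ISM applies the deletion on the cluster side and decides by index age, so this code is illustrative only: it shows which monthly index names fall outside a 12-month window, assuming the search-events-yyyy-MM naming used by the logger. The `MonthlyIndexRetention` class is hypothetical.

```java
import java.time.YearMonth;
import java.util.List;

// Hypothetical helper illustrating the monthly naming scheme and retention cutoff.
final class MonthlyIndexRetention {
    private static final String PREFIX = "search-events-";

    private MonthlyIndexRetention() {}

    // YearMonth.toString() yields the same yyyy-MM form the logger uses, e.g. 2024-03.
    static String indexNameFor(YearMonth month) {
        return PREFIX + month;
    }

    // Names of monthly indices whose month is more than retentionMonths before current.
    static List<String> expired(List<String> indexNames, YearMonth current, int retentionMonths) {
        YearMonth oldestKept = current.minusMonths(retentionMonths);
        return indexNames.stream()
            .filter(name -> name.startsWith(PREFIX))
            .filter(name -> YearMonth.parse(name.substring(PREFIX.length())).isBefore(oldestKept))
            .sorted()
            .toList();
    }
}
```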