Performance Profiling with the Profile API and Slow Logs

A search query takes 800ms in production. The team’s first reaction: “The cluster is slow.” The cluster is not slow. One query is slow. Profiling reveals that a single wildcard query inside a bool clause scans 4 million terms in the inverted index. The rest of the query executes in 12ms.

The Profile API

The Profile API adds detailed timing information to a search response. Every query clause, every collector, every aggregation reports its execution time, the number of documents it matched, and the Lucene-level operations it performed.

// HARDENED: Profile a slow query to identify the bottleneck

public SearchResponse<DocPage> profileSearch(String tenantId, String query)
        throws IOException {
    return client.search(s -> s
        .index("docs-" + tenantId)
        .profile(true)  // Enable profiling
        .query(q -> q.bool(b -> b
            .must(mu -> mu.multiMatch(mm -> mm
                .query(query)
                .fields("title^3", "body", "code_snippets")
                .type(TextQueryType.CrossFields)
            ))
            .should(sh -> sh.matchPhrase(mp -> mp
                .field("title")
                .query(query)
                .boost(2.0f)
            ))
        ))
        .size(10),
        DocPage.class
    );
}

Profile Output Interpretation

The profile output is a tree of query nodes. Each node reports:

Field	Meaning
`type`	Lucene query class (e.g., `BooleanQuery`, `TermQuery`, `WildcardQuery`)
`description`	The query’s string representation
`time_in_nanos`	Total wall-clock time for this query node
`breakdown`	Sub-timings: `build_scorer`, `next_doc`, `advance`, `score`, `match`
`children`	Child query nodes (for compound queries)

// Extract the slowest query clause from profile output

public record QueryProfile(
    String type,
    String description,
    long timeNanos,
    Map<String, Long> breakdown
) {}

public List<QueryProfile> extractSlowClauses(SearchResponse<DocPage> response,
        long thresholdMs) {

    List<QueryProfile> slowClauses = new ArrayList<>();

    for (var shardProfile : response.profile().shards()) {
        for (var searchProfile : shardProfile.searches()) {
            for (var queryProfile : searchProfile.query()) {
                collectSlowClauses(queryProfile, thresholdMs, slowClauses);
            }
        }
    }

    return slowClauses.stream()
        .sorted(Comparator.comparingLong(QueryProfile::timeNanos).reversed())
        .toList();
}

private void collectSlowClauses(
        co.elastic.clients.elasticsearch.core.search.QueryProfile node,
        long thresholdMs,
        List<QueryProfile> results) {

    long timeMs = node.timeInNanos() / 1_000_000;
    if (timeMs >= thresholdMs) {
        results.add(new QueryProfile(
            node.type(),
            node.description(),
            node.timeInNanos(),
            node.breakdown()
        ));
    }

    if (node.children() != null) {
        for (var child : node.children()) {
            collectSlowClauses(child, thresholdMs, results);
        }
    }
}

Common Profile Patterns

Pattern	Profile Signature	Root Cause	Fix
Wildcard scan	`WildcardQuery` with high `next_doc` count	Leading wildcard `error`	Use n-gram field or keyword search
Deep scoring	`FunctionScoreQuery` with high `score` time	Complex score function	Simplify or use rescore on top N
Many clause expansion	`BooleanQuery` with 1000+ children	Fuzzy query or synonym expansion	Limit fuzzy expansion, reduce synonyms
Slow fetch	Low query time, high fetch time	Large `_source`, many highlight fields	Use source filtering, limit highlights

Slow Logs

The Profile API is for ad-hoc diagnosis. Slow logs are for production monitoring. They log queries and fetches that exceed a time threshold.

# Index-level slow log settings
index.search.slowlog.threshold.query.warn: 5s
index.search.slowlog.threshold.query.info: 2s
index.search.slowlog.threshold.query.debug: 500ms
index.search.slowlog.threshold.query.trace: 100ms

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 500ms

# Log the full query source (required for debugging)
index.search.slowlog.level: info
index.search.slowlog.source: 1000 # Max chars of query source to log

// Set slow log thresholds via the Java client

public void configureSlowLogs(String indexName) throws IOException {
    client.indices().putSettings(ps -> ps
        .index(indexName)
        .settings(s -> s
            .putAll(Map.of(
                "index.search.slowlog.threshold.query.warn",
                JsonData.of("5s"),
                "index.search.slowlog.threshold.query.info",
                JsonData.of("2s"),
                "index.search.slowlog.threshold.query.debug",
                JsonData.of("500ms"),
                "index.search.slowlog.threshold.fetch.warn",
                JsonData.of("1s"),
                "index.search.slowlog.source",
                JsonData.of("2000")
            ))
        )
    );
}

The Decision Rule

Enable slow logs on all production indices with query.warn: 5s and query.info: 2s. These thresholds catch pathological queries without flooding the log.

Use the Profile API for targeted investigation after slow logs identify a problematic query. The Profile API adds significant overhead (20-50%) and should not be enabled for production traffic.

Focus on the breakdown.next_doc and breakdown.advance counters in the profile output. High values for these counters indicate that the query is scanning many documents rather than using the inverted index efficiently. This points to wildcard queries, regex queries, or script queries that bypass the index.