search at depth

Hot Threads and JVM Pressure Diagnostics

Chapter 44 of 60

The Symptom

Search latency triples every afternoon between 2pm and 4pm, even though query volume does not increase during this window. CPU usage shows periodic spikes to 100% that last 2-5 seconds, and the spikes line up with the latency increases. The hot threads API reveals that nearly all of the CPU is being consumed by GC task threads.

The Internals

OpenSearch runs on the JVM. Search performance is bounded by three JVM resources:

  1. Heap memory. Used for caches (node query cache, field data cache, shard request cache), in-flight requests, and internal data structures. When heap usage stays above 75%, garbage collection becomes aggressive and increasingly disruptive.

  2. GC pauses. During the stop-the-world phases of garbage collection, all application threads stop. A G1GC full GC on a 32GB heap can pause for 5-15 seconds, during which the node is unresponsive. If the pause outlasts the cluster's fault-detection timeouts, other nodes may consider the node failed and begin reallocating its shards.

  3. Thread starvation. When all threads in a pool are blocked (waiting for disk I/O, GC pauses, or network), new requests queue and eventually get rejected.
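
The queue-then-reject behavior in item 3 can be reproduced outside OpenSearch with a plain `java.util.concurrent` pool. The sketch below is only an illustration: the class name `StarvationDemo`, the pool size, and the queue size are hypothetical and are not OpenSearch's actual thread pool settings. Every worker blocks on a latch (standing in for a disk, network, or GC stall), so later submissions first fill the bounded queue and are then rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative only: a small fixed pool with a bounded queue, standing in for
// a search thread pool. Names and sizes are hypothetical.
class StarvationDemo {

    /** Submits `requests` tasks that all block; returns how many were rejected. */
    static int countRejections(int threads, int queueSize, int requests) {
        CountDownLatch release = new CountDownLatch(1); // opened when the "stall" ends
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize),
            new ThreadPoolExecutor.AbortPolicy()); // throw once the queue is full

        int rejected = 0;
        for (int i = 0; i < requests; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // worker is stuck, as if waiting on I/O
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // all workers busy and the queue is full
            }
        }

        // End the stall and let the queued tasks drain before returning.
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 workers + 3 queue slots absorb 5 of the 10 requests.
        System.out.println("rejected=" + countRejections(2, 3, 10));
    }
}
```

With 2 threads and a queue of 3, only 5 of 10 blocked requests are accepted. Nothing about the pool is misconfigured; the rejections are purely a consequence of the workers being unable to make progress.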

The Implementation

Hot Threads Diagnostic

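import java.io.IOException;
import java.time.Duration;

// Assumes the low-level REST client built on Apache HttpClient 4; client
// versions built on HttpClient 5 ship EntityUtils in a different package.
import org.apache.http.util.EntityUtils;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

public class HotThreadsDiagnostic {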

    private final RestClient restClient;

    public HotThreadsDiagnostic(RestClient restClient) {
        this.restClient = restClient;
    }

    public String captureHotThreads(int snapshots, Duration interval)
            throws IOException {
        Request request = new Request("GET",
            "/_nodes/hot_threads" +
            "?snapshots=" + snapshots +
            "&interval=" + interval.toMillis() + "ms" +
            "&threads=5" +
            "&type=cpu");

        Response response = restClient.performRequest(request);
        return EntityUtils.toString(response.getEntity());
    }

    // Capture hot threads during a performance problem window
    public String captureWithContext() throws IOException {
        // Take 3 snapshots 500ms apart to identify persistent hot threads
        // vs transient spikes
        return captureHotThreads(3, Duration.ofMillis(500));
    }
}

Heap and GC Monitoring

public record JvmHealth(
    long heapUsedBytes,
    long heapMaxBytes,
    double heapPercent,
    long youngGcCount,
    long youngGcTimeMs,
    long oldGcCount,
    long oldGcTimeMs,
    double gcOverheadPercent
) {
    public boolean isUnderGcPressure() {
        return heapPercent > 75 || gcOverheadPercent > 5;
    }

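    // Note: when populated from node stats, oldGcCount is cumulative since JVM
    // start, so this stays true after the first old collection until the node
    // restarts. Compare successive samples to alert on recent activity only.
    public boolean isInDanger() {
        return heapPercent > 90 || oldGcCount > 0;
    }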
}

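// Assumes `client` is an OpenSearchClient field on the enclosing class.
public JvmHealth getJvmHealth(String nodeId) throws IOException {
    var stats = client.nodes().stats(ns -> ns
        .nodeId(nodeId)
        .metric("jvm"));

    // Fail with a clear message instead of NoSuchElementException
    if (stats.nodes().isEmpty()) {
        throw new IllegalArgumentException("No node matched: " + nodeId);
    }
    var node = stats.nodes().values().iterator().next();
    var jvm = node.jvm();
    var mem = jvm.mem();
    var youngGc = jvm.gc().collectors().get("young");
    var oldGc = jvm.gc().collectors().get("old");

    // GC time and uptime are both cumulative since JVM start, so this is a
    // lifetime average and understates a recent burst of GC activity.
    long totalGcTimeMs = youngGc.collectionTimeInMillis() +
                         oldGc.collectionTimeInMillis();
    long uptimeMs = jvm.uptimeInMillis();
    double gcOverhead = uptimeMs > 0
        ? (double) totalGcTimeMs / uptimeMs * 100
        : 0;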

    return new JvmHealth(
        mem.heapUsedInBytes(),
        mem.heapMaxInBytes(),
        mem.heapUsedPercent(),
        youngGc.collectionCount(),
        youngGc.collectionTimeInMillis(),
        oldGc.collectionCount(),
        oldGc.collectionTimeInMillis(),
        gcOverhead
    );
}
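
Because the GC counters in node stats are cumulative, dividing by total uptime averages over the whole life of the JVM. A windowed view between two polls is usually more useful for alerting. The following is a minimal sketch of that idea; `GcWindow`, `Sample`, and the method names are hypothetical helpers, not part of any OpenSearch client, and the samples are assumed to be taken by the caller (for example, from two successive `getJvmHealth` calls plus a timestamp).

```java
// Hypothetical helper: compares two cumulative readings of a node's GC counters.
class GcWindow {

    /** One reading of the cumulative counters and the wall-clock time it was taken. */
    record Sample(long takenAtMs, long oldGcCount, long totalGcTimeMs) {}

    /** Percent of wall-clock time spent in GC between the two samples. */
    static double gcOverheadPercent(Sample earlier, Sample later) {
        long windowMs = later.takenAtMs() - earlier.takenAtMs();
        if (windowMs <= 0) {
            return 0.0; // samples out of order or identical: nothing to report
        }
        // Counters reset when a node restarts; clamp so a reset reads as 0.
        long gcMs = Math.max(0, later.totalGcTimeMs() - earlier.totalGcTimeMs());
        return 100.0 * gcMs / windowMs;
    }

    /** Old-generation collections per hour, extrapolated from the window. */
    static double oldGcPerHour(Sample earlier, Sample later) {
        long windowMs = later.takenAtMs() - earlier.takenAtMs();
        if (windowMs <= 0) {
            return 0.0;
        }
        long collections = Math.max(0, later.oldGcCount() - earlier.oldGcCount());
        return collections * 3_600_000.0 / windowMs;
    }

    public static void main(String[] args) {
        Sample t0 = new Sample(0, 10, 5_000);
        Sample t1 = new Sample(600_000, 12, 35_000); // ten minutes later
        System.out.println(gcOverheadPercent(t0, t1)); // 30s of GC in 600s
        System.out.println(oldGcPerHour(t0, t1));      // 2 collections in 10 min
    }
}
```

Shorter windows react faster but extrapolate more aggressively: a single old collection in a one-minute window reads as 60 per hour, so the polling interval should be chosen with the alert threshold in mind.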

JVM Tuning Recommendations

# config/jvm.options (JVM flags belong here, not in opensearch.yml)

# Heap: 50% of available RAM, at most 31GB (stays below the ~32GB compressed oops threshold)
-Xms16g
-Xmx16g

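# G1GC settings (G1 is the default collector on recent JDKs; the values below
# are tuning choices, so compare them with the jvm.options your distribution ships)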
-XX:+UseG1GC
-XX:G1HeapRegionSize=16m
-XX:InitiatingHeapOccupancyPercent=40
-XX:MaxGCPauseMillis=200

# GC logging
-Xlog:gc*:file=/var/log/opensearch/gc.log:time,pid,tags:filecount=10,filesize=64m

The Measurement

JVM health indicators and their search performance impact:

Heap Usage   GC Overhead   Old GC/hour   p99 Search Latency   Status
< 60%        < 2%          0             Baseline             Healthy
60-75%       2-5%          0             +20%                 Watch
75-85%       5-15%         1-3           +100%                Degraded
85-95%       > 15%         > 5           +500%                Critical
> 95%        > 30%         Continuous    Unresponsive         Emergency
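
The table above can be turned into a status check. The sketch below is one possible reading of it, not a definitive mapping, and it makes several assumptions: the three indicators are scored independently and the worst one wins; range boundaries belong to the more severe row; more than 3 old collections per hour is treated as Critical, because the table leaves 4-5 undefined; and "Continuous" old GC is not modeled separately. `JvmStatusClassifier` and its method names are hypothetical.

```java
// Hypothetical classifier based on the JVM health table; see the stated assumptions.
class JvmStatusClassifier {

    // Declared in order of increasing severity so compareTo() ranks them.
    enum Status { HEALTHY, WATCH, DEGRADED, CRITICAL, EMERGENCY }

    static Status fromHeapPercent(double pct) {
        if (pct > 95) return Status.EMERGENCY;
        if (pct >= 85) return Status.CRITICAL;
        if (pct >= 75) return Status.DEGRADED;
        if (pct >= 60) return Status.WATCH;
        return Status.HEALTHY;
    }

    static Status fromGcOverheadPercent(double pct) {
        if (pct > 30) return Status.EMERGENCY;
        if (pct > 15) return Status.CRITICAL;
        if (pct >= 5) return Status.DEGRADED;
        if (pct >= 2) return Status.WATCH;
        return Status.HEALTHY;
    }

    static Status fromOldGcPerHour(double perHour) {
        if (perHour > 3) return Status.CRITICAL;  // assumption: table gap at 4-5
        if (perHour > 0) return Status.DEGRADED;  // any old GC activity at all
        return Status.HEALTHY;
    }

    private static Status worse(Status a, Status b) {
        return a.compareTo(b) >= 0 ? a : b;
    }

    /** Overall status is the most severe of the three indicators. */
    static Status classify(double heapPercent, double gcOverheadPercent,
                           double oldGcPerHour) {
        Status status = fromHeapPercent(heapPercent);
        status = worse(status, fromGcOverheadPercent(gcOverheadPercent));
        status = worse(status, fromOldGcPerHour(oldGcPerHour));
        return status;
    }

    public static void main(String[] args) {
        // Heap looks fine, but two old GCs per hour already degrade the node.
        System.out.println(classify(50, 1, 2));
    }
}
```

Taking the worst indicator is a deliberate choice here: a node at 55% heap that still runs old collections every hour is not healthy, even though its heap column says so.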

The Decision Rule

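Set heap to 50% of available RAM, up to a maximum of 31GB, which stays below the compressed oops threshold near 32GB. Above that threshold the JVM disables compressed ordinary object pointers, so every object reference grows from 4 to 8 bytes and a heap just over 32GB holds fewer objects than one just under it. The other half of RAM is left to the operating system's filesystem cache.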
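
The sizing rule is simple arithmetic, sketched below as a helper that emits the two `jvm.options` lines. `HeapSizing` and its methods are hypothetical names for illustration; the 31GB cap is the figure from this chapter, and working in megabytes avoids rounding small machines down to a zero-sized heap.

```java
// Hypothetical helper applying the rule: heap = min(50% of RAM, 31GB).
class HeapSizing {

    static final long MAX_HEAP_MB = 31L * 1024; // cap from the decision rule

    /** Recommended heap in MB for a machine with `ramMb` MB of physical RAM. */
    static long recommendedHeapMb(long ramMb) {
        if (ramMb <= 0) {
            throw new IllegalArgumentException("ramMb must be positive: " + ramMb);
        }
        return Math.min(ramMb / 2, MAX_HEAP_MB);
    }

    /** The -Xms/-Xmx pair for jvm.options; both are set to the same value. */
    static String jvmOptions(long ramMb) {
        long heapMb = recommendedHeapMb(ramMb);
        return "-Xms" + heapMb + "m\n-Xmx" + heapMb + "m";
    }

    public static void main(String[] args) {
        System.out.println(jvmOptions(32L * 1024));  // 32GB machine: half of RAM
        System.out.println(jvmOptions(128L * 1024)); // 128GB machine: capped at 31GB
    }
}
```

A 32GB machine gets a 16384m heap (the `-Xms16g`/`-Xmx16g` shown earlier), while a 128GB machine is capped at 31744m rather than 64GB.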

Monitor old-generation collections as the primary JVM health indicator. Young-generation collections are normal and expected; old-generation collections indicate heap pressure that will degrade search latency. More than 3 old GCs per hour on a node warrants investigation.

When hot threads show GC threads dominating CPU, the cluster is under memory pressure. Reduce caches (field data cache size, query cache size), add nodes to distribute the data, or increase heap (up to 31GB). Never increase heap past 31GB.