Reindex API Internals and Transform Scripts

The Symptom

The team runs a reindex operation to add a code_language field extracted from code blocks in the document body. The reindex completes, but code_language is null for 40% of documents. Investigation reveals that the Painless script produces no value for documents without code blocks, and the reindex reports no failures: those documents are indexed without the field.

The Internals

The reindex API is a scroll-then-bulk operation:

1. Opens a scroll on the source index.
2. Fetches a batch of documents (default: 1,000).
3. Optionally transforms each document through a Painless script.
4. Bulk indexes the batch into the target index.
5. Repeats until the scroll is exhausted.
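The loop can be modeled in memory as a rough sketch. This is not OpenSearch client code: ReindexLoopSketch and its parameters are hypothetical names, and plain lists stand in for the source and target indices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** In-memory model of the reindex loop; lists stand in for indices. */
final class ReindexLoopSketch {

    /** Copies source into target batch by batch and returns the number of batches. */
    static int reindex(List<Map<String, Object>> source,
                       List<Map<String, Object>> target,
                       int batchSize,
                       UnaryOperator<Map<String, Object>> transform) {
        int batches = 0;
        // "Scroll": walk the source one batch at a time until it is exhausted.
        for (int from = 0; from < source.size(); from += batchSize) {
            int to = Math.min(from + batchSize, source.size());
            List<Map<String, Object>> bulk = new ArrayList<>();
            for (Map<String, Object> doc : source.subList(from, to)) {
                // Copy first so the transform never mutates the source document.
                Map<String, Object> copy = new HashMap<>(doc);
                bulk.add(transform == null ? copy : transform.apply(copy));
            }
            // "Bulk index": write the whole batch to the target in one step.
            target.addAll(bulk);
            batches++;
        }
        return batches;
    }
}
```

With 2,500 source documents and a batch size of 1,000, the loop runs three times: two full batches and one partial batch.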

When wait_for_completion is false, the request returns immediately with a task ID and the reindex runs as a background task whose result is stored in the .tasks index. The task ID can be used to check progress, cancel the operation, or retrieve the result after completion.

The failures array in the response is not a reliable signal for transform problems. It records search and bulk-indexing failures, and a non-empty array means the request aborted because of them. An uncaught script exception fails the request rather than being skipped, so the dangerous case is a script that avoids or catches the error and finishes without assigning the field. The document indexes successfully, nothing is reported, and the result is silent data corruption.
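Because nothing in the reindex response flags these documents, one way to surface them is to count target documents that lack the field once the reindex has finished. The sketch below only builds the request for the _count API with a must_not/exists query; MissingFieldCheck and its method names are hypothetical, and the field name is assumed to need no JSON escaping.

```java
/** Builds a _count request for documents that lack a given field. */
final class MissingFieldCheck {

    /** Path of the count endpoint for the index. */
    static String countPath(String index) {
        return "/" + index + "/_count";
    }

    /** Query body matching documents where the field is absent or null. */
    static String missingFieldQuery(String field) {
        // An exists query does not match null values, so must_not/exists
        // covers both "field missing" and "field explicitly null".
        return """
            {
              "query": {
                "bool": {
                  "must_not": { "exists": { "field": "%s" } }
                }
              }
            }
            """.formatted(field);
    }
}
```

A non-zero count for code_language in the target index after the reindex means some documents went through the transform without receiving a value.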

The Implementation

Safe Transform Script

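import java.io.IOException;
import org.opensearch.client.Request;
import org.opensearch.client.Response;

// Starts the reindex as a background task and returns its task ID.
// The regex literal in the script only compiles if the cluster's
// script.painless.regex.enabled setting permits regexes.
public String reindexWithCodeLanguageExtraction(String sourceIndex,
        String targetIndex) throws IOException {

    Request request = new Request("POST", "/_reindex?wait_for_completion=false");
    // "size" under "source" is the scroll batch size.
    // The script sets both fields on every path, including a null body,
    // so no document leaves the transform with a missing value.
    request.setJsonEntity("""
        {
          "source": {
            "index": "%s",
            "size": 500
          },
          "dest": {
            "index": "%s"
          },
          "script": {
            "lang": "painless",
            "source": "def body = ctx._source.body; if (body == null) { ctx._source.code_language = 'none'; ctx._source.has_code = false; return; } def matcher = /```(\\\\w+)/.matcher(body); def languages = new HashSet(); while (matcher.find()) { languages.add(matcher.group(1)); } ctx._source.code_language = languages.isEmpty() ? 'none' : String.join(',', languages); ctx._source.has_code = !languages.isEmpty();"
          },
          "conflicts": "proceed"
        }
        """.formatted(sourceIndex, targetIndex));

    Response response = restClient.performRequest(request);
    // extractTaskId is a helper defined elsewhere; it is expected to read
    // the "task" field from the response body.
    return extractTaskId(response);
}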

The script extracts programming language identifiers from fenced code blocks (```java, ```python). For documents without code blocks, it explicitly sets code_language to "none" instead of leaving it null. The has_code boolean field enables efficient filtering without regex at query time.
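To check the extraction logic against edge cases without a cluster, the same regex can be mirrored in plain Java. This is a sketch for offline testing only, not the script itself: CodeLanguageExtractor is a hypothetical name, and Painless and Java regex behavior are assumed to match for this pattern.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java mirror of the Painless transform, for testing edge cases offline. */
final class CodeLanguageExtractor {

    // Same pattern as the script: a triple-backtick fence followed by word characters.
    private static final Pattern FENCE = Pattern.compile("```(\\w+)");

    /** Returns the comma-joined language tags, or "none" when there are none. */
    static String extract(String body) {
        if (body == null) {
            return "none";
        }
        // TreeSet is used here so the joined output has a stable order;
        // the script's HashSet does not guarantee one.
        Set<String> languages = new TreeSet<>();
        Matcher matcher = FENCE.matcher(body);
        while (matcher.find()) {
            languages.add(matcher.group(1));
        }
        return languages.isEmpty() ? "none" : String.join(",", languages);
    }
}
```

Two edge cases worth noting: a fence with no language tag yields "none" even though the document contains code, and a tag such as c++ is captured as c because \w does not match the plus sign.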

Task Monitoring

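import com.fasterxml.jackson.databind.JsonNode;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.util.EntityUtils; // Apache HttpClient 4; the package differs on HttpClient 5-based clients
import org.opensearch.client.Request;
import org.opensearch.client.Response;

public record ReindexProgress(
    long created,
    long updated,
    long deleted,
    long versionConflicts,
    long total,
    double percentComplete,
    boolean completed,
    List<String> failures
) {}

public ReindexProgress getReindexProgress(String taskId) throws IOException {
    Request request = new Request("GET", "/_tasks/" + taskId);
    Response response = restClient.performRequest(request);

    JsonNode root = objectMapper.readTree(
        EntityUtils.toString(response.getEntity()));

    // Live counters sit under task.status while the reindex is running.
    JsonNode status = root.path("task").path("status");
    boolean completed = root.path("completed").asBoolean();

    long created = status.path("created").asLong();
    long updated = status.path("updated").asLong();
    long deleted = status.path("deleted").asLong();
    long total = status.path("total").asLong();

    // Progress counts every processed operation, not only creates; otherwise
    // a reindex that mostly updates existing documents would sit near 0%.
    // Skipped documents (conflicts, noops) are not counted here.
    long processed = created + updated + deleted;
    double percent = total > 0 ? (double) processed / total * 100 : 0;

    // Failures appear under "response" once the task has completed.
    List<String> failures = new ArrayList<>();
    JsonNode failuresNode = root.path("response").path("failures");
    if (failuresNode.isArray()) {
        for (JsonNode failure : failuresNode) {
            failures.add(failure.path("cause").path("reason").asText());
        }
    }

    return new ReindexProgress(
        created,
        updated,
        deleted,
        status.path("version_conflicts").asLong(),
        total,
        percent,
        completed,
        failures
    );
}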

public void cancelReindex(String taskId) throws IOException {
    Request request = new Request("POST",
        "/_tasks/" + taskId + "/_cancel");
    restClient.performRequest(request);
}
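A caller usually wraps the progress check in a polling loop. The following is a minimal sketch of such a loop, kept independent of the REST client so it can be tested on its own; TaskPoller, its parameters, and the attempt-based timeout are all assumptions for illustration.

```java
import java.util.function.BooleanSupplier;

/** Polls a completion check at a fixed interval with a bounded number of attempts. */
final class TaskPoller {

    /** Returns true once isCompleted reports true, false if attempts run out or the thread is interrupted. */
    static boolean awaitCompletion(BooleanSupplier isCompleted,
                                   int maxAttempts,
                                   long intervalMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (isCompleted.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

In practice the supplier would call getReindexProgress(taskId).completed() and decide how to handle the IOException it can throw, since BooleanSupplier does not allow checked exceptions.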

Conflict Handling Strategy

// FRAGILE: conflicts: "abort" (the default) stops the reindex at the first
// version conflict, for example when dest uses op_type "create" and the
// document already exists in the target. Documents written before the
// conflict stay in the target, and the operation must be restarted.

// HARDENED: conflicts: "proceed" skips conflicting documents and
// reports the count as version_conflicts. A catch-up reindex handles
// the skipped documents.
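One possible shape for a catch-up pass is sketched below, assuming the goal is to copy documents that are still missing from the target without overwriting those already there. With op_type set to create, documents that exist in the target are skipped and counted as version conflicts. CatchUpReindex is a hypothetical name, and the transform script from the main reindex would need to be added to the body as well.

```java
/** Builds a reindex body that only creates documents missing from the target. */
final class CatchUpReindex {

    /** Index names are assumed to need no JSON escaping. */
    static String body(String sourceIndex, String targetIndex) {
        // op_type "create" refuses to overwrite existing documents;
        // conflicts "proceed" turns those refusals into a count instead of an abort.
        return """
            {
              "source": { "index": "%s" },
              "dest": { "index": "%s", "op_type": "create" },
              "conflicts": "proceed"
            }
            """.formatted(sourceIndex, targetIndex);
    }
}
```

Documents that conflicted because they already exist in the target are skipped again by this pass, so they need a separate decision about which copy should win.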

The Measurement

Reindex performance for 5 million documents with transform script:

Configuration	Throughput	Duration	Script Overhead
No script	12,000 docs/s	7 min	0%
Simple set script	11,400 docs/s	7.3 min	5%
Regex extraction script	8,200 docs/s	10.2 min	32%
Application-side transform + no script	11,800 docs/s	7.1 min	2%

The regex extraction script adds 32% overhead. For large reindexing operations, pre-computing the transformation in the application and reindexing with a pre-enriched source is faster than running a Painless script on every document inside OpenSearch.
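The duration and overhead columns follow directly from the throughput column. The arithmetic can be reproduced with a small helper; ReindexMath is a hypothetical name, and overhead is taken to mean throughput lost relative to the no-script baseline.

```java
/** Reproduces the duration and overhead columns from measured throughput. */
final class ReindexMath {

    /** Minutes needed to process docCount documents at docsPerSecond. */
    static double durationMinutes(long docCount, double docsPerSecond) {
        return docCount / docsPerSecond / 60.0;
    }

    /** Throughput lost relative to the baseline, as a percentage. */
    static double overheadPercent(double baselineDocsPerSecond, double docsPerSecond) {
        return (baselineDocsPerSecond - docsPerSecond) / baselineDocsPerSecond * 100.0;
    }
}
```

For the regex row, 5,000,000 documents at 8,200 docs/s take about 10.2 minutes, and (12,000 - 8,200) / 12,000 is roughly 32%.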

The Decision Rule

Validate transform scripts against representative documents before running the full reindex. Test with documents that have null fields, empty strings, and edge-case content. A script that works on sample data but fails on edge cases silently corrupts the target index.

Use conflicts: "proceed" for reindex operations on live indices. A single version conflict should not halt the migration of millions of documents. Track the conflict count and handle conflicting documents in a catch-up pass.

Pre-compute complex transformations (regex extraction, content parsing) in the application layer and write the enriched documents directly. Reserve Painless scripts for simple field manipulation (rename, type conversion, default values).