Embedding Pipeline and Vector Index Operations
The Symptom
The team enables semantic search by adding a knn_vector field and indexing embeddings generated by a Python microservice. Search results improve for conceptual queries. Then, during a large documentation import, the Python service falls behind on embedding generation. 200,000 documents are indexed in the lexical index but only 80,000 have vectors. The remaining 120,000 documents (60% of the corpus) are invisible to the semantic arm of hybrid search and can surface only through lexical matches.
The Internals
The embedding pipeline must keep the vector index synchronized with the lexical index. Every document that exists in the lexical index must have a corresponding vector representation. Missing vectors create a silent recall gap: documents without vectors can never be surfaced by the semantic retrieval arm of hybrid search.
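One way to make the recall gap visible is to compare the document IDs in the lexical index against the parent IDs present in the vector index. The following is a minimal sketch under the assumption that both ID sets have already been fetched (for example by scrolling each index); the class and method names are hypothetical and not part of the pipeline shown later.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: reports which lexical documents have no vectors yet.
class VectorGapCheck {

    // Returns the IDs that exist in the lexical index but have no
    // corresponding entry in the vector index, sorted for stable output.
    static Set<String> missingVectors(Set<String> lexicalDocIds,
                                      Set<String> vectorParentIds) {
        Set<String> missing = new TreeSet<>(lexicalDocIds);
        missing.removeAll(vectorParentIds);
        return missing;
    }

    // Fraction of lexical documents that have vectors (1.0 means fully in sync).
    static double coverage(int lexicalCount, int missingCount) {
        if (lexicalCount == 0) {
            return 1.0;
        }
        return (lexicalCount - missingCount) / (double) lexicalCount;
    }
}
```

Applied to the scenario above, 200,000 lexical documents with 120,000 missing vectors gives a coverage of 0.4, a figure that can be alerted on long before users notice degraded results.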
HNSW (Hierarchical Navigable Small World) is the approximate nearest neighbor algorithm used by OpenSearch’s kNN implementation. It builds a multi-layer graph where each node is a vector and edges connect nearby vectors. Query-time search starts at the top layer (sparse, long-range connections) and descends to the bottom layer (dense, short-range connections), following edges toward the query vector.
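The top-down descent can be illustrated with a deliberately simplified sketch. It is not OpenSearch's implementation: it follows a single best candidate on every layer, whereas real HNSW keeps a candidate list of size ef_search on the bottom layer. The graph representation (one adjacency map per layer, ordered from top to bottom) and all names are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

// Simplified greedy descent through a layered proximity graph.
class LayeredGreedySearch {

    static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // layers.get(0) is the sparse top layer; the last entry is the dense bottom layer.
    // Returns the index of the node where the greedy walk stops.
    static int search(float[][] vectors,
                      List<Map<Integer, List<Integer>>> layers,
                      int entryPoint,
                      float[] query) {
        int current = entryPoint;
        for (Map<Integer, List<Integer>> layer : layers) {
            boolean improved = true;
            while (improved) {
                improved = false;
                double best = squaredDistance(vectors[current], query);
                // Move to the neighbor closest to the query, if any is closer.
                for (int neighbor : layer.getOrDefault(current, List.of())) {
                    double d = squaredDistance(vectors[neighbor], query);
                    if (d < best) {
                        best = d;
                        current = neighbor;
                        improved = true;
                    }
                }
            }
            // The node reached on this layer becomes the entry point for the next one.
        }
        return current;
    }
}
```

Because the walk only ever follows edges, a vector that was never inserted into the graph cannot be reached, which is why missing embeddings fail silently rather than producing errors.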
Two parameters control the recall/latency trade-off:
- ef_search (query-time): how many candidates to explore during search. Higher values improve recall but increase latency. Default: 100.
- ef_construction (index-time): how many candidates to consider when inserting a new vector into the graph. Higher values produce a better graph structure. Default: 100; production recommendation: 256.
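To show where these parameters live, the sketch below builds a create-index request body as a JSON string. It assumes the OpenSearch k-NN plugin's knn_vector mapping with the hnsw method; the dimension of 768 used in the usage note is a placeholder, and the engine and space type are left at their defaults, which differ between OpenSearch versions. Whether the index-level ef_search setting is honored also depends on the engine, so treat this as a starting point to check against the documentation for your version.

```java
// Hypothetical helper that renders a create-index body for a kNN index.
class KnnIndexConfig {

    static String createIndexBody(int dimension, int efConstruction, int m, int efSearch) {
        return """
            {
              "settings": {
                "index": {
                  "knn": true,
                  "knn.algo_param.ef_search": %d
                }
              },
              "mappings": {
                "properties": {
                  "embedding": {
                    "type": "knn_vector",
                    "dimension": %d,
                    "method": {
                      "name": "hnsw",
                      "parameters": { "ef_construction": %d, "m": %d }
                    }
                  }
                }
              }
            }
            """.formatted(efSearch, dimension, efConstruction, m);
    }
}
```

Calling `KnnIndexConfig.createIndexBody(768, 256, 16, 200)` yields a body with ef_construction fixed at index creation and ef_search set as an index-level default. ef_construction only affects vectors inserted after it is set, so changing it later requires reindexing.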
The Implementation
Embedding Service
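import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

// escapeJson, textsToJsonArray, parseEmbeddingResponse, parseBatchEmbeddingResponse
// and EmbeddingException are helpers this listing assumes exist; they are not shown.
public class EmbeddingService {
    private final HttpClient httpClient;
    private final String embeddingServiceUrl;
    private final int dimensions;

    public EmbeddingService(String embeddingServiceUrl, int dimensions) {
        this.httpClient = HttpClient.newHttpClient();
        this.embeddingServiceUrl = embeddingServiceUrl;
        this.dimensions = dimensions;
    }

    // Embeds a single text with a 30-second request timeout.
    public float[] embed(String text) throws IOException, InterruptedException {
        var request = HttpRequest.newBuilder()
            .uri(URI.create(embeddingServiceUrl + "/embed"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                "{\"text\": " + escapeJson(text) + "}"
            ))
            .timeout(Duration.ofSeconds(30))
            .build();
        var response = httpClient.send(request,
            HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new EmbeddingException(
                "Embedding service returned " + response.statusCode());
        }
        return parseEmbeddingResponse(response.body());
    }

    // Embeds several texts in one call; the longer timeout allows for larger payloads.
    public List<float[]> embedBatch(List<String> texts)
            throws IOException, InterruptedException {
        var request = HttpRequest.newBuilder()
            .uri(URI.create(embeddingServiceUrl + "/embed-batch"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                "{\"texts\": " + textsToJsonArray(texts) + "}"
            ))
            .timeout(Duration.ofSeconds(120))
            .build();
        var response = httpClient.send(request,
            HttpResponse.BodyHandlers.ofString());
        // Fail fast on non-200 responses, as embed() does, rather than parsing an error body.
        if (response.statusCode() != 200) {
            throw new EmbeddingException(
                "Embedding service returned " + response.statusCode());
        }
        return parseBatchEmbeddingResponse(response.body());
    }
}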
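The listing above calls escapeJson and textsToJsonArray without defining them. The sketch below is one possible version, inferred from the call sites: since the request body is built as `"{\"text\": " + escapeJson(text) + "}"`, escapeJson is assumed to return a complete quoted JSON string literal. The class name JsonText is hypothetical, and in production code a JSON library is usually the safer choice.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the JSON helpers used by EmbeddingService.
class JsonText {

    // Returns the text as a quoted JSON string literal with special characters escaped.
    static String escapeJson(String text) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        // Remaining control characters must be written as \\uXXXX escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.append('"').toString();
    }

    // Renders a list of texts as a JSON array of string literals.
    static String textsToJsonArray(List<String> texts) {
        return texts.stream()
            .map(JsonText::escapeJson)
            .collect(Collectors.joining(",", "[", "]"));
    }
}
```

Escaping matters here because documentation chunks routinely contain quotes, backslashes and newlines; an unescaped chunk would make the whole batch request invalid JSON.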
Bulk Vector Indexing
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.Refresh;
import org.opensearch.client.opensearch.core.BulkRequest;
import org.opensearch.client.opensearch.core.BulkResponse;

// HARDENED: Synchronized vector indexing alongside lexical indexing
public class VectorIndexingPipeline {
    private final OpenSearchClient client;
    private final EmbeddingService embeddingService;
    private final DocumentChunker chunker;
    private static final int EMBEDDING_BATCH_SIZE = 32;

    public VectorIndexingPipeline(OpenSearchClient client,
                                  EmbeddingService embeddingService) {
        this.client = client;
        this.embeddingService = embeddingService;
        this.chunker = new DocumentChunker();
    }

    public void indexDocumentWithVectors(DocPage page) throws Exception {
        // Chunk the document
        List<DocumentChunk> chunks = chunker.chunk(page);
        if (chunks.isEmpty()) {
            return; // nothing to index; an empty bulk request would be rejected
        }

        // Generate embeddings in batches
        List<String> chunkTexts = chunks.stream()
            .map(DocumentChunk::text)
            .toList();
        List<float[]> embeddings = new ArrayList<>();
        for (int i = 0; i < chunkTexts.size(); i += EMBEDDING_BATCH_SIZE) {
            List<String> batch = chunkTexts.subList(
                i, Math.min(i + EMBEDDING_BATCH_SIZE, chunkTexts.size()));
            embeddings.addAll(embeddingService.embedBatch(batch));
        }
        // Every chunk needs exactly one embedding; a short response must not be indexed
        if (embeddings.size() != chunks.size()) {
            throw new VectorIndexingException(
                "Expected " + chunks.size() + " embeddings but got "
                    + embeddings.size() + " for document " + page.slug());
        }

        // Bulk index chunks with embeddings
        BulkRequest.Builder bulkBuilder = new BulkRequest.Builder()
            .index("docs-vectors-v1")
            .refresh(Refresh.False);
        for (int i = 0; i < chunks.size(); i++) {
            DocumentChunk chunk = chunks.get(i);
            float[] embedding = embeddings.get(i);
            Map<String, Object> doc = Map.of(
                "tenant_id", chunk.tenantId(),
                "parent_doc_slug", chunk.parentDocSlug(),
                "parent_doc_title", chunk.parentDocTitle(),
                "chunk_text", chunk.text(),
                "chunk_index", chunk.chunkIndex(),
                "embedding", embedding
            );
            bulkBuilder.operations(op -> op
                .index(idx -> idx
                    .id(chunk.chunkId())
                    .document(doc)
                )
            );
        }

        // A bulk call can succeed overall while individual items fail, so inspect the items
        BulkResponse response = client.bulk(bulkBuilder.build());
        if (response.errors()) {
            long failCount = response.items().stream()
                .filter(item -> item.error() != null)
                .count();
            throw new VectorIndexingException(
                failCount + " chunks failed to index for document " + page.slug());
        }
    }
}
HNSW Parameter Tuning
// Tuning ef_search at query time for recall vs latency trade-off
SearchRequest request = SearchRequest.of(s -> s
    .index("docs-vectors-v1")
    .query(q -> q.knn(knn -> knn
        .field("embedding")
        .vector(queryVector)
        .k(20)
    ))
    // ef_search can be set via index settings for global effect
);

// Set ef_search at the index level. It is a dynamic setting, so it can be
// changed on a live index. index.knn is a static setting that must be set
// when the index is created, so it is not part of this update.
client.indices().putSettings(ps -> ps
    .index("docs-vectors-v1")
    .settings(s -> s
        .putAll(Map.of(
            "index.knn.algo_param.ef_search", JsonData.of(200)
        ))
    )
);
The Measurement
HNSW parameter impact on recall and latency:
| ef_search | Recall@20 | p50 Latency | p99 Latency | Memory (per shard) |
|---|---|---|---|---|
| 50 | 0.88 | 5ms | 15ms | 1.2GB |
| 100 | 0.93 | 8ms | 22ms | 1.2GB |
| 200 | 0.97 | 14ms | 35ms | 1.2GB |
| 500 | 0.99 | 32ms | 80ms | 1.2GB |
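Recall@20 is the fraction of the true 20 nearest neighbors that the approximate search returns. At ef_search=200, 97% of the true nearest neighbors are found with a p50 latency of 14ms. Raising ef_search to 500 gains two more points of recall but more than doubles latency at both p50 and p99, which is rarely worth the cost for interactive queries. Memory per shard stays at 1.2GB in every row because ef_search changes how much of the graph is explored at query time, not the size of the graph.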
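Recall@k can be computed by comparing the approximate result against an exact (brute-force) result for the same query. The sketch below assumes both result lists are already available as lists of document IDs; the class name and the way the exact neighbors are obtained are assumptions, not part of the measurement setup described here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for evaluating approximate nearest neighbor results.
class RecallAtK {

    // Fraction of the exact top-k IDs that also appear in the approximate result.
    static double recallAtK(List<String> exactTopK, List<String> approxTopK) {
        if (exactTopK.isEmpty()) {
            throw new IllegalArgumentException("exactTopK must not be empty");
        }
        Set<String> truth = new HashSet<>(exactTopK);
        long hits = approxTopK.stream()
            .distinct()               // count each returned ID at most once
            .filter(truth::contains)
            .count();
        return (double) hits / truth.size();
    }
}
```

Averaging this value over a representative set of queries, with k=20, yields the kind of Recall@20 figure reported in the table.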
The Decision Rule
Set ef_construction=256 and m=16 for production vector indices, where m is the number of links the graph creates for each inserted vector. These values produce high-quality HNSW graphs without excessive indexing overhead.
Set ef_search=200 for production queries. In the measurements above this provides 97% recall with a p50 latency of 14ms and a p99 of 35ms. Increase to 500 only for offline evaluation tasks where recall matters more than latency.
Keep the vector index synchronized with the lexical index. Use a single indexing pipeline that writes to both indices, not separate pipelines that can diverge. Missing vectors are invisible failures that degrade hybrid search quality without producing errors.
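A minimal sketch of the single-pipeline rule follows. All three interfaces are hypothetical stand-ins for the embedding service and the two index writers, and they use unchecked exceptions to keep the example short. The ordering is the point: embeddings are generated before anything is written, so an embedding failure leaves both indices untouched and the document can be retried as a whole.

```java
import java.util.List;

// Hypothetical single entry point that writes one document to both indices.
class DualIndexWriter {

    interface Embedder {
        List<float[]> embed(List<String> chunks);
    }

    interface VectorWriter {
        void write(String docId, List<String> chunks, List<float[]> vectors);
    }

    interface LexicalWriter {
        void write(String docId, List<String> chunks);
    }

    private final Embedder embedder;
    private final VectorWriter vectorWriter;
    private final LexicalWriter lexicalWriter;

    DualIndexWriter(Embedder embedder, VectorWriter vectorWriter, LexicalWriter lexicalWriter) {
        this.embedder = embedder;
        this.vectorWriter = vectorWriter;
        this.lexicalWriter = lexicalWriter;
    }

    void index(String docId, List<String> chunks) {
        // Step 1: embed first. If this throws, neither index has been modified.
        List<float[]> vectors = embedder.embed(chunks);
        if (vectors.size() != chunks.size()) {
            throw new IllegalStateException(
                "Expected " + chunks.size() + " vectors but got " + vectors.size());
        }
        // Step 2: write vectors, then the lexical document. A failure in either
        // write propagates to the caller, which should retry the whole document.
        vectorWriter.write(docId, chunks, vectors);
        lexicalWriter.write(docId, chunks);
    }
}
```

This does not make the two writes atomic; a periodic comparison of the two indices is still needed to catch a failure that occurs between them.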