Semantic Search Engine Built with CocoIndex in 2 Days
These articles are AI-generated summaries. Please check the original sources for full details.
How I Built a Semantic Search Engine with CocoIndex
Linghua Jin built a semantic search engine using CocoIndex, achieving 30-second indexing for 500+ documents and 50ms query responses.
Why This Matters
Keyword-based search matches literal terms and misses context, so relevant documents that use different wording go unfound. Semantic search closes this gap by comparing vector embeddings, but it needs infrastructure for chunking, embedding, and vector storage. In this project, CocoIndex handles that pipeline with a small embedding model and Postgres vector storage, and the author reports query responses of about 50ms.
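To make the idea of matching by embedding concrete, here is a minimal sketch of ranking documents by cosine similarity. The three-dimensional vectors and file names are hypothetical stand-ins; a real embedding model produces vectors with hundreds of dimensions, and the article's project delegates this comparison to the database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs, top_k=2):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy "embeddings", invented for illustration only.
docs = {
    "ml_intro.md": [0.9, 0.1, 0.0],
    "cooking.md": [0.0, 0.2, 0.9],
    "neural_nets.md": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
results = rank(query, docs)
```

Documents whose vectors point in a similar direction to the query rank first, even when they share no keywords with it.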
Key Insights
- “500+ markdown files indexed in 30 seconds” (Real-World Example)
- “Semantic embeddings allow ‘teaching computers’ to match ‘machine learning’” (Key Features)
- “Batch indexing improves performance for large document collections” (Performance Tips)
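The batch indexing tip can be sketched as follows. This is an illustration of the general technique, not CocoIndex's internal behavior: `batched`, `index_in_batches`, and `fake_embed_batch` are hypothetical names, and the fake embedder stands in for a real model call that is cheaper per document when given many texts at once.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def index_in_batches(texts, embed_batch, batch_size=32):
    """Embed texts one batch at a time and return all vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

def fake_embed_batch(batch):
    # Hypothetical stand-in for a model call: one 1-dimensional vector per text.
    return [[float(len(text))] for text in batch]

calls = []  # records the size of each batch passed to the embedder

def counting_embed(batch):
    calls.append(len(batch))
    return fake_embed_batch(batch)

texts = [f"doc {i}" for i in range(70)]
vectors = index_in_batches(texts, counting_embed, batch_size=32)
```

With 70 documents and a batch size of 32, the embedder is called three times instead of 70.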
Working Example
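# Install CocoIndex
pip install cocoindex

import cocoindex
from psycopg_pool import ConnectionPool  # assumed source of ConnectionPool; the original snippet omits the import

# Embedding step used to embed queries at search time.
# Assumption: the original snippet calls text_to_embedding.eval() without defining it;
# this definition follows the CocoIndex quickstart pattern.
@cocoindex.transform_flow()
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Read every file under the local "markdown_files" directory
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files")
    )
    doc_embeddings = data_scope.add_collector()

    # Process and chunk documents
    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500
        )

        # Embed each chunk and collect one row per chunk
        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export to Postgres.
    # Assumption: the summary omits this step; the call follows the CocoIndex quickstart
    # for the API generation used here (newer releases may name the module differently).
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )

# Perform semantic search
def search(pool: ConnectionPool, query: str, top_k: int = 5):
    table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
    # Embed the query with the same model used for indexing
    query_vector = text_to_embedding.eval(query)
    with pool.connection() as conn:
        with conn.cursor() as cur:
            # <=> is the pgvector cosine distance operator; smaller means more similar
            cur.execute(f"""
                SELECT filename, text, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            # Convert cosine distance to a similarity score
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]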
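The example above passes chunk_size=2000 and chunk_overlap=500 to the splitter. As a rough illustration of what overlap means, here is a simplified fixed-size sliding window. This is not how SplitRecursively works (it splits along document structure); `chunk_with_overlap` is a hypothetical helper that only shows how consecutive chunks share text.

```python
def chunk_with_overlap(text, chunk_size=2000, chunk_overlap=500):
    """Split text into fixed-size windows where neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

# Hypothetical 5000-character document made of repeating letters.
text = "".join(chr(97 + i % 26) for i in range(5000))
chunks = chunk_with_overlap(text)
```

The overlap keeps a sentence that straddles a chunk boundary fully present in at least one chunk, at the cost of embedding some text twice.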
Practical Applications
- Use Case: Documentation search with 500+ markdown files using semantic embeddings
- Pitfall: Choosing embedding dimensions without balancing accuracy and performance (e.g., the 384-dimensional all-MiniLM-L6-v2 model vs. higher-dimensional models)
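The dimension trade-off can be made concrete with some back-of-the-envelope arithmetic. This sketch assumes 4-byte float values and counts only the raw vector payload, ignoring per-row and index overhead; the chunk count and the 1536-dimensional comparison model are hypothetical figures chosen for illustration.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw payload size of num_vectors embeddings stored as fixed-width floats."""
    return num_vectors * dimensions * bytes_per_value

num_chunks = 5_000  # hypothetical number of chunks in the index

small = raw_vector_bytes(num_chunks, 384)    # 384-dimensional embeddings
large = raw_vector_bytes(num_chunks, 1536)   # hypothetical higher-dimensional model

print(f"384 dims:  {small / 1_000_000:.2f} MB")
print(f"1536 dims: {large / 1_000_000:.2f} MB")
```

Storage and per-comparison work both grow linearly with the dimension count, so a model with four times the dimensions costs roughly four times as much to store and compare.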