Semantic Search Engine Built with CocoIndex in 2 Days

How I Built a Semantic Search Engine with CocoIndex

Linghua Jin built a semantic search engine using CocoIndex, achieving 30-second indexing for 500+ documents and 50ms query responses.

Why This Matters

Traditional keyword-based search matches literal terms and misses context, so relevant documents that use different wording never surface. Semantic search, powered by vector embeddings, bridges this gap but requires efficient infrastructure. This CocoIndex project shows that a lightweight embedding model and vector storage in Postgres can deliver query times of about 50 ms while avoiding the complexity of traditional search systems.
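
To make the idea concrete, the sketch below ranks a few phrases by cosine similarity, the measure behind the query in the Working Example. The 3-dimensional vectors are invented for illustration and are not model output; a real model such as all-MiniLM-L6-v2 produces 384-dimensional vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" with made-up values, chosen so that related phrases
# point in similar directions.
toy_vectors = {
    "machine learning": [0.9, 0.8, 0.1],
    "teaching computers": [0.8, 0.9, 0.2],
    "banana bread recipe": [0.1, 0.2, 0.9],
}

query_text = "teaching computers"
query_vector = toy_vectors[query_text]

# Rank the other phrases from most to least similar to the query.
ranked = sorted(
    (name for name in toy_vectors if name != query_text),
    key=lambda name: cosine_similarity(query_vector, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```

The two phrases share no keywords, yet "machine learning" ranks first because its vector points in nearly the same direction as the query's.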

Key Insights

“500+ markdown files indexed in 30 seconds” (Real-World Example)
Semantic embeddings let a query such as “teaching computers” match documents about “machine learning” (Key Features)
“Batch indexing improves performance for large document collections” (Performance Tips)
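
The batching point can be sketched generically. The helper below groups items into fixed-size batches so that an expensive step, such as running an embedding model, is called once per batch rather than once per document. The names `batched`, `process_in_batches`, and `embed_batch` are hypothetical and not part of CocoIndex, and `embed_batch` is a stand-in that returns text lengths instead of real embeddings.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def process_in_batches(
    items: Iterable[T],
    batch_size: int,
    process_batch: Callable[[list[T]], list[R]],
) -> list[R]:
    """Apply process_batch to each batch and flatten the results in order."""
    results: list[R] = []
    for batch in batched(items, batch_size):
        results.extend(process_batch(batch))
    return results


# Record the size of every batch the stand-in "model" receives.
batch_sizes_seen: list[int] = []


def embed_batch(texts: list[str]) -> list[int]:
    """Hypothetical stand-in for a batched embedding call."""
    batch_sizes_seen.append(len(texts))
    return [len(text) for text in texts]


docs = [f"document {i}" for i in range(10)]
lengths = process_in_batches(docs, batch_size=4, process_batch=embed_batch)
print(batch_sizes_seen)
```

Ten documents with a batch size of 4 result in three calls instead of ten. The right batch size depends on the model and available memory.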

Working Example

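# Install CocoIndex and the Postgres client (shell command, not Python):
#   pip install cocoindex "psycopg[binary,pool]"

import cocoindex
from psycopg_pool import ConnectionPool


# Shared embedding step, used for both indexing and querying so that
# documents and queries are embedded by the same model.
@cocoindex.transform_flow()
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Read every file under ./markdown_files
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files")
    )
    doc_embeddings = data_scope.add_collector()

    # Process and chunk documents
    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500
        )

        # Embed each chunk and collect one row per chunk
        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].call(text_to_embedding)
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export to Postgres. This step is reconstructed following the CocoIndex
    # quickstart; module and argument names may differ between versions.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )


# Perform semantic search
def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Table that CocoIndex created for the "doc_embeddings" export
    table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
    # Embed the query with the same model used at indexing time
    query_vector = text_to_embedding.eval(query)
    with pool.connection() as conn:
        with conn.cursor() as cur:
            # <=> is pgvector's cosine distance operator (smaller is closer).
            # table_name comes from CocoIndex, not from user input.
            cur.execute(f"""
                SELECT filename, text, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            # Convert cosine distance into a similarity score
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]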
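
To show what the `chunk_size` and `chunk_overlap` arguments control, here is a plain sliding-window splitter over characters. It is a simplification for illustration and not the algorithm behind `SplitRecursively`; `sliding_chunks` is a hypothetical helper, and the 5,000-character sample text is made up.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Each result is a
    (start_offset, chunk_text) pair.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: list[tuple[int, str]] = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


# A 5,000-character sample document (digits repeated).
sample = "".join(str(i % 10) for i in range(5000))
chunks = sliding_chunks(sample, chunk_size=2000, chunk_overlap=500)
for start, chunk in chunks:
    print(start, len(chunk))
```

With these settings each window starts 1,500 characters after the previous one, so the last 500 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary intact in at least one chunk, at the cost of storing and embedding some text twice.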

Practical Applications

Use Case: Documentation search with 500+ markdown files using semantic embeddings
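Pitfall: Choosing an embedding model without weighing accuracy against performance. The all-MiniLM-L6-v2 model used here produces 384-dimensional vectors; higher-dimensional models may improve accuracy but cost more storage and query time.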
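
The storage side of this trade-off is simple arithmetic. The sketch below estimates raw vector storage, assuming 4-byte float32 components and ignoring index and row overhead. The chunk count of 5,000 and the 768 and 1536 dimension sizes are illustrative assumptions, not figures from the article.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Bytes needed to store num_vectors float32 vectors, excluding overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


num_chunks = 5_000  # hypothetical number of indexed chunks
for dims in (384, 768, 1536):
    mib = raw_vector_bytes(num_chunks, dims) / (1024 * 1024)
    print(f"{dims:>5} dims: {mib:.1f} MiB")
```

Storage grows linearly with dimension, and so does the work for each distance computation, since every component of both vectors is read.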

References:

https://dev.to/cocoindex/how-i-built-a-semantic-search-engine-with-cocoindex-5ak9
