Semantic Search Engine Built with CocoIndex in 2 Days

These articles are AI-generated summaries. Please check the original sources for full details.

How I Built a Semantic Search Engine with CocoIndex

Linghua Jin built a semantic search engine with CocoIndex that indexes 500+ markdown documents in 30 seconds and answers queries in 50 ms.

Why This Matters

Traditional keyword-based search matches literal terms and misses context, which leads to poor user experiences. Semantic search, powered by vector embeddings, bridges this gap but requires efficient infrastructure. This project shows that CocoIndex, combined with a small embedding model and lightweight vector storage, can reach query times of roughly 50 ms without the operational complexity of traditional search systems.

Key Insights

  • “500+ markdown files indexed in 30 seconds” (Real-World Example)
  • “Semantic embeddings allow ‘teaching computers’ to match ‘machine learning’” (Key Features)
  • “Batch indexing improves performance for large document collections” (Performance Tips)
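The batch-indexing tip can be sketched in plain Python. This is a minimal illustration of the general idea, not CocoIndex's API: `batched`, `index_in_batches`, and the `embed_batch` callable are hypothetical names introduced here, and the default batch size of 32 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` items, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def index_in_batches(
    docs: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical embedding callable
    batch_size: int = 32,  # arbitrary default, chosen for illustration
) -> List[Vector]:
    """Embed `docs` one batch at a time and return one vector per document."""
    vectors: List[Vector] = []
    for batch in batched(docs, batch_size):
        # One embedding call per batch instead of one call per document.
        vectors.extend(embed_batch(batch))
    return vectors
```

Grouping documents this way amortizes per-call overhead (model invocation, database round trips) across many documents, which is the usual reason batching helps on large collections.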

Working Example

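# Install CocoIndex (shell command, run once)
pip install cocoindex

# Python: imports used by the flow definition and the search function
import cocoindex
from psycopg_pool import ConnectionPool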

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))
    doc_embeddings = data_scope.add_collector()
    # Process and chunk documents
    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500
        )
        # Embed each chunk and collect it for export to Postgres
        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])
    # Note: the step that exports doc_embeddings to Postgres is not shown in this excerpt.
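# Perform semantic search
# Note: text_to_embedding (used below) is not defined in this excerpt; it is presumably
# a helper that embeds the query with the same model used for indexing.
def search(pool: ConnectionPool, query: str, top_k: int = 5):
    table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
    query_vector = text_to_embedding.eval(query)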
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(f"""
                SELECT filename, text, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]
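The query above relies on pgvector's `<=>` operator, which returns cosine distance (1 minus cosine similarity), so `1.0 - distance` turns it back into a similarity score. The sketch below reproduces that ranking logic in plain Python to show what the SQL computes. It is an in-memory illustration only: `cosine_distance` and `rank_by_score` are hypothetical helper names, and a real deployment would leave this work to the database.

```python
import math
from typing import Dict, List, Sequence, Tuple, Union

Row = Tuple[str, str, Sequence[float]]  # (filename, text, embedding)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_score(
    query_vector: Sequence[float], rows: Sequence[Row], top_k: int = 5
) -> List[Dict[str, Union[str, float]]]:
    """Return the top_k rows closest to the query, mirroring ORDER BY distance LIMIT top_k."""
    scored = [
        (cosine_distance(query_vector, embedding), filename, text)
        for filename, text, embedding in rows
    ]
    scored.sort(key=lambda item: item[0])  # smallest distance first
    return [
        {"filename": filename, "text": text, "score": 1.0 - distance}
        for distance, filename, text in scored[:top_k]
    ]
```

A score near 1.0 means the chunk points in almost the same direction as the query in embedding space; a score near 0 means the two are unrelated.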

Practical Applications

  • Use Case: Documentation search with 500+ markdown files using semantic embeddings
  • Pitfall: Choosing embedding dimensions without balancing accuracy and performance (e.g., 384 dimensions vs. higher-dimensional models)
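To make the dimension trade-off concrete, here is a back-of-the-envelope sketch. It assumes embeddings are stored as 4-byte floats and ignores per-row and index overhead. The function name `embedding_storage_bytes`, the chunk count, and the 1536-dimension comparison model are hypothetical figures chosen for illustration, not measurements from the article.

```python
def embedding_storage_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload size, assuming one fixed-width float per dimension."""
    if num_chunks < 0 or dimensions <= 0 or bytes_per_value <= 0:
        raise ValueError("num_chunks must be >= 0; dimensions and bytes_per_value must be > 0")
    return num_chunks * dimensions * bytes_per_value


# Hypothetical corpus size, for illustration only.
NUM_CHUNKS = 10_000

small = embedding_storage_bytes(NUM_CHUNKS, 384)    # 384 dims, e.g. all-MiniLM-L6-v2
large = embedding_storage_bytes(NUM_CHUNKS, 1536)   # hypothetical higher-dimensional model

print(f"384 dims:  {small / 1_000_000:.2f} MB")
print(f"1536 dims: {large / 1_000_000:.2f} MB")
print(f"ratio:     {large / small:.1f}x")
```

Storage and the cost of each distance computation both grow linearly with the number of dimensions, so a larger model has to earn its extra accuracy against that overhead.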
