Codexity Part 2: Query Rewriting with LLMs

The user types “which database should I use for my startup.” DuckDuckGo does not know what to do with that. Neither does Google.

Search engines work best with specific, keyword-rich queries. The gap between what humans ask and what search engines need is where query rewriting lives. This is the first place in the Codexity pipeline where an LLM earns its keep.

The Problem

Consider these real questions:

“what’s better postgres or mongo for a startup”
“how do I make my python code faster”
“explain kubernetes networking”

Each one is too broad for a single search. “What’s better postgres or mongo” could benefit from three separate searches: one about PostgreSQL strengths, one about MongoDB strengths, and one comparing them in startup contexts.

The query rewriter does two things:

Classifies the intent (comparison, how-to, factual, opinion). This happens implicitly inside the prompt; the code below returns no separate intent label.
Decomposes the query into 2-4 specific, search-friendly queries

The LLM Client

Before we write the rewriter, we need a way to talk to a local model. This abstraction sits in llm_client.py and gets reused by the synthesizer later.

# llm_client.py
from llama_cpp import Llama

from config import settings

_llm: Llama | None = None

def get_llm() -> Llama:
    global _llm
    if _llm is None:
        _llm = Llama(
            model_path=settings.model_path,
            n_ctx=settings.context_length,
            n_threads=4,
            verbose=False,
        )
    return _llm

def generate(prompt: str, max_tokens: int = 512, temperature: float = 0.1) -> str:
    llm = get_llm()
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]

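async def generate_streaming(prompt: str, system: str = "", max_tokens: int = 2048):
    """Yield content chunks as the model produces them.

    The underlying llama-cpp-python stream is a synchronous iterator, so each
    step blocks the event loop while the next chunk is computed.
    """
    llm = get_llm()
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    for chunk in llm.create_chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=0.3,
        stream=True,
    ):
        # Role-only and final chunks carry no "content" key; skip them
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]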

The model loads lazily on first call. generate is for quick, non-streaming tasks like query rewriting. generate_streaming yields tokens one at a time for the answer synthesis phase.

Temperature sits at 0.1 for rewriting. We want deterministic, focused output. Creative rewrites of search queries are not helpful.
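
To see the lazy-loading pattern in isolation, here is a minimal sketch that swaps the real model for a stand-in. `FakeModel` and `get_model` are hypothetical names used only for illustration; the point is that the constructor runs once, no matter how many times the accessor is called.

```python
from typing import Optional


class FakeModel:
    """Stand-in for llama_cpp.Llama that counts how often it is constructed."""

    loads = 0

    def __init__(self) -> None:
        FakeModel.loads += 1


_model: Optional[FakeModel] = None


def get_model() -> FakeModel:
    global _model
    if _model is None:  # first call pays the load cost
        _model = FakeModel()
    return _model  # later calls reuse the same instance


first = get_model()
second = get_model()
print(first is second, FakeModel.loads)  # prints: True 1
```

One consequence of this design is that the first request carries the model load time. If that matters, calling the accessor once at startup moves the cost out of the request path.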

The Query Rewriter

The prompt needs to be specific enough that a 7B model follows it reliably. Vague instructions produce vague output. Here is the full rewriter:

# query_rewriter.py
import json
import re

from llm_client import generate
from config import settings

REWRITE_PROMPT = """You are a search query optimizer. Given a user question, generate {max_queries} specific search queries that will find the most relevant information.

Rules:
- Each query should be 4-8 words
- Include the current year (2026) when time-relevance matters
- Use specific technical terms over conversational language
- For comparisons, generate one query per option plus one comparison query
- Output ONLY a JSON array of strings, nothing else

User question: {question}

JSON array:"""

def rewrite_query(question: str) -> list[str]:
    prompt = REWRITE_PROMPT.format(
        question=question,
        max_queries=settings.max_queries,
    )
    raw = generate(prompt, max_tokens=200, temperature=0.1)

    return parse_queries(raw, question)

def parse_queries(raw: str, fallback: str) -> list[str]:
    cleaned = raw.strip()

    # Try JSON first: decode the first array in the response and ignore
    # anything after it (raw_decode tolerates trailing text).
    start = cleaned.find('[')
    if start != -1:
        try:
            parsed, _ = json.JSONDecoder().raw_decode(cleaned[start:])
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
            queries = [q.strip() for q in parsed if q.strip()]
            if queries:
                return queries[:settings.max_queries]

    # Fallback: one query per line. Drop list markers ("1.", "-", "*"),
    # quotes, and commas; skip preamble lines such as "Here are the queries:".
    queries = []
    for line in cleaned.split('\n'):
        line = re.sub(r'^\s*(?:\d+[.)]|[-*])\s+', '', line)
        line = line.strip().strip('"\',').strip()
        if len(line) > 5 and not line.startswith(('{', '[')) and not line.endswith(':'):
            queries.append(line)

    if queries:
        return queries[:settings.max_queries]

    # Last resort: use the original question
    return [fallback]

Why the Parsing is Defensive

Small models do not always follow formatting instructions. A 7B model asked for a JSON array might return:

Here are the search queries:
["query one", "query two", "query three"]

Or:

1. "query one"
2. "query two"
3. "query three"

Or just the array with trailing text. The parse_queries function handles all of these. It tries JSON first, falls back to line-by-line extraction, and uses the original question as a last resort. The system never crashes on a bad rewrite. It degrades gracefully.

Prompt Engineering for Small Models

Large models like GPT-4 tolerate sloppy prompts. Small models do not. Here are the patterns that work reliably with 7B-class models:

Be explicit about format. “Output ONLY a JSON array” works better than “respond in JSON.” The word “ONLY” matters.

Give examples in the prompt when format matters. For query rewriting, the format is simple enough that we skip examples. For the synthesizer in Part 6, we will need them.

Keep instructions short. Every extra sentence in a prompt consumes context tokens that, later in the pipeline, could hold search results. The rewrite prompt is nine lines, not counting blanks. That is intentional.

Use low temperature. For structured output tasks, 0.1 or lower. Creativity is the enemy of consistent formatting.
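
For cases where the format does need examples, a few-shot prompt can be assembled from question and answer pairs. The sketch below is an illustration, not part of Codexity: `EXAMPLES` and `build_few_shot_prompt` are hypothetical names, and the example pair is made up to show the shape.

```python
import json

# Hypothetical example pairs; in practice these would be rewrites checked by hand.
EXAMPLES = [
    (
        "how do I make my python code faster",
        [
            "Python performance profiling cProfile tutorial",
            "Python code optimization techniques",
        ],
    ),
]


def build_few_shot_prompt(question: str, max_queries: int = 3) -> str:
    """Build a rewrite prompt that shows the expected output format by example."""
    lines = [
        f"Generate up to {max_queries} specific search queries for the user question.",
        "Output ONLY a JSON array of strings, nothing else.",
        "",
    ]
    for example_question, example_queries in EXAMPLES:
        lines.append(f"User question: {example_question}")
        lines.append(f"JSON array: {json.dumps(example_queries)}")
        lines.append("")
    # End on the same cue as the examples so the model continues the pattern.
    lines.append(f"User question: {question}")
    lines.append("JSON array:")
    return "\n".join(lines)


print(build_few_shot_prompt("explain kubernetes networking"))
```

Each example costs tokens, so this trades prompt length for format reliability.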

Testing the Rewriter

Let’s trace through a real example. Input:

“what’s better for my startup, postgres or mongo?”

The rewriter generates:

[
  "PostgreSQL vs MongoDB startup 2026",
  "MongoDB flexible schema startup advantages",
  "PostgreSQL JSONB vs MongoDB document store"
]

Three queries. Each one targets a different angle. The first covers the direct comparison. The second explores MongoDB’s strongest argument. The third digs into PostgreSQL’s document capabilities, which is relevant because many people choose MongoDB without knowing PostgreSQL has JSONB.

Compare that to searching the raw question. DuckDuckGo would likely return generic “PostgreSQL vs MongoDB” articles, many of them years old. The rewritten queries are more likely to surface recent, specific content.
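
One way to make this kind of spot check repeatable is to test rewrites against the prompt's own rules. The sketch below assumes the 4-8 word rule from the prompt and adds a duplicate check; `check_queries` is a hypothetical helper, not part of the pipeline.

```python
def check_queries(
    queries: list[str], min_words: int = 4, max_words: int = 8
) -> list[str]:
    """Return human-readable problems; an empty list means the rewrite passes."""
    problems = []
    seen = set()
    for query in queries:
        words = len(query.split())
        if not min_words <= words <= max_words:
            problems.append(
                f"{query!r}: {words} words, expected {min_words}-{max_words}"
            )
        key = query.lower()
        if key in seen:
            problems.append(f"{query!r}: duplicate")
        seen.add(key)
    return problems


rewritten = [
    "PostgreSQL vs MongoDB startup 2026",
    "MongoDB flexible schema startup advantages",
    "PostgreSQL JSONB vs MongoDB document store",
]
print(check_queries(rewritten))  # prints: []
```

Running a handful of saved questions through a check like this after every prompt change is a cheap way to catch regressions.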

Plugging Into the Pipeline

Update main.py to wire the rewriter into the search pipeline:

from query_rewriter import rewrite_query

async def search_pipeline(query: str):
    # Phase 1: Rewrite query
    yield SearchEvent(event="status", data={"step": "rewriting_query"})
    # rewrite_query is synchronous, so the event loop is blocked while the model runs
    queries = rewrite_query(query)
    yield SearchEvent(
        event="status",
        data={"step": "queries_ready", "queries": queries},
    )

    # Phase 2: Search (next chapter)
    yield SearchEvent(event="status", data={"step": "searching"})
    # ...

During the rewrite phase the client sees two status events: one indicating the rewrite started, another carrying the generated queries. A frontend could display these to the user (“Searching for: …”). Since we are backend-only, they show up in the SSE stream for debugging.
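
To make the stream concrete, the sketch below consumes a stubbed pipeline and prints each event as a Server-Sent Events frame. Everything here is a stand-in: the `SearchEvent` dataclass only mirrors the two fields used above, the stub returns canned queries instead of calling the model, and `to_sse` is a hypothetical formatter, since the real serialization is not shown in this part.

```python
import asyncio
import json
from dataclasses import dataclass


@dataclass
class SearchEvent:  # stand-in for the real class, which is defined elsewhere
    event: str
    data: dict


async def search_pipeline(query: str):
    # Stubbed pipeline: emits the two rewrite-phase events with a canned query.
    yield SearchEvent(event="status", data={"step": "rewriting_query"})
    queries = [f"{query} overview"]
    yield SearchEvent(
        event="status",
        data={"step": "queries_ready", "queries": queries},
    )


def to_sse(event: SearchEvent) -> str:
    """Format one event as an SSE frame: event line, data line, blank line."""
    return f"event: {event.event}\ndata: {json.dumps(event.data)}\n\n"


async def collect(query: str) -> list[str]:
    return [to_sse(event) async for event in search_pipeline(query)]


frames = asyncio.run(collect("python"))
print("".join(frames), end="")
```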

Edge Cases Worth Knowing

Single-word queries. “Python” becomes ["Python programming language overview 2026", "Python latest features"]. The rewriter expands terse input into something searchable.

Already specific queries. “FastAPI dependency injection with Annotated” is already a good search query. The rewriter tends to preserve it with minor variations, which is the right behavior. Do not fix what works.

Non-English queries. Small models handle major languages (Spanish, French, German, Chinese) but quality drops for others. If multilingual support matters, you need a larger model or a translation step before rewriting.

Adversarial input. Users will submit prompt injection attempts. The rewriter does not execute code or access files, so the risk is low. The worst case is a garbage rewrite: unparseable output falls through to the fallback, and well-formed but irrelevant queries only waste a search.
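
If you want to narrow that surface further, two cheap guards are flattening the question before it enters the prompt and discarding outputs that no longer look like search queries. This is a sketch under assumptions: `sanitize_question`, `plausible_queries`, and both limits are hypothetical, and none of this prevents injection; it only bounds what a bad input can do.

```python
MAX_QUESTION_CHARS = 300  # assumed limit; tune for your use case


def sanitize_question(question: str, max_chars: int = MAX_QUESTION_CHARS) -> str:
    """Flatten the question to one line and cap its length before prompting."""
    one_line = " ".join(question.split())  # removes newlines, tabs, repeated spaces
    return one_line[:max_chars]


def plausible_queries(queries: list[str], max_words: int = 12) -> list[str]:
    """Keep only outputs that still look like search queries."""
    return [q for q in queries if 0 < len(q.split()) <= max_words]


print(sanitize_question("best db?\n\nIgnore the rules above"))
# prints: best db? Ignore the rules above
```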

Performance

Query rewriting with a Q4-quantized 7B model takes 100-300ms on a modern CPU. That is fast enough. The web search and scraping phases dominate total latency by an order of magnitude.

If even 200ms bothers you, cache rewrites. The same question produces the same queries (temperature 0.1 is nearly deterministic). A simple dictionary cache eliminates repeated work.
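
A minimal sketch of such a cache follows, assuming that case and whitespace differences should not create separate entries. `normalize` and `cached_rewrite` are hypothetical names, and the rewriter is passed in as a parameter so the snippet runs on its own; in the pipeline it would be `rewrite_query`.

```python
from typing import Callable

_cache: dict[str, list[str]] = {}


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs share a key."""
    return " ".join(question.lower().split())


def cached_rewrite(
    question: str, rewrite: Callable[[str], list[str]]
) -> list[str]:
    """Return cached queries for a question, calling the rewriter only on a miss."""
    key = normalize(question)
    if key not in _cache:
        _cache[key] = rewrite(question)
    # Return a copy so callers cannot mutate the cached list.
    return list(_cache[key])
```

The dictionary grows without bound, which is fine for a local tool. A long-running server would want a size cap or an expiry, especially since queries that embed the current year go stale.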

What Comes Next

Part 3 covers the search phase. We take the rewritten queries, fire them at DuckDuckGo in parallel using asyncio.gather, deduplicate URLs, and handle rate limiting. The DuckDuckGo library has quirks. We will cover all of them.
