Codexity Part 1: Architecture of an Answer Engine


Perplexity takes a question, searches the web, reads the pages, and writes an answer with citations. That description fits on a napkin. The implementation does not.

This series builds a fully functional clone called Codexity. No frontend. Pure Python backend. By the end of the final chapter, you will have a working API that accepts a natural language question, searches the web, scrapes sources, synthesizes an answer through a local LLM, and streams it back token by token over Server-Sent Events.

What Perplexity Actually Does

Before building anything, we need to understand the pipeline. Every query flows through five phases:

  1. Query Understanding: The raw user question gets rewritten into search-engine-friendly queries. A vague question like “which database should I use” becomes multiple targeted searches.
  2. Web Search: Those rewritten queries hit a search engine. We use DuckDuckGo because it has no API key requirement and a solid Python library.
  3. Web Scraping: The top URLs from search results get fetched and their content extracted. This is where things get ugly. JavaScript-rendered pages, anti-bot measures, rate limiting.
  4. Content Processing: Raw HTML gets stripped down to clean text, chunked, and ranked by relevance to the original question.
  5. Answer Synthesis: A language model reads the ranked chunks and generates a cited answer, streamed to the client in real time.

[Figure: Codexity System Architecture]

The Tech Stack

Everything runs on Python 3.12+. Here is the full dependency list and why each library was chosen:

Library             Purpose
fastapi             HTTP server with native async and SSE support
uvicorn             ASGI server to run FastAPI
httpx               Async HTTP client for scraping
duckduckgo-search   Web search without API keys
playwright          Browser automation for JS-heavy pages
beautifulsoup4      HTML parsing and content extraction
readability-lxml    Article extraction (Readability algorithm)
llama-cpp-python    Local LLM inference with OpenAI-compatible API
rank-bm25           BM25 scoring for chunk relevance
sse-starlette       Server-Sent Events for FastAPI

No OpenAI. No paid APIs. The entire stack runs on your machine.

Project Structure

codexity/
├── main.py              # FastAPI app + SSE endpoint
├── query_rewriter.py    # LLM-based query decomposition
├── searcher.py          # DuckDuckGo async search
├── scraper.py           # Tiered scraping (httpx + Playwright)
├── content_processor.py # HTML stripping, chunking, ranking
├── synthesizer.py       # LLM answer generation
├── llm_client.py        # Abstraction over llama-cpp-python
├── config.py            # Settings and constants
└── models.py            # Pydantic models

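Nine files. Five of them map one-to-one to a pipeline stage (query_rewriter.py, searcher.py, scraper.py, content_processor.py, synthesizer.py). The other four (main.py, llm_client.py, config.py, models.py) are shared plumbing.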

Setting Up the Project

mkdir codexity && cd codexity
python -m venv .venv
source .venv/bin/activate

Create pyproject.toml:

[project]
name = "codexity"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "fastapi>=0.115.0",
    "uvicorn[standard]>=0.30.0",
    "httpx>=0.27.0",
    "duckduckgo-search>=6.3.0",
    "playwright>=1.48.0",
    "beautifulsoup4>=4.12.0",
    "readability-lxml>=0.8.0",
    "lxml>=5.0.0",
    "llama-cpp-python>=0.3.0",
    "rank-bm25>=0.2.2",
    "sse-starlette>=2.0.0",
    "pydantic>=2.0.0",
    "pydantic-settings>=2.0.0",
]

Install everything:

pip install -e .
playwright install chromium

The Playwright install downloads a Chromium binary. We will need it for JavaScript-rendered pages in Part 4.

The Data Models

Every stage of the pipeline passes typed data to the next. Define these models upfront so the contract between components is clear.

# models.py
from pydantic import BaseModel

class SearchResult(BaseModel):
    title: str
    url: str
    snippet: str

class ScrapedPage(BaseModel):
    url: str
    title: str
    content: str
    success: bool

class TextChunk(BaseModel):
    text: str
    source_url: str
    source_title: str
    relevance_score: float = 0.0

class SourceReference(BaseModel):
    index: int
    title: str
    url: str

class SearchEvent(BaseModel):
    event: str
    data: dict

SearchEvent is the SSE payload. Every message the server sends to the client follows this format: an event type (status, sources, token, done) and a data dictionary.
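
To make that format concrete, here is a minimal sketch of what one event looks like on the wire. It uses only the standard library. The helper name format_sse and the "text" key in the token payload are made up for illustration, and the data is assumed to be JSON-encoded. In the real app sse-starlette does this serialization, and its exact line endings may differ.

```python
import json

def format_sse(event: str, data: dict) -> str:
    """Serialize one event into the text/event-stream wire format (illustrative helper)."""
    payload = json.dumps(data)
    # Each message is a set of "field: value" lines terminated by a blank line.
    return f"event: {event}\ndata: {payload}\n\n"

messages = [
    format_sse("status", {"step": "searching"}),
    format_sse("token", {"text": "Python"}),  # "text" is a hypothetical key
    format_sse("done", {}),
]
print("".join(messages), end="")
```

The blank line after each data line is what tells the client that one message has ended and the next can begin.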

The Config Module

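# config.py
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Pydantic v2 style config. Overriding protected_namespaces keeps the
    # model_path field from clashing with Pydantic's reserved "model_" prefix.
    model_config = SettingsConfigDict(
        env_file=".env",
        protected_namespaces=("settings_",),
    )

    # LLM
    model_path: str = "./models/qwen2.5-7b-instruct-q4_k_m.gguf"
    context_length: int = 8192
    max_tokens: int = 2048

    # Search
    max_search_results: int = 8
    max_queries: int = 3

    # Scraping
    scrape_timeout: int = 15  # seconds
    max_concurrent_scrapes: int = 5

    # Content processing
    chunk_size: int = 512
    chunk_overlap: int = 50
    top_k_chunks: int = 10

# Values can be overridden by environment variables or the .env file.
settings = Settings()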

Every magic number lives here. When we start tuning performance in later chapters, this is the file that changes.
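
To give chunk_size and chunk_overlap some intuition before Part 5, here is a rough sketch of overlapping chunking. It assumes the sizes are counted in whitespace-separated words, which is a guess (the real unit could be tokens or characters), and the function name chunk_words is hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that overlap by chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

demo = chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(demo)  # ['a b c d', 'd e f g', 'g h i j']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.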

The FastAPI Skeleton

# main.py
import asyncio  # used once the pipeline stages are filled in
import json

from fastapi import FastAPI, Query
from sse_starlette.sse import EventSourceResponse

from config import settings  # used once the pipeline stages are filled in
from models import SearchEvent

app = FastAPI(title="Codexity", version="0.1.0")

async def search_pipeline(query: str):
    """
    Main pipeline generator. Yields SSE events as each phase completes.
    """
    # Phase 1: Rewrite query
    yield SearchEvent(event="status", data={"step": "rewriting_query"})
    # ... (implemented in Part 2)

    # Phase 2: Search
    yield SearchEvent(event="status", data={"step": "searching"})
    # ... (implemented in Part 3)

    # Phase 3: Scrape
    yield SearchEvent(event="status", data={"step": "scraping"})
    # ... (implemented in Part 4)

    # Phase 4: Process content
    yield SearchEvent(event="status", data={"step": "processing"})
    # ... (implemented in Part 5)

    # Phase 5: Generate answer
    yield SearchEvent(event="status", data={"step": "generating"})
    # ... (implemented in Part 6-7)

    yield SearchEvent(event="done", data={})

@app.get("/search")
async def search(q: str = Query(..., min_length=1)):
    async def event_generator():
        async for event in search_pipeline(q):
            # sse-starlette converts the data field with str(), so encode the
            # dict as JSON explicitly instead of sending a Python repr.
            yield {"event": event.event, "data": json.dumps(event.data)}

    return EventSourceResponse(event_generator())

@app.get("/health")
async def health():
    return {"status": "ok"}

The /search endpoint is an SSE stream. The client opens a persistent connection, and the server pushes events as each pipeline phase completes. Status updates first, source URLs second, answer tokens last.

This is the skeleton. Run it:

uvicorn main:app --reload --host 0.0.0.0 --port 8000

Test with curl:

curl -N "http://localhost:8000/search?q=what+is+python"

You will see the status events fire, but nothing meaningful yet. The pipeline is hollow. Each subsequent chapter fills in one stage.
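
If you would rather consume the stream from Python than from curl, the parsing side is small. The sketch below is a simplified reader for the text/event-stream format, assuming every data field carries JSON. parse_sse is a hypothetical helper, not part of Codexity, and it runs here against hard-coded lines rather than a live connection.

```python
import json

def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of SSE text lines."""
    event, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current message.
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive ping
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())

# Hard-coded sample of what the skeleton might emit (assumed JSON payloads).
stream = [
    "event: status\n", 'data: {"step": "searching"}\n', "\n",
    ": ping\n", "\n",
    "event: done\n", "data: {}\n", "\n",
]
for name, data in parse_sse(stream):
    print(name, data)
```

In practice you could feed this the lines of a streaming HTTP response. Retry fields and event IDs are left out on purpose.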

Why Async Everywhere

The entire pipeline is async. Search calls happen in parallel. Scraping runs concurrently with a semaphore. LLM tokens stream as they generate.

A synchronous implementation would work, but the latency would be brutal. A single web search takes 500ms-2s. Scraping 10 pages sequentially at 3s each means 30 seconds of dead time. With asyncio.gather, those 10 pages fetch concurrently and finish in roughly the time of the slowest one, or in about two waves once a semaphore caps concurrency at max_concurrent_scrapes = 5.
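
The effect is easy to reproduce without touching the network. In this sketch asyncio.sleep stands in for the page fetch. The names fake_fetch and fetch_all, the example.com URLs, and the 0.05s delay are all made up for the demo. The semaphore limit of 5 mirrors max_concurrent_scrapes.

```python
import asyncio
import time

async def fake_fetch(url: str, delay: float, sem: asyncio.Semaphore) -> str:
    """Stand-in for a page fetch: sleeps for `delay` seconds instead of using the network."""
    async with sem:  # at most `limit` fetches run at the same time
        await asyncio.sleep(delay)
        return url

async def fetch_all(urls: list[str], delay: float, limit: int) -> tuple[list[str], float]:
    sem = asyncio.Semaphore(limit)
    start = time.perf_counter()
    # gather schedules every fetch at once and returns results in input order.
    results = await asyncio.gather(*(fake_fetch(u, delay, sem) for u in urls))
    return results, time.perf_counter() - start

urls = [f"https://example.com/{i}" for i in range(10)]
pages, elapsed = asyncio.run(fetch_all(urls, delay=0.05, limit=5))
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Ten fetches with a limit of five should take about two delays in total, not ten. Raise the limit to 10 and it drops to about one.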

The async model also enables SSE naturally. While the LLM generates tokens, each one yields back to the event loop, which pushes it to the client immediately. No buffering, no polling.

What Comes Next

Part 2 covers query rewriting. A user types “what database for my startup”. The rewriter turns that into two or three search-engine queries that will actually return useful results. This is where we first touch the LLM, and where the quality of the entire system gets decided.

The series progresses like this:

  • Part 2: Query rewriting and decomposition with LLMs
  • Part 3: Async web search with DuckDuckGo
  • Part 4: Web scraping, proxies, and anti-bot measures
  • Part 5: Content processing and relevance ranking
  • Part 6: Small model inference with llama-cpp-python
  • Part 7: Server-Sent Events and streaming
  • Part 8: Full integration, testing, and deployment

Each part builds on the previous one. By Part 8, every stub in search_pipeline will be replaced with real code.
