Codexity Part 5: Content Processing and Relevance Ranking

Twelve pages scraped. Around 60,000 words of raw text, or roughly 78,000 tokens. A 7B model can process maybe 6,000 tokens of context. That means more than 90% of the scraped content has to go. The question is which small fraction to keep.

Dumping entire pages into the prompt produces bad answers. Irrelevant paragraphs dilute useful information, confuse the model, and waste context window space. The content processor solves this by chunking the text, scoring each chunk’s relevance to the original question, and selecting only the top-scoring fragments.

Step 1: Text Chunking

The raw text from each page needs to be split into chunks small enough to score individually. Too large and irrelevant paragraphs hide inside a chunk that scores well because of one good sentence. Too small and you lose context between sentences.

# content_processor.py
from models import ScrapedPage, TextChunk
from config import settings

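def chunk_text(
    text: str,
    chunk_size: int | None = None,
    overlap: int | None = None,
) -> list[str]:
    """Split text into overlapping chunks by token count (approximated by words)."""
    # Compare against None so an explicit overlap=0 is respected.
    chunk_size = settings.chunk_size if chunk_size is None else chunk_size
    overlap = settings.chunk_overlap if overlap is None else overlap
    # overlap >= chunk_size would never advance the window (infinite loop).
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    if len(words) <= chunk_size:
        return [text]

    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # this window reached the end; skip a redundant tail chunk
        start += step

    return chunks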

def pages_to_chunks(pages: list[ScrapedPage]) -> list[TextChunk]:
    """Convert scraped pages into text chunks (not yet scored)."""
    all_chunks = []
    for page in pages:
        if not page.content.strip():
            continue  # nothing to score on an empty page
        for piece in chunk_text(page.content):
            all_chunks.append(
                TextChunk(
                    text=piece,
                    source_url=page.url,
                    source_title=page.title,
                )
            )
    return all_chunks

512 words per chunk with 50 words of overlap. The overlap ensures that a relevant sentence split across a chunk boundary still appears whole in at least one chunk, as long as the sentence is shorter than the overlap. Without overlap, you lose information at boundaries.
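To see what these numbers do to a page, here is a small sketch that lists the word windows a 512/50 split would produce. The helper name chunk_windows is hypothetical and not part of content_processor.py; it only mirrors the windowing arithmetic.

```python
def chunk_windows(
    n_words: int, chunk_size: int = 512, overlap: int = 50
) -> list[tuple[int, int]]:
    """Return (start, end) word offsets for each overlapping chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    if n_words <= 0:
        return []

    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, n_words)
        windows.append((start, end))
        if end == n_words:
            break  # the last window reached the end of the text
        start += chunk_size - overlap  # advance by the stride (462 by default)
    return windows


print(chunk_windows(1000))
print(len(chunk_windows(5000)))
```

Each window starts 462 words after the previous one, so the last 50 words of one chunk are the first 50 of the next. A 5,000-word page yields 11 chunks under these settings.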

Word-based splitting is a rough approximation of token count. For English text, the ratio is approximately 1.3 tokens per word, so a 512-word chunk is around 660 tokens. Four such chunks come to roughly 2,660 tokens and fit comfortably in a 4000-token context budget, leaving room for the system prompt and the model’s response.
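The budget arithmetic is easy to check in code. This is a rough sketch that assumes the 1.3 tokens-per-word ratio quoted above; the function names are hypothetical, and a real tokenizer will give different counts for code-heavy or non-English text.

```python
import math


def estimate_tokens(n_words: int, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from a word count (rounded up)."""
    return math.ceil(n_words * tokens_per_word)


def chunks_that_fit(
    budget_tokens: int, chunk_words: int = 512, tokens_per_word: float = 1.3
) -> int:
    """How many full-size chunks fit in a token budget."""
    return budget_tokens // estimate_tokens(chunk_words, tokens_per_word)


print(estimate_tokens(512))   # estimated tokens in one full chunk
print(chunks_that_fit(4000))  # upper bound on chunks for a 4000-token budget
```

Under these assumptions six full chunks is the ceiling for a 4000-token budget, so top_k_chunks and chunk_size have to be tuned together.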

Step 2: BM25 Scoring

BM25 is a ranking function used by search engines since the 1990s. It scores documents against a query based on term frequency, inverse document frequency, and document length normalization. No neural network, no embeddings, no GPU. Just math that works.
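To make those three ingredients concrete, here is a from-scratch sketch of the textbook BM25 formula. It is for illustration only: k1=1.5 and b=0.75 are conventional values chosen here as assumptions, the IDF uses the non-negative log(1 + ...) variant, and a library implementation may compute IDF differently, so absolute scores should not be expected to match.

```python
import math
from collections import Counter


def bm25_scores(
    corpus: list[list[str]], query: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score every tokenized document in corpus against the query tokens."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs

    # Document frequency: in how many documents does each term appear?
    df: Counter = Counter()
    for doc in corpus:
        df.update(set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["postgresql", "jsonb", "index"],
    ["mongodb", "schema"],
    ["cooking", "pasta", "recipe"],
]
print(bm25_scores(docs, ["postgresql", "jsonb"]))
```

Only the first document contains the query terms, so it is the only one with a non-zero score.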

from rank_bm25 import BM25Okapi

def score_chunks(chunks: list[TextChunk], query: str) -> list[TextChunk]:
    """Score each chunk's relevance to the query using BM25."""
    if not chunks:
        return []

    corpus = [chunk.text.lower().split() for chunk in chunks]
    bm25 = BM25Okapi(corpus)
    query_tokens = query.lower().split()
    scores = bm25.get_scores(query_tokens)

    for chunk, score in zip(chunks, scores):
        chunk.relevance_score = float(score)

    return sorted(chunks, key=lambda c: c.relevance_score, reverse=True)

BM25 over embeddings? For this use case, yes. Embedding-based similarity requires a model (even a small one like all-MiniLM-L6-v2 adds 200ms and a dependency). BM25 scores the whole chunk set in milliseconds. It handles keyword matching well, which is what we need: does this chunk mention the terms from the user’s question?
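Keyword matching is only as good as the tokenizer. Splitting on whitespace leaves punctuation attached, so "MongoDB?" and "mongodb" count as different terms. A minimal sketch of a stricter tokenizer follows; the function name tokenize is hypothetical, and splitting on every non-alphanumeric character is an assumption that will also break up terms like "c++" or "3.12".

```python
import re

_WORD = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of letters and digits."""
    return _WORD.findall(text.lower())


question = "PostgreSQL vs. MongoDB?"
print(question.lower().split())  # punctuation stays glued to the words
print(tokenize(question))        # punctuation is dropped
```

The same function would need to be applied to both the chunks and the query so that the two sides produce comparable tokens.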

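BM25 has a known weakness with synonyms. A chunk about “databases” might not score well against the query “data stores.” The query rewriter mitigates this by generating multiple queries with varied vocabulary, so the pages that reach this stage already overlap heavily with the question’s terminology. To get the same benefit at scoring time, the rewritten variants have to be part of the query passed to score_chunks, which takes a single string.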
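One way to bring that varied vocabulary into the scoring step is to merge the rewritten queries into a single token list. This is a sketch under assumptions: merge_query_tokens is a hypothetical helper, and the query rewriter's actual output format is not shown in this part of the series.

```python
def merge_query_tokens(queries: list[str]) -> list[str]:
    """Combine several query variants into one de-duplicated token list."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for token in query.lower().split():
            if token not in seen:  # keep first occurrence, preserve order
                seen.add(token)
                merged.append(token)
    return merged


variants = ["postgres data stores", "postgres databases"]
print(merge_query_tokens(variants))
```

De-duplicating matters because a term repeated across variants would otherwise be counted once per repetition and skew the ranking toward it.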

Step 3: Selecting Top-K Chunks

Take the top-scoring chunks, but with a constraint: do not let one source dominate.

def select_top_chunks(
    chunks: list[TextChunk],
    top_k: int | None = None,
    max_per_source: int = 3,
) -> list[TextChunk]:
    """Select top-k chunks with source diversity.

    Expects chunks sorted by relevance, best first (the output of score_chunks).
    """
    top_k = top_k or settings.top_k_chunks
    selected = []
    source_counts: dict[str, int] = {}

    for chunk in chunks:
        url = chunk.source_url
        count = source_counts.get(url, 0)
        if count >= max_per_source:
            continue  # this source already has its share; try the next chunk

        selected.append(chunk)
        source_counts[url] = count + 1

        if len(selected) >= top_k:
            break

    return selected

max_per_source=3 means at most 3 chunks from any single page. If a blog published a 5,000-word article that matches the query perfectly, BM25 could rank most of its chunks at the top. Without the per-source cap, the answer would cite only one source and miss perspectives from other pages.
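The cap has a side effect worth knowing about: when only a few pages were scraped, it can return fewer than top_k chunks. A possible variant, sketched here as an assumption rather than as part of the pipeline, backfills from the skipped chunks. The Chunk dataclass is a hypothetical stand-in for TextChunk.

```python
from dataclasses import dataclass


@dataclass
class Chunk:  # hypothetical stand-in for TextChunk
    text: str
    source_url: str
    relevance_score: float = 0.0


def select_with_backfill(
    chunks: list[Chunk], top_k: int, max_per_source: int = 3
) -> list[Chunk]:
    """Apply the per-source cap, then fill remaining slots from skipped chunks."""
    selected: list[Chunk] = []
    skipped: list[Chunk] = []
    counts: dict[str, int] = {}

    for chunk in chunks:  # assumed sorted by score, best first
        if len(selected) >= top_k:
            break
        seen = counts.get(chunk.source_url, 0)
        if seen >= max_per_source:
            skipped.append(chunk)  # over the cap, keep as a fallback
            continue
        selected.append(chunk)
        counts[chunk.source_url] = seen + 1

    # Diversity first, then fill any empty slots with the best skipped chunks.
    selected.extend(skipped[: top_k - len(selected)])
    selected.sort(key=lambda c: c.relevance_score, reverse=True)
    return selected


ranked = [Chunk(f"a{i}", "https://example.com/a", 9.0 - i) for i in range(5)]
ranked.append(Chunk("b0", "https://example.com/b", 4.0))
picked = select_with_backfill(ranked, top_k=5)
print([c.text for c in picked])
```

With five chunks from one page and one from another, the cap alone would return four chunks; the backfill adds the best skipped chunk to reach five.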

Step 4: Building the Context

The selected chunks need to be formatted into a structured context string that the LLM can reference by source number.

from models import SourceReference

def build_context(chunks: list[TextChunk]) -> tuple[str, list[SourceReference]]:
    """Build LLM context string with numbered source references."""
    # Assign unique source numbers
    source_map: dict[str, int] = {}
    sources: list[SourceReference] = []
    counter = 1

    for chunk in chunks:
        if chunk.source_url not in source_map:
            source_map[chunk.source_url] = counter
            sources.append(SourceReference(
                index=counter,
                title=chunk.source_title,
                url=chunk.source_url,
            ))
            counter += 1

    # Build context string
    context_parts = []
    for chunk in chunks:
        source_num = source_map[chunk.source_url]
        context_parts.append(f"[Source {source_num}]\n{chunk.text}")

    context = "\n\n".join(context_parts)
    return context, sources

The output looks like:

[Source 1]
PostgreSQL offers JSONB columns that store semi-structured data with indexing
support. Unlike MongoDB's BSON, PostgreSQL's JSONB is stored alongside
relational data in the same transaction...

[Source 2]
MongoDB's flexible schema makes it popular for startups that need to iterate
quickly. Schema changes require no migrations...

[Source 3]
In benchmarks comparing PostgreSQL 16 and MongoDB 7, PostgreSQL showed 40%
higher throughput for read-heavy workloads when data fits relational patterns...

Each chunk is tagged with its source number. The synthesizer (Part 6) will generate an answer citing these numbers as [1], [2], etc. The source list maps numbers to URLs, allowing the client to display clickable citations.
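On the client side, turning those citation markers into links is a small text substitution. The sketch below assumes the answer uses [n] markers and emits Markdown links; linkify_citations is a hypothetical helper, and the real client may render citations differently.

```python
import re


def linkify_citations(answer: str, sources: dict[int, str]) -> str:
    """Replace [n] markers with Markdown links to the matching source URL."""

    def replace(match: re.Match) -> str:
        index = int(match.group(1))
        url = sources.get(index)
        # Leave unknown numbers untouched rather than inventing a link.
        return f"[[{index}]]({url})" if url else match.group(0)

    return re.sub(r"\[(\d+)\]", replace, answer)


text = "PostgreSQL stores JSONB with indexes [1], unlike the claim in [7]."
print(linkify_citations(text, {1: "https://example.com/a"}))
```

Markers with no matching source are left as plain text, which makes hallucinated citation numbers easy to spot.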

The Full Processing Pipeline

async def process_content(
    pages: list[ScrapedPage],
    query: str,
) -> tuple[str, list[SourceReference]]:
    """Full content processing pipeline."""
    # Step 1: Chunk
    chunks = pages_to_chunks(pages)

    # Step 2: Score
    scored = score_chunks(chunks, query)

    # Step 3: Select
    selected = select_top_chunks(scored)

    # Step 4: Build context
    context, sources = build_context(selected)

    return context, sources

Four function calls. Input: 12 pages of raw text. Output: a context string of roughly 4000 tokens with numbered sources. The exact size depends on chunk_size and top_k_chunks.
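All four steps are synchronous and CPU-bound, so calling them directly inside an async function holds the event loop until they finish. If that becomes a problem, one option is to run the work in a worker thread. This is a sketch assuming Python 3.9+ for asyncio.to_thread; process_content_sync is a hypothetical stand-in for the chunk, score, select, and build steps.

```python
import asyncio


def process_content_sync(pages: list[str], query: str) -> str:
    """Hypothetical stand-in for the blocking four-step pipeline."""
    return f"{len(pages)} pages processed for {query!r}"


async def process_content_offloaded(pages: list[str], query: str) -> str:
    # Run the blocking call in a worker thread so the event loop
    # can keep handling other tasks, such as streaming status events.
    return await asyncio.to_thread(process_content_sync, pages, query)


result = asyncio.run(process_content_offloaded(["page a", "page b"], "postgres"))
print(result)
```

At 50-200ms per request this is optional; it matters more when several searches share one event loop.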

Plugging Into the Pipeline

from content_processor import process_content

async def search_pipeline(query: str):
    # ... Phase 1, 2, 3 ...

    # Phase 4: Process
    yield SearchEvent(event="status", data={"step": "processing"})
    context, sources = await process_content(pages, query)
    yield SearchEvent(
        event="sources",
        data={
            "sources": [
                {"index": s.index, "title": s.title, "url": s.url}
                for s in sources
            ]
        },
    )

    # Phase 5: Synthesize (next chapter)
    # ...

The sources event now carries structured source metadata instead of just URLs. Each source has an index, a title, and a URL.

Tuning Parameters

The defaults work for general-purpose questions. Different query types benefit from different settings:

Technical comparisons (PostgreSQL vs MongoDB): Increase top_k_chunks to 12 and lower max_per_source to 2. You want breadth across many sources.

How-to guides (how to deploy FastAPI): Increase max_per_source to 5 and lower top_k_chunks to 8. One comprehensive tutorial is more useful than fragments from five different guides.

Factual queries (what year was Python 3 released): Lower everything. top_k_chunks=4 is enough. The answer is short and factual.

These adjustments can be automated by having the query rewriter classify the query type and pass it to the content processor.
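A lookup table is enough to wire that up. The sketch below uses the values suggested above; the query type labels, the SelectionParams name, the default top_k of 10, and max_per_source=2 for factual queries are all assumptions, since the text does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionParams:
    top_k: int
    max_per_source: int


# Fallback for unclassified queries (top_k=10 is an assumed default).
DEFAULT_PARAMS = SelectionParams(top_k=10, max_per_source=3)

PRESETS = {
    "comparison": SelectionParams(top_k=12, max_per_source=2),  # breadth
    "how_to": SelectionParams(top_k=8, max_per_source=5),       # depth
    "factual": SelectionParams(top_k=4, max_per_source=2),      # cap assumed
}


def params_for(query_type: str) -> SelectionParams:
    """Return selection settings for a query type, or the defaults."""
    return PRESETS.get(query_type, DEFAULT_PARAMS)


print(params_for("comparison"))
print(params_for("something else"))
```

The result can be passed straight through, for example select_top_chunks(scored, top_k=p.top_k, max_per_source=p.max_per_source).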

Performance

Processing 100+ chunks through BM25 takes 5-20ms. The bottleneck is text splitting, which involves string operations on 60,000 words. Total processing time: 50-200ms. Negligible compared to the scraping phase.
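Numbers like these vary with hardware and page content, so it is worth measuring on your own data. A minimal timing helper, with the hypothetical name timed, could look like this:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


# Stand-in workload: split a synthetic 60,000-word document.
words, ms = timed(("lorem " * 60_000).split)
print(f"split {len(words)} words in {ms:.2f} ms")
```

Wrapping each of the four steps this way shows where the time actually goes before any tuning.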

What Comes Next

Part 6 is where the LLM generates the answer. We load a quantized small model, construct the prompt from the context we just built, and stream tokens back. The challenge: making a 7B model produce well-cited, accurate answers from noisy web content.
