RAG Pipelines and Vector Databases
The modern NLP practitioner's primary job is orchestration, not training. This section covers the paradigm shift from train-your-own to prompt-and-orchestrate, the conditions under which prompting alone is sufficient, and the failure modes that demand retrieval-augmented generation. A complete RAG pipeline is built from scratch: document ingestion with multiple chunking strategies, embedding with sentence-transformers, vector search with Qdrant, context injection with prompt engineering, and generation with the OpenAI API. Evaluation methodology covers retrieval precision and recall, answer faithfulness, and the common failures — wrong chunks retrieved, hallucination despite context, and the lost-in-the-middle problem — that kill RAG systems in production.
7.1 — The Shift from Training to Prompting
Until 2022, building an NLP system meant building a model. You collected labeled data, chose an architecture, trained it, evaluated it, and deployed it. The model was the system. If you needed to extract entities from legal contracts, you trained an NER model on annotated legal text. If you needed to classify support tickets, you fine-tuned BERT on your ticket corpus. Every new task required a new training pipeline.
Foundation models inverted this workflow. A single model — GPT-4, Claude, Llama 3 — handles classification, extraction, summarization, translation, and generation without task-specific training. The interface is a text prompt. The “model selection” step collapses to choosing a provider and writing instructions.
This is not a theoretical observation. It has concrete implications for how you spend your time. As a rough illustration, a data scientist in 2020 might have spent 70% of NLP project time on data labeling, feature engineering, and model training. A data scientist in 2025 is more likely to spend that 70% on prompt design, context engineering, and evaluation. The skills transferred (rigor, systematic evaluation, understanding failure modes), but the activities changed entirely.
When Prompting Is Enough
Prompting a foundation model is sufficient — and often optimal — when these conditions hold:
- The task has a clear specification. Classification labels are well-defined. Extraction fields have unambiguous boundaries. Summarization has explicit length and focus constraints.
- The required knowledge is general. The model’s training data covers the domain. You are not asking about proprietary data, internal processes, or real-time events.
- Consistency requirements are moderate. You can tolerate minor variation in formatting or phrasing between runs. Exact reproducibility is not required.
- Volume is manageable. At $0.50–15 per million input tokens and roughly 1,000 tokens per document, processing 10,000 documents costs $5–150 in input tokens. Processing 10 million costs $5,000–150,000. The economics flip fast.
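The arithmetic behind the volume figures can be sketched in a few lines. The 1,000-token average document length is an assumption (it is the length at which the quoted prices yield $5–150 for 10,000 documents), and `estimate_input_cost` is a hypothetical helper, not part of any SDK. Output tokens are ignored.

```python
def estimate_input_cost(
    n_documents: int,
    tokens_per_document: int,
    price_per_million_tokens: float,
) -> float:
    """Return the input-token cost in USD for processing a corpus once."""
    total_tokens = n_documents * tokens_per_document
    return total_tokens / 1_000_000 * price_per_million_tokens


# Assumed average length: 1,000 tokens per document.
for n_docs in (10_000, 10_000_000):
    low = estimate_input_cost(n_docs, 1_000, 0.50)
    high = estimate_input_cost(n_docs, 1_000, 15.00)
    print(f"{n_docs:>10,} docs: ${low:,.0f} to ${high:,.0f}")
```

Plug in your own average document length and your provider's current price before relying on the result; both vary widely.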
Here is a concrete example: structured extraction from unstructured text. Given a job posting, extract the title, company, location, salary range, and required skills into a typed Python object.
from openai import OpenAI
from pydantic import BaseModel, Field


class JobPosting(BaseModel):
    """Structured representation of a job posting."""

    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location or 'Remote'")
    salary_min: int | None = Field(description="Minimum salary in USD, null if not stated")
    salary_max: int | None = Field(description="Maximum salary in USD, null if not stated")
    skills: list[str] = Field(description="Required technical skills")


def extract_job_posting(text: str) -> JobPosting:
    """Extract structured job data from free-text posting."""
    client = OpenAI()  # Reads OPENAI_API_KEY from the environment
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract structured job posting information. "
                    "Return only the fields specified. "
                    "If a field is not mentioned, use null."
                ),
            },
            {"role": "user", "content": text},
        ],
        # The SDK converts the Pydantic model into a JSON schema for the API
        response_format=JobPosting,
    )
    parsed = completion.choices[0].message.parsed
    if parsed is None:
        # The model refused or returned nothing that fits the schema
        raise ValueError("No structured output returned")
    return parsed


# Usage
raw_posting = """
Senior ML Engineer at Acme Corp - San Francisco or Remote
$180,000 - $240,000 base + equity
Requirements: Python, PyTorch, MLOps, Kubernetes, 5+ years experience.
"""
result: JobPosting = extract_job_posting(raw_posting)
print(result.model_dump_json(indent=2))
# Example output (whitespace condensed):
# {"title": "Senior ML Engineer", "company": "Acme Corp",
#  "location": "San Francisco or Remote", "salary_min": 180000,
#  "salary_max": 240000, "skills": ["Python", "PyTorch", "MLOps", "Kubernetes"]}
This takes about 15 lines of business logic and no training data, and costs roughly $0.001 per extraction. A custom NER model for the same task would require 1,000+ annotated examples, a training pipeline, and weeks of development. The prompting approach wins on development time and flexibility; its main weakness is per-query cost at very high volume.
When Prompting Fails
Prompting hits a wall in three situations:
Domain-specific knowledge. A foundation model does not know your company’s internal product taxonomy, your proprietary research data, or any information created after its training cutoff. Asking GPT-4 to answer questions about your internal documentation produces confident, wrong answers — hallucinations dressed up as expertise.
Factual precision. Foundation models are probabilistic text generators. They approximate truth; they do not guarantee it. For tasks where a single wrong fact has consequences (medical diagnosis, legal analysis, financial reporting), prompting alone is insufficient. You need the model’s output grounded in verified source material.
Behavioral consistency. Prompts are fragile. A model that correctly classifies 95% of your test cases today may drop to 88% after a provider update. You have no control over model versioning, weight changes, or system prompt modifications on the provider side. If you need guaranteed behavior, you need a model you control.
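When either of the first two conditions applies, you need retrieval-augmented generation: give the model access to your data at query time so its output is grounded in facts you control. Behavioral consistency is a separate problem that retrieval does not solve; it calls for a model version you can pin or host yourself.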
7.2 — RAG: Retrieval-Augmented Generation
RAG is not a model — it is an architecture. The idea is straightforward: instead of asking the model to answer from memory, you retrieve relevant documents from your own data store and include them in the prompt. The model generates an answer grounded in the retrieved context rather than its parametric knowledge.
The pipeline has five stages:
- Ingest: Load documents and split them into chunks
- Embed: Convert chunks to dense vectors using an embedding model
- Index: Store vectors in a vector database for efficient similarity search
- Retrieve: Given a query, embed it and find the most similar chunks
- Generate: Construct a prompt with the retrieved chunks and send it to an LLM
Each stage has decisions that affect the entire system’s performance. The most impactful — and most underestimated — is chunking.
Chunking Strategies
Most documents are too long to embed as a single meaningful vector or to fit into a prompt alongside other context. You must break them into chunks that are small enough to embed meaningfully and large enough to preserve context. Three strategies, ordered by complexity:
Fixed-size chunking. Split every N characters or tokens with an overlap window. Fast, deterministic, and often good enough.
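def fixed_size_chunks(
    text: str, chunk_size: int = 512, overlap: int = 64
) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        # An overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid a trailing chunk made only of overlap
        start = end - overlap  # Step back so adjacent chunks share context
    return chunks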
Sentence-based chunking. Accumulate sentences until you hit a token budget. Respects natural boundaries but can produce wildly varying chunk sizes.
import re


def sentence_chunks(
    text: str, max_tokens: int = 256
) -> list[str]:
    """Split text into chunks that respect sentence boundaries."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences: list[str] = re.split(r'(?<=[.!?])\s+', text)
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        # Whitespace word count is a rough proxy for the real token count
        token_estimate = len(sentence.split())
        if current_len + token_estimate > max_tokens and current:
            chunks.append(" ".join(current))
            current = []
            current_len = 0
        current.append(sentence)
        current_len += token_estimate
    if current:
        chunks.append(" ".join(current))
    return chunks
Semantic chunking. Embed each sentence, then split where the embedding similarity between adjacent sentences drops below a threshold. This produces chunks that are semantically coherent but requires an embedding model at ingestion time.
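A minimal sketch of semantic chunking, under stated assumptions: `semantic_chunks` is a hypothetical helper, the `embed` argument is any function that maps a list of sentences to a list of vectors, and the 0.5 threshold is an arbitrary starting point. The `toy_embed` function below only counts two keywords so the example runs without a model; it is not a real embedder.

```python
import math
import re
from collections.abc import Callable

Vector = list[float]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    text: str,
    embed: Callable[[list[str]], list[Vector]],
    threshold: float = 0.5,
) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = embed(sentences)
    chunks: list[str] = []
    current = [sentences[0]]
    # Compare each sentence with the one immediately before it.
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if _cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


def toy_embed(sentences: list[str]) -> list[Vector]:
    """Stand-in embedder: counts of two topic keywords. Not a real model."""
    return [
        [float(s.lower().count("cache")), float(s.lower().count("billing"))]
        for s in sentences
    ]


text = (
    "The cache stores recent results. Cache entries expire after an hour. "
    "Billing runs monthly. Billing disputes go to support."
)
for chunk in semantic_chunks(text, toy_embed):
    print(chunk)
```

With sentence-transformers, you would pass something like `lambda s: encoder.encode(s).tolist()` as `embed`. In practice you would likely also cap the chunk length, since a long run of similar sentences otherwise produces one oversized chunk.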
The choice matters more than most practitioners realize. Chunks that are too small (under 100 tokens) lose context — a sentence fragment about “the treatment protocol” means nothing without knowing which treatment. Chunks that are too large (over 1,000 tokens) dilute relevance — a 2,000-token chunk that contains one relevant paragraph and nine irrelevant ones will score as partially relevant, pushing truly relevant chunks down the ranking.
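For most production systems, start with sentence-based chunking at 200–400 tokens with 2–3 sentence overlap. Check that budget against your embedding model’s maximum sequence length: all-MiniLM-L6-v2, for example, truncates input beyond 256 word-piece tokens, so anything past that point is silently ignored. Tune from there based on retrieval quality metrics.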
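Sentence overlap can be added to sentence-based chunking with a small change: when a chunk is flushed, carry its last few sentences into the next one. The sketch below is one way to do that; `sentence_chunks_with_overlap` and its `overlap_sentences` parameter are hypothetical names, and the whitespace word count is only a rough stand-in for a real tokenizer.

```python
import re


def sentence_chunks_with_overlap(
    text: str,
    max_tokens: int = 300,
    overlap_sentences: int = 2,
) -> list[str]:
    """Sentence-based chunks where each chunk repeats the tail of the previous one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # rough estimate, not a tokenizer count
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last few sentences forward as shared context.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current_len = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in sentence_chunks_with_overlap(demo, max_tokens=6, overlap_sentences=1):
    print(chunk)
```

Because the carried sentences count toward the budget, a chunk can exceed `max_tokens` when the overlap plus the next sentence is longer than the limit. Keep the overlap small relative to the chunk size.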
The Complete Pipeline
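Here is a complete RAG pipeline using Qdrant as the vector store and sentence-transformers for embeddings. The architecture is the one you would run in production, although batching, retries, and error handling are left out to keep the listing readable. It assumes a Qdrant instance at localhost:6333 and an OPENAI_API_KEY in the environment.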
import uuid
from dataclasses import dataclass, field

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    FieldCondition,
    Filter,
    MatchValue,
    PointStruct,
    VectorParams,
)
from sentence_transformers import SentenceTransformer


@dataclass
class Document:
    """A document with metadata for the RAG pipeline."""

    content: str
    source: str
    category: str
    chunk_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class RAGPipeline:
    """End-to-end RAG pipeline: ingest, embed, retrieve, generate."""

    def __init__(
        self,
        collection_name: str = "documents",
        embedding_model: str = "all-MiniLM-L6-v2",
        qdrant_url: str = "http://localhost:6333",
    ) -> None:
        self.encoder = SentenceTransformer(embedding_model)
        self.embedding_dim: int = self.encoder.get_sentence_embedding_dimension()
        self.qdrant = QdrantClient(url=qdrant_url)
        self.collection_name = collection_name
        self.openai = OpenAI()
        self._ensure_collection()

    def _ensure_collection(self) -> None:
        """Create the vector collection if it does not exist."""
        collections = [c.name for c in self.qdrant.get_collections().collections]
        if self.collection_name not in collections:
            self.qdrant.create_collection(
                collection_name=self.collection_name,
                vectors_config=VectorParams(
                    size=self.embedding_dim,
                    distance=Distance.COSINE,
                ),
            )

    def ingest(self, documents: list[Document]) -> int:
        """Embed and store documents in the vector database."""
        texts = [doc.content for doc in documents]
        embeddings = self.encoder.encode(texts, show_progress_bar=True)
        points = [
            PointStruct(
                # Qdrant point IDs must be unsigned integers or UUIDs, so derive
                # a deterministic UUID from chunk_id. Re-ingesting the same
                # chunk_id then overwrites the point instead of duplicating it.
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, doc.chunk_id)),
                vector=embedding.tolist(),
                payload={
                    "content": doc.content,
                    "source": doc.source,
                    "category": doc.category,
                    "chunk_id": doc.chunk_id,
                },
            )
            for doc, embedding in zip(documents, embeddings, strict=True)
        ]
        self.qdrant.upsert(
            collection_name=self.collection_name,
            points=points,
        )
        return len(points)

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        category_filter: str | None = None,
    ) -> list[dict]:
        """Retrieve the most relevant chunks for a query."""
        query_vector = self.encoder.encode(query).tolist()
        # Optional metadata filter: restrict the search to one category
        search_filter = None
        if category_filter:
            search_filter = Filter(
                must=[
                    FieldCondition(
                        key="category",
                        match=MatchValue(value=category_filter),
                    )
                ]
            )
        results = self.qdrant.query_points(
            collection_name=self.collection_name,
            query=query_vector,
            limit=top_k,
            query_filter=search_filter,
        ).points
        return [
            {
                "content": hit.payload["content"],
                "source": hit.payload["source"],
                "score": hit.score,
            }
            for hit in results
        ]

    def generate(
        self,
        query: str,
        top_k: int = 5,
        category_filter: str | None = None,
        model: str = "gpt-4o-mini",
    ) -> dict:
        """Retrieve context and generate an answer."""
        retrieved = self.retrieve(query, top_k, category_filter)

        # Build context block from retrieved chunks
        context_parts: list[str] = []
        for i, chunk in enumerate(retrieved, 1):
            context_parts.append(
                f"[Source {i}: {chunk['source']} | Relevance: {chunk['score']:.3f}]\n"
                f"{chunk['content']}"
            )
        context_block = "\n\n---\n\n".join(context_parts)

        # Generate with grounding instructions
        completion = self.openai.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Answer the user's question using ONLY the provided context. "
                        "If the context does not contain enough information, say so explicitly. "
                        "Cite sources by number [Source N] when making claims. "
                        "Do not use prior knowledge."
                    ),
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context_block}\n\nQuestion: {query}",
                },
            ],
            temperature=0.1,
        )
        return {
            "answer": completion.choices[0].message.content,
            "sources": retrieved,
            "model": model,
        }
Using the Pipeline
# Ingest documents
pipeline = RAGPipeline(collection_name="tech_docs")
raw_texts: list[str] = load_your_documents()  # Placeholder: your document loading logic
chunks: list[str] = []
for text in raw_texts:
    # sentence_chunks counts words, and all-MiniLM-L6-v2 truncates input past
    # 256 word pieces, so keep the word budget comfortably below that limit.
    chunks.extend(sentence_chunks(text, max_tokens=180))
documents = [
    Document(content=chunk, source=f"doc_{i}", category="engineering")
    for i, chunk in enumerate(chunks)
]
n_indexed = pipeline.ingest(documents)
print(f"Indexed {n_indexed} chunks")

# Query
result = pipeline.generate(
    query="How does the retry mechanism handle timeout errors?",
    category_filter="engineering",
)
print(result["answer"])
for src in result["sources"]:
    print(f"  - {src['source']} (score: {src['score']:.3f})")
Evaluating RAG Systems
A RAG pipeline has two components to evaluate independently: retrieval quality and generation quality. Conflating them is the most common evaluation mistake.
Retrieval evaluation. Given a query and a set of known-relevant chunks, measure:
- Precision@k: What fraction of the top-k retrieved chunks are relevant? Low precision means the model is reading irrelevant context.
- Recall@k: What fraction of all relevant chunks appear in the top-k? Low recall means the model is missing critical information.
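Both retrieval metrics reduce to set arithmetic over chunk IDs. The sketch below assumes each evaluation query comes with a ranked list of retrieved IDs and a hand-annotated set of relevant IDs; the function names and the example IDs are illustrative. Precision is divided by the number of chunks actually returned, which differs from dividing by k only when fewer than k chunks come back.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# One annotated query: ranked retrieval output and the known-relevant chunks.
retrieved_ids = ["c1", "c7", "c3", "c9", "c4"]
relevant_ids = {"c1", "c3", "c5"}

print(f"precision@5 = {precision_at_k(retrieved_ids, relevant_ids, 5):.2f}")
print(f"recall@5    = {recall_at_k(retrieved_ids, relevant_ids, 5):.2f}")
```

Here two of the five retrieved chunks are relevant, and two of the three relevant chunks were found. Average both numbers over the whole evaluation set rather than reading them per query.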
Generation evaluation. Given retrieved context and a generated answer, measure:
- Faithfulness: Does the answer only make claims supported by the retrieved context? Unfaithful answers indicate hallucination.
- Relevance: Does the answer address the question? High faithfulness but low relevance means the model is summarizing context instead of answering.
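Faithfulness ultimately needs a human or an LLM judge, but a cheap structural check catches some failures first. The sketch below assumes answers use the `[Source N]` citation format and that each citation sits before the sentence's closing punctuation. It flags citations that point at sources that were never provided, and sentences that carry no citation at all. `check_citations` is a hypothetical helper, and passing this check does not show that a claim is actually supported by its source.

```python
import re

CITATION = re.compile(r"\[Source (\d+)\]")


def check_citations(answer: str, n_sources: int) -> dict[str, list]:
    """Flag citations to nonexistent sources and sentences with no citation."""
    cited = sorted({int(n) for n in CITATION.findall(answer)})
    # Valid source numbers run from 1 to n_sources inclusive.
    invalid = [n for n in cited if not 1 <= n <= n_sources]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    uncited = [s for s in sentences if not CITATION.search(s)]
    return {"cited": cited, "invalid": invalid, "uncited_sentences": uncited}


answer = (
    "Retries use exponential backoff [Source 1]. "
    "The default timeout is 30 seconds [Source 4]. "
    "This is widely considered best practice."
)
report = check_citations(answer, n_sources=3)
print(report)
```

In this example, `[Source 4]` is flagged because only three sources were supplied, and the final sentence is flagged as uncited. Both are candidates for manual review.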
Build a small evaluation set — 50–100 query/answer pairs with annotated relevant chunks — before tuning any parameter. Without this, you are navigating blind.
Common RAG Failures
Three failure modes kill RAG pipelines in production. Recognizing them is half the fix.
Wrong chunks retrieved. The embedding model’s notion of “similarity” does not match your task’s notion of “relevance.” A query about “Python memory management” retrieves chunks about “Python snake habitat management” because the word “Python” dominates the embedding. Fix: use domain-specific embedding models, add metadata filtering, or prepend category tags to chunks before embedding.
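Prepending a category tag is the simplest of these fixes to try. A minimal sketch follows; the bracketed tag format and the `with_category_tag` helper are assumptions, and whether the tag actually improves ranking depends on the embedding model, so measure it against your evaluation set.

```python
def with_category_tag(content: str, category: str) -> str:
    """Return the text to embed: the chunk prefixed with its category."""
    return f"[{category}] {content}"


chunk = "Reference counting frees objects as soon as nothing points to them."
text_to_embed = with_category_tag(chunk, "programming")

# Embed text_to_embed, but store the untagged chunk in the payload so the
# tag never appears in the context shown to the LLM.
print(text_to_embed)
```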
Hallucination despite context. The model has the right context in the prompt but ignores it in favor of its parametric knowledge. This happens most often when the context contradicts the model’s training data. Fix: use stronger grounding instructions, lower the temperature to 0.0–0.1, and explicitly instruct the model to quote from the context rather than paraphrase.
Lost-in-the-middle. Research on long-context models (the “lost in the middle” effect reported by Liu et al., 2023) shows that LLMs use information most reliably when it sits at the beginning or end of the context window. Relevant information buried in the middle of a long context block is used far less reliably. Fix: limit context to 3–5 high-relevance chunks rather than dumping 20 chunks into the prompt. Order chunks by relevance with the most relevant first. More context is not always better context.
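The context-limiting fix can be sketched as a small filter applied between retrieval and prompt construction. It assumes chunks shaped like the pipeline's retrieval output (dicts with a `score` key). `select_context` is a hypothetical helper, and the 0.3 score floor is an arbitrary value to tune against your own evaluation set.

```python
def select_context(
    chunks: list[dict],
    max_chunks: int = 5,
    min_score: float = 0.3,
) -> list[dict]:
    """Keep the highest-scoring chunks above a floor, most relevant first."""
    kept = [c for c in chunks if c["score"] >= min_score]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:max_chunks]


retrieved = [
    {"source": "doc_4", "score": 0.41},
    {"source": "doc_1", "score": 0.82},
    {"source": "doc_9", "score": 0.12},
    {"source": "doc_2", "score": 0.67},
]
for chunk in select_context(retrieved, max_chunks=3):
    print(chunk["source"], chunk["score"])
```

Dropping low-scoring chunks outright, rather than only reordering them, also addresses the precision problem: a chunk the model never sees cannot distract it.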