RAG App Fails Two Basic Questions: Chunking Bug vs Model Capacity Limits

These articles are AI-generated summaries. Please check the original sources for full details.

I Built a RAG App, Then Asked It What Car I Like. It Didn’t Know.

Developer Dogukan Karademir built Kenning, a from-scratch RAG app using Spring AI and local Ollama models. The Phase 1 pipeline failed two basic questions about its own project: one because of chunk dilution, the other because of a 3B model’s capacity limits.

Why This Matters

This debugging story exposes the gap between idealized RAG architecture and real-world behavior: two identical-looking failures had completely different root causes. One was fixable with a better chunking strategy; the other called for a larger model. Without systematic diagnosis, developers risk treating symptoms instead of causes in production RAG systems.

Key Insights

  • Chunk dilution degrades retrieval accuracy: A single chunk covering five unrelated topics (Spring AI, Tika, async processing, OAuth2 plans, BMW) scored only 0.46 in similarity against a focused query (‘what car brand’), which fell below the default similarity threshold of 0.5 (Kenning project, June 2026).
  • Smaller models exhibit answer uncertainty: The correct embedding model name was retrieved and present in the context, but llama3.2:3b refused to commit to it as an answer. This suggests that sub-8B models can exhibit ‘retrieval without commitment’ even with full context (Kenning project, June 2026).
  • GPU acceleration is blocked by WSL2 limitations: ROCm requires /dev/kfd to be exposed to Docker containers, and WSL2 does not expose this device path. As a result, Ollama ran at ‘100% CPU’ (confirmed via ollama ps) on an AMD RX 6700 XT with 12 GB VRAM (Kenning infrastructure debug, June 2026).
  • Local inference stack: nomic-embed-text handled embeddings and llama3.2:3b handled chat, both served through Ollama in a Docker Compose stack (with pgvector/pgvector:pg16), all running CPU-only under WSL2.
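
The chunk-dilution effect in the first insight can be illustrated with a toy model. The sketch below is not Kenning's pipeline: it swaps the real embedding model (nomic-embed-text) for a simple bag-of-words cosine similarity, and the sample chunk sentences are invented for illustration. The scores it produces will not match the 0.46 reported in the article; only the direction of the effect carries over, namely that unrelated topics in a chunk pull its similarity to a focused query down.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample text; the article only names the five topics.
QUERY = "what car brand"
FOCUSED_CHUNK = "Favorite car brand: BMW."
DILUTED_CHUNK = (
    "The backend uses Spring AI. Files are parsed with Tika. "
    "Uploads use async processing. OAuth2 login is planned. "
    "Favorite car brand: BMW."
)
THRESHOLD = 0.5  # same value as the default threshold mentioned above


def retrieve(query: str, chunks: list[str], threshold: float = THRESHOLD):
    """Return (score, chunk) pairs whose score meets the threshold."""
    q = bag_of_words(query)
    scored = [(cosine_similarity(q, bag_of_words(c)), c) for c in chunks]
    return [(score, chunk) for score, chunk in scored if score >= threshold]


if __name__ == "__main__":
    for name, chunk in [("focused", FOCUSED_CHUNK), ("diluted", DILUTED_CHUNK)]:
        score = cosine_similarity(bag_of_words(QUERY), bag_of_words(chunk))
        print(f"{name}: {score:.3f} (passes threshold: {score >= THRESHOLD})")
```

In this toy setup both chunks contain the same answer sentence, yet only the focused chunk clears the threshold. That is the argument for splitting documents into smaller, topic-coherent chunks rather than lowering the threshold globally.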

Practical Applications

  • [Use case] Kenning document-chat tool: uploading a document and asking a question returns an answer with its source chunks attached, all on a fully local stack. [Pitfall] Naming the custom file entity Document, the same as Spring AI’s chunk class, causes import ambiguity; the entity was renamed to SourceDocument to avoid confusion.
  • [Use case] Debugging failed queries in RAG: treat identical-looking failures as potentially having different root causes. [Pitfall] Assuming that every ‘doesn’t know’ output is a model-quality issue overlooks chunk-dilution problems, which are solved by smaller, topic-coherent chunks rather than by larger models.
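
One way to make this diagnosis systematic is to record, for each failed query, the retrieval scores alongside the final answer, then check where the expected fact dropped out. The following is a minimal sketch under stated assumptions: `QueryTrace`, `FailureKind`, and `classify` are hypothetical names, not part of Kenning or Spring AI, and the substring match is a crude heuristic. The 0.46 score and 0.5 threshold come from the article; the answer strings and the 0.70 score are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    OK = "answer contains the expected fact"
    RETRIEVAL = "fact never reached the model: revisit chunking or threshold"
    GENERATION = "fact was in context but not in the answer: revisit model or prompt"


@dataclass
class QueryTrace:
    """Hypothetical record of one RAG query, captured for debugging."""
    scored_chunks: list[tuple[float, str]]  # (similarity score, chunk text)
    answer: str
    expected_fact: str


def classify(trace: QueryTrace, threshold: float = 0.5) -> FailureKind:
    """Locate the stage at which the expected fact was lost."""
    fact = trace.expected_fact.lower()
    if fact in trace.answer.lower():
        return FailureKind.OK
    # Only chunks at or above the threshold are passed to the chat model.
    context = [text for score, text in trace.scored_chunks if score >= threshold]
    if not any(fact in text.lower() for text in context):
        return FailureKind.RETRIEVAL
    return FailureKind.GENERATION


# The car question: the only relevant chunk scored 0.46, below the threshold.
car_trace = QueryTrace(
    scored_chunks=[(0.46, "Spring AI, Tika, async processing, OAuth2 plans, BMW")],
    answer="I don't know which car brand you like.",  # invented wording
    expected_fact="BMW",
)

# The embedding-model question: the fact was retrieved, but the model hedged.
embedding_trace = QueryTrace(
    scored_chunks=[(0.70, "Embeddings are generated with nomic-embed-text.")],
    answer="I cannot say for certain which embedding model is used.",
    expected_fact="nomic-embed-text",
)

if __name__ == "__main__":
    for name, trace in [("car", car_trace), ("embedding", embedding_trace)]:
        kind = classify(trace)
        print(f"{name}: {kind.name} ({kind.value})")
```

Both traces look the same from the user's side, since neither answer contains the fact, but the classification differs: the first points at chunking, the second at the model.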
