RAG App Fails Two Basic Questions: Chunking Bug vs Model Capacity Limits
These articles are AI-generated summaries. Please check the original sources for full details.
I Built a RAG App, Then Asked It What Car I Like. It Didn’t Know.
Developer Dogukan Karademir built Kenning, a from-scratch RAG app using Spring AI and local Ollama models. The Phase 1 pipeline failed two basic questions about its own project—one due to chunking dilution, the other due to a 3B model’s capacity limits.
Why This Matters
This debugging story exposes the gap between idealized RAG architecture and real-world behavior: two identical-looking failures had completely different root causes—one fixable by better chunking strategies, the other requiring larger models. Without systematic diagnosis, developers risk treating symptoms instead of causes in production RAG systems.
Key Insights
- Chunk dilution degrades retrieval accuracy: A single chunk covering five unrelated topics (Spring AI, Tika, async processing, OAuth2 plans, BMW) scored only 0.46 in similarity against a focused query (‘what car brand’), falling below the default similarity threshold of 0.5 (Kenning project, June 2026).
- Smaller models exhibit answer uncertainty: The llama3.2:3b model had the correct embedding model name in its retrieved context but refused to commit to it as an answer, demonstrating that sub-8B-parameter models can exhibit ‘retrieval without commitment’ even with full context (Kenning project, June 2026).
- GPU acceleration is blocked by WSL2 limitations: ROCm requires /dev/kfd exposed to Docker containers; WSL2 does not expose this device path, resulting in Ollama running at ‘100% CPU’, confirmed via ‘ollama ps’ on an AMD RX 6700 XT with 12 GB VRAM (Kenning infrastructure debug, June 2026).
- “nomic-embed-text” for embeddings and “llama3.2:3b” for chat were used as the local inference stack via Ollama on Docker Compose (pgvector/pgvector:pg16), all running CPU-only under WSL2.
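The chunk-dilution effect in the first insight can be illustrated with a toy model. The sketch below is not Kenning's code and does not use real embeddings: it assumes each of the five topics is its own orthogonal axis and that a mixed chunk embeds as the average of its topics. The names `topic_axis`, `mean_vector`, and `TOPICS` are hypothetical. Under those assumptions, a chunk about one topic matches the query perfectly, while a five-topic chunk drops below the 0.5 threshold.

```python
import math

SIMILARITY_THRESHOLD = 0.5  # the default threshold cited in the summary

# Toy assumption: every topic occupies its own orthogonal axis.
TOPICS = ["car", "spring_ai", "tika", "async", "oauth2"]


def topic_axis(name):
    """Return a unit vector pointing along the axis of a single topic."""
    return [1.0 if topic == name else 0.0 for topic in TOPICS]


def mean_vector(vectors):
    """Average several vectors component-wise (a stand-in for a mixed chunk)."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = topic_axis("car")                                      # 'what car brand'
focused_chunk = topic_axis("car")                              # chunk about one topic
diluted_chunk = mean_vector([topic_axis(t) for t in TOPICS])   # chunk mixing all five

focused_score = cosine_similarity(query, focused_chunk)
diluted_score = cosine_similarity(query, diluted_chunk)

for label, score in [("focused", focused_score), ("diluted", diluted_score)]:
    verdict = "retrieved" if score >= SIMILARITY_THRESHOLD else "filtered out"
    print(f"{label}: {score:.3f} -> {verdict}")
```

In this toy setup the diluted score is 1/sqrt(5), roughly 0.447. Real embedding models do not behave this cleanly, so the number should be read as an illustration of the direction of the effect, not as a reproduction of the 0.46 reported for Kenning.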
Practical Applications
- [Use case] Kenning document-chat tool: uploading one document and asking one question returns answers with source chunks attached, using a fully local stack. [Pitfall] Using the same entity name, Document, for both Spring AI’s chunk class and the custom file entity causes import ambiguity; the custom entity was renamed to SourceDocument to avoid confusion.
- [Use case] Debugging failed queries in RAG: treat identical-looking failures as potentially different root causes. [Pitfall] Assuming all ‘doesn’t know’ outputs are model quality issues misses chunk-dilution problems, which are solvable by smaller topic-coherent chunks rather than larger models.
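The debugging advice above can be sketched as a small triage function that separates retrieval failures from generation failures. This is one possible checklist, not the author's procedure: `FailedQuery`, `Diagnosis`, and `diagnose` are hypothetical names, and the 0.70 score in the second example is a placeholder because the summary does not report a score for that query.

```python
from dataclasses import dataclass
from enum import Enum


class Diagnosis(Enum):
    OK = "answer was correct"
    RETRIEVAL = "answer never reached the model: inspect chunking and threshold"
    GENERATION = "answer was in context but not used: inspect model and prompt"


@dataclass
class FailedQuery:
    top_score: float           # best similarity score among candidate chunks
    expected_in_context: bool  # did any chunk passed to the model contain the answer?
    expected_in_answer: bool   # did the model's reply state the answer?


def diagnose(query: FailedQuery, threshold: float = 0.5) -> Diagnosis:
    """Classify a query outcome by where the expected answer was lost."""
    if query.expected_in_answer:
        return Diagnosis.OK
    if query.top_score < threshold or not query.expected_in_context:
        return Diagnosis.RETRIEVAL
    return Diagnosis.GENERATION


# The car question: best chunk scored 0.46, below the 0.5 threshold.
car_question = FailedQuery(top_score=0.46, expected_in_context=False,
                           expected_in_answer=False)

# The embedding-model question: the context held the answer, the model hedged.
# The 0.70 score is a placeholder, not a figure from the source.
embedding_question = FailedQuery(top_score=0.70, expected_in_context=True,
                                 expected_in_answer=False)

print(diagnose(car_question).name)        # RETRIEVAL
print(diagnose(embedding_question).name)  # GENERATION
```

Checking where the expected answer dropped out before changing anything keeps the two fixes apart: smaller, topic-coherent chunks for the first case and a larger model for the second.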