NadirClaw: Building Cost-Aware LLM Routing with Local Prompt Classification
These articles are AI-generated summaries. Please check the original sources for full details.
How to Build a Cost-Aware LLM Routing System with NadirClaw Using Local Prompt Classification and Gemini Model Switching
NadirClaw implements an intelligent routing layer that classifies prompts locally before sending them to the most suitable model tier. By utilizing centroid vectors and a local encoder, the system avoids unnecessary high-cost model calls for simple tasks. In live tests, this configuration demonstrated significant cost savings compared to an always-Pro model baseline.
Why This Matters
The technical reality of deploying LLMs involves a constant trade-off between reasoning capability and operational cost. Many production systems default to high-parameter models for every request, which leads to significant financial waste on low-complexity tasks like basic formatting or simple arithmetic. NadirClaw addresses this by introducing a local classification step that ensures only high-complexity reasoning tasks consume expensive ‘Pro’ tier tokens.
By moving the routing decision to a local proxy, developers can maintain a single ‘auto’ model endpoint while benefiting from the speed of lightweight models and the depth of larger ones. This architectural pattern is essential for scaling agentic systems where thousands of intermediate steps may not require full reasoning capabilities, thereby optimizing both latency and budget.
Key Insights
- Local classification via NadirClaw CLI uses JSON output to return routing tier, score, and confidence without making live LLM calls.
- The system utilizes the all-MiniLM-L6-v2 encoder from Sentence-Transformers to generate embeddings for local similarity checks.
- Routing decisions are determined by comparing prompt embeddings against simple_centroid.npy and complex_centroid.npy vectors.
- A default confidence threshold of 0.06 is applied; prompts falling below this threshold are automatically escalated to the complex tier.
- NadirClaw supports modifier-marker scans, identifying ‘agentic’, ‘reasoning’, or ‘vision’ requests based on text markers or request shape.
- Live routing through a local proxy server allows for OpenAI-compatible requests to be dynamically mapped to models like gemini-2.5-flash and gemini-2.5-pro.
Working Examples
Function to locally classify prompts into tiers using the NadirClaw CLI.
import subprocess, json
def classify(prompt: str) -> dict:
r = subprocess.run(
["nadirclaw", "classify", "--format", "json", prompt],
capture_output=True, text=True, timeout=180,
)
if r.returncode != 0:
return {"prompt": prompt, "error": (r.stderr or r.stdout).strip()}
return json.loads(r.stdout.strip())
prompts = ["What is 2+2?", "Refactor the auth module to use dependency injection"]
results = [classify(p) for p in prompts]
Starting the NadirClaw proxy server to handle live model routing.
import os, subprocess
PORT = 8856
env = os.environ.copy()
env.update({
"GEMINI_API_KEY": "YOUR_KEY_HERE",
"NADIRCLAW_SIMPLE_MODEL": "gemini-2.5-flash",
"NADIRCLAW_COMPLEX_MODEL": "gemini-2.5-pro",
"NADIRCLAW_PORT": str(PORT),
})
server_proc = subprocess.Popen(
["nadirclaw", "serve", "--verbose"],
env=env
)
Practical Applications
- Enterprise Chatbots: Routing basic FAQs to Gemini Flash while reserving Gemini Pro for complex architectural or legal inquiries. Pitfall: Using an overly high confidence threshold may cause complex edge cases to fail on simple models.
- Coding Assistants: Detecting ‘agentic’ markers in prompts to ensure code execution tasks are always routed to high-reasoning models. Pitfall: Incorrectly configured environment variables can lead to proxy startup failures, defaulting to a single model.
- Cost Monitoring: Utilizing ‘nadirclaw report’ to analyze JSONL request logs and estimate savings against a fixed-model baseline. Pitfall: Ignoring the request shape (e.g., vision/tools) might bypass the intended routing logic.
References:
Continue reading
Next article
Lindy: A Rust-Powered Tool for One-Click Linux Dual-Boot Folder Access
Related Content
Google DeepMind Unveils Gemini-Powered AI Mouse Pointer for Context-Aware Computing
Google DeepMind introduces an AI-enabled mouse pointer powered by Gemini that captures visual and semantic context directly at the cursor for streamlined workflows.
Why Your AGENTS.md Files are Sabotaging AI Coding Performance
ETH Zurich study reveals that auto-generated AGENTS.md files can decrease AI agent success rates by 3% while increasing inference costs by 20%.
Google AI Releases Android Bench: Specialized Evaluation for Mobile LLMs
Google AI releases Android Bench, an open-source framework where Gemini 3.1 Pro Preview achieved a top 72.4% success rate on real-world Android tasks.