Benchmarking Local LLMs: Qwen3 vs Qwen3.5 in Agentic Coding Workflows
These articles are AI-generated summaries. Please check the original sources for full details.
I Pushed Local LLMs Harder. Here’s What Two Models Actually Did.
Donald Cruver deployed Qwen3-Coder-Next and Qwen3.5-35B-A3B on dual AMD MI60 GPUs to build a complex health-correlation CLI tool. The experiment demonstrated that while 32k context windows fail, 131k windows allow local agents to autonomously fix bugs via mandatory test-fix loops.
Why This Matters
In theory, agentic coding models can scaffold entire applications, but local hardware runs into the “context snowball” problem: reprocessing the conversation history consumes substantial compute before any new tokens are generated. This gap between cloud-scale attention caching and local inference means that architectural choices such as subagent delegation and context windows of 128k tokens or more (131,072 in this experiment) become necessary for production-grade local development. Moving from a tensor-parallel setup to independent GPU instances can reduce communication overhead, though the lack of attention caching in local inference remains a performance bottleneck.
Key Insights
- Context overflow at 32,768 tokens: Qwen3-Coder-Next hit a hard wall during CLI module generation, requiring a bump to 65,536 tokens.
- Mandatory Test-Fix Loop: Subagents successfully diagnosed and fixed five bugs in the data layer by iteratively running pytest until all tests passed.
- Tool-Calling Parser Requirements: Qwen3-Coder-Next requires the --tool-call-parser qwen3_coder flag in vLLM to prevent silent tool-call failures.
- Independent GPU Instances: Running two llama-server instances on separate MI60 GPUs outperformed tensor-parallel mode for sequential agentic tasks by reducing inter-GPU overhead.
- Context Snowballing: Local inference lacks the attention caching of cloud providers, causing GPUs to peg at 100% while reprocessing entire conversation histories.
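The context-snowball effect described in the last bullet can be made concrete with a little arithmetic. The sketch below is illustrative only: the turn count and tokens per turn are made-up numbers, not measurements from the experiment, and `reprocessed_tokens` is a hypothetical helper. It compares the prompt tokens processed when the whole history is re-read on every turn against an idealized case where earlier turns are cached.

```python
def reprocessed_tokens(turn_sizes, cache_prefix=False):
    """Total prompt tokens processed across a conversation.

    turn_sizes: tokens appended to the history at each turn (illustrative values).
    cache_prefix: if True, assume earlier turns are cached and only the newly
    added tokens are processed; if False, the full history is re-read each turn.
    """
    total = 0
    history = 0
    for added in turn_sizes:
        history += added
        total += added if cache_prefix else history
    return total


turns = [2_000] * 20  # 20 turns of 2,000 new tokens each (assumed numbers)
no_cache = reprocessed_tokens(turns)
with_cache = reprocessed_tokens(turns, cache_prefix=True)
print(no_cache, with_cache)  # 420000 40000
```

Without caching the work grows roughly with the square of the number of turns, which is why a long agentic session can keep a GPU busy on prompt processing alone.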
Working Examples
vLLM configuration for Qwen3-Coder-Next with mandatory tool-call parser flag.
vllm serve cyankiwi/Qwen3-Coder-Next-AWQ-4bit --tensor-parallel-size 2 --max-model-len 65536 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser qwen3_coder
llama.cpp server configuration for Qwen3.5-35B-A3B with 131k context and Jinja template support.
llama-server --ctx-size 131072 --flash-attn on --jinja --reasoning-budget -1
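The independent-instance setup mentioned under Key Insights could be sketched as follows. This is a minimal sketch under stated assumptions, not the author's actual launch script: pinning each process with the ROCm `HIP_VISIBLE_DEVICES` environment variable is an assumption, and the model path, port numbers, and the helper name `llama_server_launch` are placeholders. The code only builds the commands; it does not start any server.

```python
import os


def llama_server_launch(gpu_index, port, model_path, ctx_size=131072):
    """Return (argv, env) for one llama-server instance pinned to one GPU."""
    env = dict(os.environ)
    # Assumption: restrict the process to a single AMD GPU via ROCm's env var.
    env["HIP_VISIBLE_DEVICES"] = str(gpu_index)
    argv = [
        "llama-server",
        "--model", model_path,        # placeholder path, not from the source
        "--ctx-size", str(ctx_size),
        "--flash-attn", "on",
        "--jinja",
        "--port", str(port),          # one port per instance (assumed values)
    ]
    return argv, env


launches = [
    llama_server_launch(gpu, 8080 + gpu, "/models/placeholder.gguf")
    for gpu in (0, 1)
]
for argv, env in launches:
    print("HIP_VISIBLE_DEVICES=" + env["HIP_VISIBLE_DEVICES"], " ".join(argv))
    # To actually start an instance: subprocess.Popen(argv, env=env)
```

Each instance then serves requests on its own port, so sequential agent calls never cross the inter-GPU link.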
Practical Applications
- Use case: Multi-module Python CLI development using the subagent pattern to isolate context windows and maintain architectural coherence.
- Pitfall: Omitting the --jinja or --tool-call-parser flags, which causes LLM tool calls to fail silently.
- Use case: Automated bug fixing in local environments by implementing a mandatory ‘run-fail-fix’ loop in the agent’s markdown prompt files.
- Pitfall: Running unattended agentic sessions without --dangerously-skip-permissions, causing the process to stall indefinitely on hidden TUI prompts over SSH.
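The ‘run-fail-fix’ loop from the list above is enforced through prompt files in the original experiment. The same control flow can be sketched in Python for illustration. Everything here is a hypothetical stand-in: `run_fix_loop`, `request_fix`, and the round budget are not part of any agent framework, and `run_pytest` simply shells out to pytest and checks its exit code.

```python
import subprocess
import sys


def run_pytest(test_dir):
    """Run pytest on test_dir and return (passed, combined output)."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", test_dir],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def run_fix_loop(run_tests, request_fix, max_rounds=5):
    """Alternate between running tests and requesting a fix until tests pass.

    run_tests: callable returning (passed, output).
    request_fix: hypothetical callable that hands the failure output to the agent.
    Returns the number of fix rounds used; raises if the budget runs out.
    """
    for round_number in range(max_rounds + 1):
        passed, output = run_tests()
        if passed:
            return round_number
        if round_number == max_rounds:
            break
        request_fix(output)
    raise RuntimeError(f"tests still failing after {max_rounds} fix rounds")


# Stand-in for a real agent: the "tests" pass once two bugs have been fixed.
state = {"bugs": 2}


def fake_tests():
    return state["bugs"] == 0, f"{state['bugs']} failing"


def fake_fix(output):
    state["bugs"] -= 1


rounds = run_fix_loop(fake_tests, fake_fix)
print(rounds)  # 2
```

In a real session, `run_tests` would be something like `lambda: run_pytest("tests")`, and `request_fix` would forward the pytest output to the subagent. The round budget keeps an unattended run from looping forever on a bug the model cannot fix.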