Benchmarking Local LLMs: Qwen3 vs Qwen3.5 in Agentic Coding Workflows
These articles are AI-generated summaries. Please check the original sources for full details.
I Pushed Local LLMs Harder. Here’s What Two Models Actually Did.
Donald Cruver deployed Qwen3-Coder-Next and Qwen3.5-35B-A3B on dual AMD MI60 GPUs to build a complex health-correlation CLI tool. The experiment demonstrated that while 32k context windows fail, 131k windows allow local agents to autonomously fix bugs via mandatory test-fix loops.
Why This Matters
In theory, agentic coding models can scaffold entire applications, but local hardware faces the “context snowball” problem where reprocessing history consumes massive compute without generating tokens. This gap between cloud-scale attention caching and local inference means architectural choices like subagent delegation and specific context ceilings of 128k+ are mandatory for production-grade local development. Transitioning from tensor-parallel setups to independent GPU instances can mitigate communication overhead, though the structural lack of attention caching in local inference remains a performance bottleneck.
Key Insights
- Context overflow at 32,768 tokens: Qwen3-Coder-Next hit a hard wall during CLI module generation, requiring a bump to 65,536 tokens.
- Mandatory Test-Fix Loop: Subagents successfully diagnosed and fixed five bugs in the data layer by iteratively running pytest until all tests passed.
- Tool-Calling Parser Requirements: Qwen3-Coder-Next requires the —tool-call-parser qwen3_coder flag in vLLM to prevent silent tool-call failures.
- Independent GPU Instances: Running two llama-server instances on separate MI60 GPUs outperformed tensor-parallel mode for sequential agentic tasks by reducing inter-GPU overhead.
- Context Snowballing: Local inference lacks the attention caching of cloud providers, causing GPUs to peg at 100% while reprocessing entire conversation histories.
Working Examples
vLLM configuration for Qwen3-Coder-Next with mandatory tool-call parser flag.
vllm serve --model cyankiwi/Qwen3-Coder-Next-AWQ-4bit --tensor-parallel-size 2 --max-model-len 65536 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser qwen3_coder
llama.cpp server configuration for Qwen3.5-35B-A3B with 131k context and Jinja template support.
llama-server --ctx-size 131072 --flash-attn on --jinja --reasoning-budget -1 --dangerously-skip-permissions
Practical Applications
- Use case: Multi-module Python CLI development using the subagent pattern to isolate context windows and maintain architectural coherence.
- Pitfall: Neglecting the —jinja or —tool-call-parser flags, which causes LLM tool calls to fail silently without error messages.
- Use case: Automated bug fixing in local environments by implementing a mandatory ‘run-fail-fix’ loop in the agent’s markdown prompt files.
- Pitfall: Running unattended agentic sessions without —dangerously-skip-permissions, causing the process to stall indefinitely on hidden TUI prompts over SSH.
References:
Continue reading
Next article
Logtide 0.7.0: Completing the Observability Stack with OTLP Metrics and Service Maps
Related Content
Understanding Model Context Protocol (MCP): A Standardized Bridge for Agentic AI
Anthropic's Model Context Protocol (MCP) standardizes how LLMs securely connect to external data sources, enabling more efficient and scalable agentic workflows across fragmented enterprise APIs.
Advanced Git Commands for AI-Driven Engineering Workflows
Leverage underused Git commands like worktree and bisect to optimize context windows and debugging for AI coding agents.
Engineering Cross-Country Payroll APIs: Solving Semantic Salary Normalization
Dario at Obolus developed a unified payroll API covering 8+ countries, revealing that 'net salary' is a semantic challenge rather than a simple math problem.