Benchmarking Local LLMs: Qwen3 vs Qwen3.5 in Agentic Coding Workflows

These articles are AI-generated summaries. Please check the original sources for full details.

I Pushed Local LLMs Harder. Here’s What Two Models Actually Did.

Donald Cruver deployed Qwen3-Coder-Next and Qwen3.5-35B-A3B on dual AMD MI60 GPUs to build a complex health-correlation CLI tool. The experiment demonstrated that while 32k context windows fail, 131k windows allow local agents to autonomously fix bugs via mandatory test-fix loops.

Why This Matters

In theory, agentic coding models can scaffold entire applications, but on local hardware they run into the “context snowball” problem: reprocessing conversation history consumes substantial compute without generating any new tokens. Given this gap between cloud-scale attention caching and local inference, architectural choices such as subagent delegation and context windows of 128k tokens or more become necessary for production-grade local development. Moving from a tensor-parallel setup to independent GPU instances can reduce communication overhead, but the lack of attention caching in local inference remains a performance bottleneck.

Key Insights

  • Context overflow at 32,768 tokens: Qwen3-Coder-Next hit a hard wall during CLI module generation, requiring a bump to 65,536 tokens.
  • Mandatory Test-Fix Loop: Subagents successfully diagnosed and fixed five bugs in the data layer by iteratively running pytest until all tests passed.
  • Tool-Calling Parser Requirements: Qwen3-Coder-Next requires the --tool-call-parser qwen3_coder flag in vLLM to prevent silent tool-call failures.
  • Independent GPU Instances: Running two llama-server instances on separate MI60 GPUs outperformed tensor-parallel mode for sequential agentic tasks by reducing inter-GPU overhead.
  • Context Snowballing: Local inference lacks the attention caching of cloud providers, causing GPUs to peg at 100% while reprocessing entire conversation histories.
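The context-snowballing point can be made concrete with some rough arithmetic. The sketch below assumes, purely for illustration, that every turn adds a fixed number of tokens and that the server re-processes the full history on each turn. The function names and the input numbers are hypothetical, not measurements from the experiment.

```shell
#!/bin/sh
# Illustrative arithmetic only: the turn count and tokens-per-turn are made-up inputs.

# Total prompt tokens processed when every turn re-reads the whole history.
# Turn k processes k * per_turn tokens, so the sum is per_turn * n * (n + 1) / 2.
tokens_without_cache() {
  turns="$1"
  per_turn="$2"
  echo $(( per_turn * turns * (turns + 1) / 2 ))
}

# Total prompt tokens processed if earlier turns were cached and only the
# newly added tokens had to be processed on each turn.
tokens_with_cache() {
  turns="$1"
  per_turn="$2"
  echo $(( per_turn * turns ))
}

echo "no cache:   $(tokens_without_cache 20 2000) tokens"
echo "with cache: $(tokens_with_cache 20 2000) tokens"
```

With these assumed inputs (20 turns of 2,000 tokens each), the uncached total is 420,000 prompt tokens against 40,000 with caching, which shows why the cost grows quadratically with session length.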

Working Examples

vLLM configuration for Qwen3-Coder-Next with mandatory tool-call parser flag.

vllm serve cyankiwi/Qwen3-Coder-Next-AWQ-4bit --tensor-parallel-size 2 --max-model-len 65536 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser qwen3_coder
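Because a missing parser flag fails silently, it can help to check a response for a populated tool_calls array. This is a minimal sketch, assuming the server returns OpenAI-style chat-completion JSON. The helper name has_tool_calls and both sample responses are hypothetical.

```shell
#!/bin/sh
# has_tool_calls: read a chat-completion JSON response on stdin and succeed
# only if it contains a non-empty "tool_calls" array.
# This is a rough text match, not a real JSON parser.
has_tool_calls() {
  # Join lines first so pretty-printed JSON is matched as well.
  tr -d '\n' | grep -Eq '"tool_calls"[[:space:]]*:[[:space:]]*\[[[:space:]]*\{'
}

# Hypothetical sample responses, trimmed to the relevant fields.
good='{"choices":[{"message":{"content":null,"tool_calls":[{"type":"function","function":{"name":"get_weather","arguments":"{}"}}]}}]}'
bad='{"choices":[{"message":{"content":"<tool_call>...</tool_call>","tool_calls":[]}}]}'

if printf '%s' "$good" | has_tool_calls; then echo "good: tool call parsed"; fi
if ! printf '%s' "$bad" | has_tool_calls; then echo "bad: no tool call parsed"; fi
```

Against a live server, the response to a request that includes a tools array could be piped in, for example: curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @request.json | has_tool_calls. Here port 8000 is vLLM's default and request.json is a hypothetical file holding the request body.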

llama.cpp server configuration for Qwen3.5-35B-A3B with 131k context and Jinja template support.

llama-server --ctx-size 131072 --flash-attn on --jinja --reasoning-budget -1
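The independent-instance setup mentioned under Key Insights might look roughly like the following. This is a sketch under stated assumptions: a ROCm/HIP build of llama.cpp, where HIP_VISIBLE_DEVICES restricts each process to one GPU. The model path, the ports, and the helper name instance_cmd are placeholders, not values from the original experiment. The script prints the commands instead of running them.

```shell
#!/bin/sh
# Print (do not run) one llama-server command per GPU. Each instance sees a
# single device via HIP_VISIBLE_DEVICES and listens on its own port.
# MODEL is a placeholder path; ports 8080 and 8081 are arbitrary choices.
MODEL="${MODEL:-/path/to/model.gguf}"

instance_cmd() {
  gpu="$1"
  port="$2"
  model="$3"
  echo "HIP_VISIBLE_DEVICES=$gpu llama-server -m $model --port $port --ctx-size 131072 --flash-attn on --jinja"
}

instance_cmd 0 8080 "$MODEL"
instance_cmd 1 8081 "$MODEL"
```

Each printed line can be started in its own terminal or service unit, so an agent and its subagents can target different ports without the two GPUs exchanging tensors.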

Practical Applications

  • Use case: Multi-module Python CLI development using the subagent pattern to isolate context windows and maintain architectural coherence.
  • Pitfall: Neglecting the --jinja or --tool-call-parser flags, which causes LLM tool calls to fail silently without error messages.
  • Use case: Automated bug fixing in local environments by implementing a mandatory ‘run-fail-fix’ loop in the agent’s markdown prompt files.
  • Pitfall: Running unattended agentic sessions without the agent CLI's --dangerously-skip-permissions flag, which leaves the process stalled indefinitely on hidden TUI prompts over SSH.
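The control flow of a run-fail-fix loop can be sketched in a few lines of shell. In the original experiment the loop is described as living in the agent's markdown prompt files, so this is only an illustration of the idea, not the author's implementation. The function name run_fix_loop and the fix command are hypothetical.

```shell
#!/bin/sh
# run_fix_loop MAX_ATTEMPTS TEST_CMD FIX_CMD
# Runs TEST_CMD; on failure runs FIX_CMD and tries again, up to MAX_ATTEMPTS.
# FIX_CMD stands in for "ask the agent to repair the failure" and is hypothetical.
run_fix_loop() {
  max_attempts="$1"
  test_cmd="$2"
  fix_cmd="$3"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if sh -c "$test_cmd"; then
      echo "tests passed on attempt $attempt"
      return 0
    fi
    # Only attempt a fix if there is another test run left to verify it.
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "attempt $attempt failed; running fix step"
      sh -c "$fix_cmd"
    fi
    attempt=$((attempt + 1))
  done
  echo "tests still failing after $max_attempts attempts"
  return 1
}
```

A call such as run_fix_loop 5 "pytest -q" "./ask_agent_to_fix.sh" would re-run the suite up to five times, where ask_agent_to_fix.sh is a hypothetical wrapper around the agent. The attempt cap keeps an unattended session from looping forever on a bug the model cannot fix.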
