Implementing Qwen 3.6-35B-A3B: Multimodal MoE with Thinking Control and Tool Calling
These articles are AI-generated summaries. Please check the original sources for full details.
A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence
Qwen 3.6-35B-A3B is a Mixture-of-Experts (MoE) model with 256 experts and about 3B parameters active per token. It natively supports a 262k-token context window (262,144 tokens), extendable via YaRN, and exposes its reasoning as explicit traces inside thinking blocks.
Why This Matters
Moving from dense LLMs to a multimodal MoE model means picking a loading mode that fits the available VRAM and running an inference stack that supports specialized layers such as Gated DeltaNet. Reasoning cannot run unbounded in practice: thinking budgets cap latency, and validating structured output keeps malformed or hallucinated tool calls out of agentic workflows. Session persistence and retrieval-augmented generation, both implemented at the application layer, keep the model stateful across calls and grounded in source documents in production.
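Session persistence at the application layer can be as small as a JSON round-trip. The sketch below is one way to do it, not the tutorial's code: the function names and the `messages`/`tools` file layout are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_session(path, messages, tools):
    """Write the chat history and tool schemas to one JSON file."""
    payload = {"messages": messages, "tools": tools}
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_session(path):
    """Read a saved session back as a (messages, tools) tuple."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return payload["messages"], payload["tools"]
```

A later process can call `load_session`, append the new user turn to `messages`, and pass both values to the chat template again, so the conversation resumes where it stopped.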
Key Insights
- Qwen 3.6-35B-A3B uses a hybrid architecture that combines Gated DeltaNet, a linear-attention variant, with standard attention layers (Marktechpost, 2026).
- The MoE layer has 256 experts and activates 8 routed experts plus 1 shared expert per token, which keeps about 3B parameters active during inference.
- Thinking-Budget Control: A custom StoppingCriteria caps the number of reasoning tokens generated before the final answer, which bounds latency.
- The model accepts image, video, and text input natively, supporting grounding tasks with pixel-coordinate bounding boxes.
- Context Scaling: The native 262,144 token context can be extended to approximately 1M tokens using YaRN rope-parameter overrides.
- Session Persistence: Storing conversation history and tool schemas in JSON allows stateful agentic sessions across separate execution calls.
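The routing arithmetic behind "8 routed plus 1 shared expert" can be illustrated in plain Python. This is a toy sketch on scalar inputs, not the model's implementation: the function names are hypothetical, and the softmax over only the selected logits is an assumption about how the weights are normalised.

```python
import math

NUM_EXPERTS = 256  # total routed experts, as stated above
TOP_K = 8          # routed experts activated per token, as stated above


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    (Assumed normalisation; the article does not specify it.)
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, shared_expert):
    """Weighted sum of the routed experts plus the always-on shared expert."""
    routed = sum(w * experts[i](x) for i, w in route_token(router_logits))
    return routed + shared_expert(x)
```

Only 9 of the 257 expert functions run for a given token, which is why the active parameter count stays far below the total.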
Working Examples
A custom stopping criterion to control the maximum number of reasoning tokens generated within the thinking blocks.
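from transformers import StoppingCriteria

class ThinkingBudget(StoppingCriteria):
    """Stop generation once an unclosed <think> block reaches `budget` tokens."""
    def __init__(self, tokenizer, budget: int):
        self.budget = budget
        self.open_ids = tokenizer.encode("<think>", add_special_tokens=False)
        self.close_ids = tokenizer.encode("</think>", add_special_tokens=False)
        self.start = None  # index of the first token after <think>; use a fresh instance per generate() call

    def _find(self, seq, needle):
        # Index of the first occurrence of `needle` in `seq`, or None.
        n = len(needle)
        for i in range(len(seq) - n + 1):
            if seq[i:i + n] == needle:
                return i
        return None

    def __call__(self, input_ids, scores, **kwargs):
        seq = input_ids[0].tolist()  # only the first sequence in the batch is checked
        if self.start is None:
            idx = self._find(seq, self.open_ids)
            if idx is not None:
                self.start = idx + len(self.open_ids)
            return False
        # Thinking block already closed: never stop on budget.
        if self._find(seq[self.start:], self.close_ids) is not None:
            return False
        return (len(seq) - self.start) >= self.budget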
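A criterion like the one above only halts generation, so when the budget is hit the output ends mid-thought with no answer. One possible follow-up, sketched here under assumptions, is to close the thinking block in the decoded text and use the result as the prefix for a second generation pass. The helper name and the exact whitespace around the closing tag are hypothetical.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def close_truncated_thinking(text: str) -> str:
    """Append a closing tag if the last thinking block was cut off.

    Text with no open block is returned unchanged.
    """
    last_open = text.rfind(THINK_OPEN)
    last_close = text.rfind(THINK_CLOSE)
    if last_open != -1 and last_close < last_open:
        return text.rstrip() + "\n" + THINK_CLOSE + "\n\n"
    return text
```

Because the second pass starts after a closed block, the budget criterion never fires again and the model goes straight to the final answer.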
Adaptive model loading logic that selects quantization levels (BF16, INT8, or INT4) based on available GPU VRAM.
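import torch
from transformers import BitsAndBytesConfig

# Total memory of GPU 0 in GB (assumed definition; the excerpt does not show how VRAM_GB is set).
VRAM_GB = torch.cuda.get_device_properties(0).total_memory / 1e9 if torch.cuda.is_available() else 0
kwargs = dict(device_map="auto", trust_remote_code=True,
              low_cpu_mem_usage=True, attn_implementation="flash_attention_2",
              torch_dtype=torch.bfloat16)

if VRAM_GB >= 75: LOAD_MODE = "bf16"
elif VRAM_GB >= 40: LOAD_MODE = "int8"
else: LOAD_MODE = "int4"

if LOAD_MODE == "int8":
    # Assumed: the excerpt omits this branch; 8-bit loading via bitsandbytes.
    kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif LOAD_MODE == "int4":
    kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)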
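The YaRN extension mentioned under Key Insights comes down to one scaling factor: the target length divided by the native 262,144 tokens. The sketch below computes that factor and wraps it in a dictionary. The key names follow the YaRN rope-scaling convention used in Hugging Face configs; whether this model's config reads exactly these keys is an assumption, and the function name is hypothetical.

```python
import math

NATIVE_CONTEXT = 262_144  # native window stated in the article


def yarn_override(target_tokens: int, native: int = NATIVE_CONTEXT) -> dict:
    """Build a YaRN rope-scaling dict large enough to cover target_tokens."""
    if target_tokens <= native:
        raise ValueError("target fits in the native window; no scaling needed")
    # Round up to a whole multiple of the native window (a design choice here).
    factor = float(math.ceil(target_tokens / native))
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native,
    }
```

For a 1,000,000-token target the factor comes out as 4.0, which covers 4 × 262,144 = 1,048,576 tokens and matches the "approximately 1M" figure.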
Practical Applications
- Agentic Workflows: Implementing tool-calling loops for arithmetic and document search using TOOL_CALL_RE for extraction. Pitfall: Failing to validate JSON outputs with schemas leads to execution errors in automated pipelines.
- Visual Grounding: Locating distinct objects in images using pixel coordinates for automated inspection. Pitfall: Incorrectly formatted bounding box arrays can break downstream spatial logic systems.
- Long-Context RAG: Using semantic retrieval with SentenceTransformers to ground answers in 262k-token technical documentation. Pitfall: Oversaturating the context window without YaRN optimization can degrade retrieval precision.
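The validation pitfall in the agentic workflow item can be made concrete with a small parser. The article names TOOL_CALL_RE but does not show its pattern, so the regex below is a guess that assumes tool calls arrive as JSON wrapped in `<tool_call>` tags. The `TOOL_SPECS` registry and its two tools are hypothetical stand-ins for the arithmetic and document-search tools.

```python
import json
import re

# Assumed wire format: a JSON object between <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Hypothetical registry: tool name -> required argument names and types.
TOOL_SPECS = {
    "calculate": {"expression": str},
    "search_docs": {"query": str},
}


def extract_tool_calls(text):
    """Return (valid_calls, errors) parsed from raw model output."""
    calls, errors = [], []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc.msg}")
            continue
        name, args = call.get("name"), call.get("arguments")
        spec = TOOL_SPECS.get(name)
        if spec is None:
            errors.append(f"unknown tool: {name!r}")
        elif not isinstance(args, dict):
            errors.append(f"{name}: arguments must be an object")
        elif any(not isinstance(args.get(k), t) for k, t in spec.items()):
            errors.append(f"{name}: missing or mistyped argument")
        else:
            calls.append({"name": name, "arguments": args})
    return calls, errors
```

Returning the errors instead of raising lets the loop feed them back to the model as a tool response, so a malformed call becomes a retry and not a pipeline failure.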