Thinking Machines Lab Unveils Interaction Models: Native Multimodal Architecture for Real-Time AI

These articles are AI-generated summaries. Please check the original sources for full details.

Mira Murati’s Thinking Machines Lab Introduces Interaction Models: A Native Multimodal Architecture for Real-Time Human-AI Collaboration

Thinking Machines Lab has introduced a research preview of “interaction models” that move beyond turn-based AI toward native interactivity. Their TML-Interaction-Small model uses 200ms time-aligned micro-turns to process audio, video, and text streams simultaneously.
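
To make the micro-turn idea concrete, here is a minimal sketch of grouping timestamped audio, video, and text events into shared 200ms windows. Only the 200ms figure comes from the summary above; the `MicroTurn` class, the `align_micro_turns` function, and the `(timestamp_ms, payload)` event format are hypothetical names chosen for illustration and are not taken from Thinking Machines Lab's implementation.

```python
from dataclasses import dataclass, field

MICRO_TURN_MS = 200  # micro-turn length stated in the article


@dataclass
class MicroTurn:
    """One time-aligned window across all input streams (hypothetical structure)."""
    index: int
    start_ms: int
    audio: list = field(default_factory=list)   # audio payloads in this window
    frames: list = field(default_factory=list)  # video payloads in this window
    text: list = field(default_factory=list)    # text payloads in this window


def align_micro_turns(audio_events, video_events, text_events, turn_ms=MICRO_TURN_MS):
    """Group (timestamp_ms, payload) events from three streams into shared windows."""
    turns = {}
    streams = (("audio", audio_events), ("frames", video_events), ("text", text_events))
    for name, events in streams:
        for ts_ms, payload in events:
            idx = ts_ms // turn_ms  # which micro-turn this event falls into
            if idx not in turns:
                turns[idx] = MicroTurn(index=idx, start_ms=idx * turn_ms)
            getattr(turns[idx], name).append(payload)
    # Return windows in time order; windows with no events are simply absent.
    return [turns[i] for i in sorted(turns)]
```

Calling `align_micro_turns([(0, "a0"), (250, "a2")], [(33, "f0")], [(410, "hi")])` would place the first audio chunk and the video frame in the same window, so a model consuming these windows sees all modalities for a given 200ms slice together rather than one modality per turn.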

Why This Matters

Current AI systems rely on a “harness” of separate components, such as Voice-Activity Detection (VAD), to simulate responsiveness. These components are far less capable than the core model, and the model's perception freezes while it generates a response. By building interactivity into the model architecture itself, Thinking Machines Lab removes the turn-based bottleneck, so the system can react to visual cues and interruptions in real time without the latency or capability loss of stitched-together modular systems.

Key Insights

  • The TML-Interaction-Small model (2026) is a 276B-parameter Mixture-of-Experts (MoE) architecture with 12B active parameters, optimized for real-time streaming.
  • Encoder-free early fusion removes separate pretrained encoders like Whisper, instead using dMel for audio and hMLP for video patches co-trained within the transformer.
  • The system employs a dual-model design where an interaction model maintains the live stream while a background model handles asynchronous reasoning tasks.
  • Inference optimization utilizes streaming sessions and a gather+gemv strategy for MoE kernels, techniques upstreamed to the SGLang open-source framework.
  • TML-Interaction-Small achieved 0.40s latency on FD-bench v1 (lower is better), compared with 1.18s for GPT-realtime-2.0 and 0.57s for Gemini.
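
With 12B of 276B parameters active, only about 4% of the weights are touched per token, which is why the MoE kernel strategy matters for streaming latency. The summary does not describe how the gather+gemv kernels work, so the following is only one plausible reading of the name: for a single token, select the top-k experts, gather just their weight matrices, and run a matrix-vector product (GEMV) per expert. All function names, the softmax gating, and the toy weights are assumptions for illustration, not the SGLang or Thinking Machines Lab implementation.

```python
import math


def gemv(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def moe_gather_gemv(x, router, experts, k=2):
    """Route one token to k experts and combine their outputs (illustrative only)."""
    scores = gemv(router, x)          # one routing score per expert
    chosen = top_k(scores, k)

    # Softmax over the chosen experts only (an assumed gating scheme).
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    gates = [e / sum(exps) for e in exps]

    # "Gather": pull out only the weight matrices of the selected experts.
    gathered = [experts[i] for i in chosen]

    # "GEMV": one matrix-vector product per gathered expert, gate-weighted sum.
    out = [0.0] * len(gathered[0])
    for gate, weights in zip(gates, gathered):
        y = gemv(weights, x)
        out = [o + gate * v for o, v in zip(out, y)]
    return chosen, out
```

The point of the sketch is the shape of the work: with one token in flight, each selected expert contributes a matrix-vector product rather than a batched matrix-matrix product, and unselected experts are never read.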

Practical Applications

  • Use Case: Real-time visual assistance for physical activities where the model must count repetitions (RepCount-A). Pitfall: Using standard turn-based models results in near-zero (1.3%) accuracy due to lack of temporal awareness.
  • Use Case: Collaborative web browsing and tool execution during active conversation. Pitfall: Relying on a single-model approach causes abrupt context switches, whereas the TML dual-model architecture interleaves updates seamlessly.
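
The dual-model pattern described above can be sketched with `asyncio`: a foreground loop keeps ticking every micro-turn while a slower background task runs, and the result is folded into the stream at the first tick after it is ready. This is a simplified illustration under stated assumptions; `background_model`, `interaction_loop`, and the `"listening"` placeholder are hypothetical names, and the real system's scheduling is not described in the summary.

```python
import asyncio

MICRO_TURN_S = 0.2  # 200ms micro-turn, as stated in the article


async def background_model(task, delay_s):
    """Hypothetical stand-in for slow asynchronous reasoning or tool use."""
    await asyncio.sleep(delay_s)
    return f"result for {task!r}"


async def interaction_loop(num_turns, task, turn_s=MICRO_TURN_S, background_delay_s=0.5):
    """Tick once per micro-turn; interleave the background result when it arrives."""
    log = []
    pending = asyncio.create_task(background_model(task, background_delay_s))
    for turn in range(num_turns):
        await asyncio.sleep(turn_s)  # the live stream never blocks on the background task
        if pending is not None and pending.done():
            log.append((turn, pending.result()))  # fold the result into this turn
            pending = None
        else:
            log.append((turn, "listening"))       # keep perceiving and responding
    if pending is not None:
        pending.cancel()  # session ended before the background work finished
    return log
```

Running `asyncio.run(interaction_loop(5, "search flights"))` yields one log entry per micro-turn, with the background result appearing in a single turn partway through instead of stalling the turns before it.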
