Thinking Machines Lab Unveils Interaction Models: Native Multimodal Architecture for Real-Time AI

Mira Murati’s Thinking Machines Lab Introduces Interaction Models: A Native Multimodal Architecture for Real-Time Human-AI Collaboration

Thinking Machines Lab has introduced a research preview of “interaction models,” which replace turn-based exchange with native interactivity. The lab's TML-Interaction-Small model processes audio, video, and text streams simultaneously in time-aligned 200 ms micro-turns.
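To make the micro-turn idea concrete, here is a minimal sketch of how timestamped events from several streams could be grouped into shared 200 ms windows. Only the 200 ms figure comes from the article; the event tuple layout, the modality labels, and the function name `bucket_into_micro_turns` are assumptions made for illustration, not the lab's actual input pipeline.

```python
from collections import defaultdict

MICRO_TURN_MS = 200  # window length quoted in the article


def bucket_into_micro_turns(events, turn_ms=MICRO_TURN_MS):
    """Group (timestamp_ms, modality, payload) events into time-aligned windows.

    Hypothetical helper: returns a list of (turn_index, {modality: [payloads]})
    ordered by turn index, so each window holds whatever arrived on every
    stream during the same 200 ms slice.
    """
    turns = defaultdict(lambda: defaultdict(list))
    for timestamp_ms, modality, payload in events:
        turn_index = timestamp_ms // turn_ms  # integer window number
        turns[turn_index][modality].append(payload)
    return [(idx, dict(sorted(turns[idx].items()))) for idx in sorted(turns)]


def example_events():
    """A few made-up events spanning three micro-turns."""
    return [
        (0, "audio", "a0"),
        (150, "video", "v0"),
        (210, "audio", "a1"),
        (399, "text", "hi"),
        (400, "audio", "a2"),
    ]
```

Calling `bucket_into_micro_turns(example_events())` puts the audio chunk at 0 ms and the video frame at 150 ms in the same window, which is the sense in which the streams are time-aligned rather than handled one turn at a time.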

Why This Matters

Current AI systems rely on a “harness” of separate components, such as Voice-Activity Detection (VAD), to simulate responsiveness. These components are fundamentally less intelligent than the core model, and perception freezes while a response is being generated. By building interactivity into the model architecture itself, Thinking Machines Lab removes the turn-based bottleneck: the system can react to visual cues and interruptions in real time, without the latency or loss of intelligence that comes with stitched-together modular systems.

Key Insights

TML-Interaction-Small (2026) is a 276B-parameter Mixture-of-Experts (MoE) model with 12B active parameters, optimized for real-time streaming.
Encoder-free early fusion removes separate pretrained encoders like Whisper, instead using dMel for audio and hMLP for video patches co-trained within the transformer.
The system employs a dual-model design where an interaction model maintains the live stream while a background model handles asynchronous reasoning tasks.
Inference is optimized with streaming sessions and a gather+gemv strategy for MoE kernels; these techniques were upstreamed to the open-source SGLang framework.
TML-Interaction-Small achieved a latency of 0.40 s on FD-bench v1, compared with 1.18 s for GPT-realtime-2.0 and 0.57 s for Gemini (lower is better).
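The article does not describe the gather+gemv kernel itself, so the following is only a sketch of one plausible reading of the term: when a single token is being decoded, select the top-k experts, gather just their weight matrices, and apply a matrix-vector product (GEMV) to each, instead of running every expert. The function names, the softmax-over-top-k gating, and the pure-Python lists are all assumptions made for illustration; this is not the SGLang or Thinking Machines implementation.

```python
import math


def gemv(matrix, vec):
    """Matrix-vector product: one dot product per row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def top_k_gates(logits, k):
    """Indices of the k largest router logits and their softmax weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def moe_gather_gemv(x, router_w, expert_w, k):
    """Route one token: gather the selected experts, then one GEMV each."""
    top, gates = top_k_gates(gemv(router_w, x), k)
    gathered = [expert_w[i] for i in top]      # gather: only k expert matrices
    outs = [gemv(w, x) for w in gathered]      # gemv: k matrix-vector products
    return [sum(g * o[j] for g, o in zip(gates, outs))
            for j in range(len(outs[0]))]


def moe_dense_reference(x, router_w, expert_w, k):
    """Same result the slow way: run every expert, zero out the unselected."""
    top, gates = top_k_gates(gemv(router_w, x), k)
    full_gate = [0.0] * len(expert_w)
    for i, g in zip(top, gates):
        full_gate[i] = g
    outs = [gemv(w, x) for w in expert_w]
    return [sum(g * o[j] for g, o in zip(full_gate, outs))
            for j in range(len(outs[0]))]
```

The gathered path touches only k expert matrices per token, which reflects the same sparsity behind the article's 12B-active-of-276B figure (roughly 4% of parameters per token).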

Practical Applications

Use Case: Real-time visual assistance for physical activities where the model must count repetitions (RepCount-A). Pitfall: Standard turn-based models score near zero (1.3% accuracy) on this task because they lack temporal awareness.
Use Case: Collaborative web browsing and tool execution during active conversation. Pitfall: Relying on a single-model approach causes abrupt context switches, whereas the TML dual-model architecture interleaves updates seamlessly.
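The dual-model pattern can be sketched with `asyncio`: a foreground loop keeps emitting micro-turn ticks while a delegated background task works, and the result is folded into the stream when it is ready. This is a toy illustration under stated assumptions. The names `background_reasoner` and `run_session`, the queue-based hand-off, and the use of `asyncio.sleep(0)` as a stand-in for slow reasoning are all hypothetical and not taken from the source.

```python
import asyncio


async def background_reasoner(query, steps, outbox):
    """Stand-in for the slower background model (hypothetical)."""
    for _ in range(steps):
        await asyncio.sleep(0)  # yield control; placeholder for real work
    await outbox.put(f"result for {query!r}")


async def run_session(n_turns, delegate_at, reasoning_steps):
    """Foreground loop that never blocks on the background task.

    Returns a log of ("tick", turn), ("delegate", turn) and
    ("background", text) entries in the order they occurred.
    """
    log = []
    outbox = asyncio.Queue()
    task = None
    for turn in range(n_turns):
        if turn == delegate_at:
            # Hand off a long-running request without pausing the stream.
            task = asyncio.create_task(
                background_reasoner("look this up", reasoning_steps, outbox))
            log.append(("delegate", turn))
        while not outbox.empty():
            # Interleave any finished background result into the live stream.
            log.append(("background", outbox.get_nowait()))
        log.append(("tick", turn))
        await asyncio.sleep(0)  # end of this micro-turn; let other tasks run
    if task is not None:
        await task  # make sure a late result is not lost
    while not outbox.empty():
        log.append(("background", outbox.get_nowait()))
    return log
```

Running `asyncio.run(run_session(8, 2, 3))` yields a log in which ticks continue after the delegation and the background result appears between ticks rather than replacing them, which is the interleaving behavior the use case describes.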

References:

https://www.marktechpost.com/2026/05/13/mira-muratis-thinking-machines-lab-introduces-interaction-models-a-native-multimodal-architecture-for-real-time-human-ai-collaboration/
