MEM for Robots: Physical Intelligence Unveils 15-Minute Memory System for Gemma 3-4B VLAs
These articles are AI-generated summaries. Please check the original sources for full details.
Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks
Researchers from Physical Intelligence, Stanford, UC Berkeley, and MIT have introduced Multi-Scale Embodied Memory (MEM) for robotic policies. The system lets Vision-Language-Action (VLA) models condition on up to 15 minutes of context, addressing the lack of memory in conventional end-to-end policies.
Why This Matters
Current robotic policies typically operate on a single observation or a very short history, which makes long-horizon tasks like kitchen cleaning either computationally intractable or prone to failure. MEM factorizes memory into two scales: short-term memory held as video and long-term memory held as language. This keeps inference under the 380ms real-time threshold while letting robots adapt their manipulation strategies after recent failures.
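The two-scale split can be pictured with a minimal sketch. It is not Physical Intelligence's implementation: the class `MultiScaleMemory` and its methods `observe`, `note`, and `context` are hypothetical names, frames are stand-in strings, and summaries are written by hand, whereas a real system would presumably have the model generate them.

```python
from collections import deque


class MultiScaleMemory:
    """Toy two-scale memory: a short window of frames plus text notes."""

    def __init__(self, max_frames=16):
        # Short-term scale: only the most recent frames are kept.
        self.frames = deque(maxlen=max_frames)
        # Long-term scale: compact language summaries of older events.
        self.summaries = []

    def observe(self, frame):
        """Add a new frame; the oldest one is dropped once the window is full."""
        self.frames.append(frame)

    def note(self, event):
        """Record a language summary (hand-written in this sketch)."""
        self.summaries.append(event)

    def context(self):
        """Return what a policy would condition on at this step."""
        return {"video": list(self.frames), "language": " ".join(self.summaries)}


memory = MultiScaleMemory()
for step in range(40):
    memory.observe(f"frame_{step}")  # placeholder for an image observation
memory.note("I placed three bowls.")
ctx = memory.context()
print(len(ctx["video"]), ctx["video"][0], ctx["language"])
```

The point of the design is that the video window stays a fixed size no matter how long the episode runs, while the language side grows only by short sentences.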
Key Insights
- A 62% increase in success rate on refrigerator-opening tasks where the hinge direction is unknown (MEM Research, 2026)
- Space-Time Separable Attention, which interleaves spatial attention within each frame with causal-temporal attention across frames, reducing complexity from O(n^2K^2) to O(Kn^2+nK^2) for n patches per frame and K frames
- Gemma 3-4B, used by the Physical Intelligence and Stanford researchers as the foundation for the π0.6 VLA backbone
- Language-based long-term memory that compresses 15 minutes of events into semantic summaries such as ‘I placed three bowls’
- An implementation on a single NVIDIA H100 GPU that processes 16 observation frames while staying under the 380ms real-time barrier
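To get a feel for the saving from separable attention, the sketch below counts query-key pairs under a standard estimate for factorized attention. It assumes one spatial pass within each frame and one temporal pass across frames per patch position. The function names and the value of 256 patches per frame are illustrative assumptions, not figures from the MEM work; only the 16-frame count comes from the list above.

```python
def joint_attention_pairs(n: int, k: int) -> int:
    """Query-key pairs when all n*k tokens attend to each other."""
    return (n * k) ** 2


def separable_attention_pairs(n: int, k: int) -> int:
    """Query-key pairs for one spatial pass plus one temporal pass.

    Spatial: each of the k frames attends over its own n patches (k * n^2).
    Temporal: each of the n patch positions attends over k frames (n * k^2).
    A causal mask would roughly halve the temporal term without changing
    its order of growth, so it is ignored here.
    """
    return k * n * n + n * k * k


n_patches = 256  # hypothetical patches per frame
n_frames = 16    # observation frames, as quoted above

joint = joint_attention_pairs(n_patches, n_frames)
separable = separable_attention_pairs(n_patches, n_frames)
print(f"joint: {joint:,}  separable: {separable:,}  ratio: {joint / separable:.1f}x")
```

Under these assumed sizes the separable scheme scores roughly 15 times fewer pairs, and the gap widens as more frames are added, because the joint cost grows with K^2 on every patch pair.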
Practical Applications
- Use Case: A π0.6 VLA performing ‘Recipe Setup’ by retrieving ingredients from multiple locations over 15 minutes. Pitfall: Memory-less VLAs fail such tasks significantly more often because they see only a short history.
- Use Case: A MEM-based robot adapting its manipulation strategy in real time to pick up chopsticks at variable heights. Pitfall: Single-observation models fail to resolve self-occlusions or adapt their grasps during execution.