Meta Releases TRIBE v2: A Tri-Modal Foundation Model for High-Resolution fMRI Prediction
These articles are AI-generated summaries. Please check the original sources for full details.
Meta Releases TRIBE v2: A Brain Encoding Model That Predicts fMRI Responses Across Video, Audio, and Text Stimuli
Meta’s FAIR team has introduced TRIBE v2, a tri-modal foundation model designed to align AI latent representations with human brain activity. The system processes stimuli through specialized encoders, including LLaMA 3.2-3B and V-JEPA2-Giant, to predict cortical and subcortical responses. It was trained on 451.6 hours of fMRI data, and its predictive power scales log-linearly with training data volume.
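To make "log-linear scaling" concrete: it means accuracy grows roughly as a + b * ln(hours), so every doubling of training data adds about the same fixed increment. The sketch below fits that relationship by ordinary least squares. The function names and the demo numbers are illustrative assumptions, not values or code from the TRIBE v2 release.

```python
import math


def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * ln(hours). Returns (a, b)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, hours):
    """Predicted score for a given number of training hours."""
    return a + b * math.log(hours)


# Synthetic points generated from a known curve (NOT results from the paper),
# used only to show that the fit recovers the intercept and slope.
demo_hours = [10, 20, 40, 80, 160]
demo_scores = [0.10 + 0.03 * math.log(h) for h in demo_hours]

a, b = fit_log_linear(demo_hours, demo_scores)

# Under a log-linear law, each doubling of data adds the same increment.
gain_per_doubling = b * math.log(2)
```

With real measurements, a fit like this gives a rough way to estimate how much additional accuracy another doubling of recording time might buy, assuming the trend continues.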
Why This Matters
Traditional neuroscience has been limited by fragmented models that map specific cognitive functions to isolated brain regions and lack a unified framework for multisensory integration. TRIBE v2 addresses this fragmentation by aligning the latent representations of state-of-the-art AI architectures with human brain activity, enabling high-resolution fMRI predictions across diverse naturalistic conditions. By overcoming data scarcity through ‘deep’ training sets and leveraging frozen foundation models, the architecture moves neuroimaging from narrow experimental paradigms toward a scalable, ‘in-silico’ foundation for brain encoding.
Key Insights
- TRIBE v2 uses LLaMA 3.2-3B for text, V-JEPA2-Giant for video, and Wav2Vec-BERT 2.0 for audio, extracting embeddings on a 2 Hz grid.
- A temporal Transformer with 8 layers and 8 attention heads aggregates multi-modal embeddings across a 100-second window.
- The model follows log-linear scaling laws: encoding accuracy rises steadily with training data volume, in evaluations covering 1,117.7 hours of data from 720 subjects.
- Zero-shot generalization on the HCP 7T dataset reached a group correlation near 0.4, roughly double the accuracy of the median individual subject’s recording.
- The model identifies five functional networks—primary auditory, language, motion, default mode, and visual—purely through emergent internal representations.
- Fine-tuning on just one hour of new subject data provides a two- to four-fold improvement over traditional linear models trained from scratch.
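The first two insights imply a simple piece of arithmetic: a 2 Hz grid over a 100-second window gives 200 time steps per window for the temporal Transformer. The sketch below shows one way per-modality feature streams could be aligned to that grid and cut into windows. Only the 2 Hz rate and the 100-second window come from the article. The hold-last resampling, the concatenation of modalities, the non-overlapping windows, and all function names are assumptions made for illustration.

```python
from bisect import bisect_right

GRID_HZ = 2            # sampling rate of the shared grid (from the article)
WINDOW_SECONDS = 100   # temporal context of the Transformer (from the article)
STEPS_PER_WINDOW = GRID_HZ * WINDOW_SECONDS  # 200 time steps per window


def grid_times(duration_s, rate_hz=GRID_HZ):
    """Timestamps (in seconds) of the shared grid for a stimulus."""
    return [i / rate_hz for i in range(int(duration_s * rate_hz))]


def resample_to_grid(timestamps, features, grid):
    """Assumed scheme: at each grid time, hold the latest feature at or before it."""
    out = []
    for t in grid:
        idx = bisect_right(timestamps, t) - 1
        out.append(features[max(idx, 0)])
    return out


def split_windows(sequence, steps=STEPS_PER_WINDOW):
    """Assumed scheme: non-overlapping windows, dropping a trailing partial one."""
    return [sequence[i:i + steps]
            for i in range(0, len(sequence) - steps + 1, steps)]


# Hypothetical 250-second stimulus with toy features:
# a 1 Hz "video" stream (2-dim) and a 4 Hz "audio" stream (1-dim).
video_ts = [float(i) for i in range(250)]
video_feats = [[i, 0.0] for i in range(250)]
audio_ts = [i / 4 for i in range(1000)]
audio_feats = [[i] for i in range(1000)]

grid = grid_times(250)
video_on_grid = resample_to_grid(video_ts, video_feats, grid)
audio_on_grid = resample_to_grid(audio_ts, audio_feats, grid)

# Concatenate modalities per time step (an assumption about the fusion step).
fused = [v + a for v, a in zip(video_on_grid, audio_on_grid)]
windows = split_windows(fused)
```

In this toy setup the 250-second stimulus yields 500 grid steps and two full 200-step windows. A real pipeline would likely interpolate or pool encoder outputs rather than hold the last value.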
Practical Applications
- In-silico neuroimaging: Researchers can pilot or pre-screen experimental designs by running virtual tests on the Individual Brain Charting (IBC) dataset to localize landmarks like the fusiform face area.
- Efficient subject adaptation: Clinical or research environments can achieve high-resolution brain mapping for new participants with minimal data collection using TRIBE v2’s one-epoch fine-tuning.
- Multi-modal stimulus analysis: Systems can predict human cognitive responses to complex media containing simultaneous text, audio, and video inputs to better understand integrated sensory processing.