Meta Releases TRIBE v2: A Tri-Modal Foundation Model for High-Resolution fMRI Prediction

These articles are AI-generated summaries. Please check the original sources for full details.

Meta Releases TRIBE v2: A Brain Encoding Model That Predicts fMRI Responses Across Video, Audio, and Text Stimuli

Meta’s FAIR team has introduced TRIBE v2, a tri-modal foundation model designed to align AI latent representations with human brain activity. The system processes stimuli through specialized encoders, including LLaMA 3.2-3B and V-JEPA2-Giant, to predict cortical and subcortical responses. It was trained on 451.6 hours of fMRI data, and its predictive power scales log-linearly with the volume of training data.
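To make the scaling claim concrete: log-linear scaling means accuracy grows roughly as a straight line in the logarithm of training hours, so each doubling of data adds about the same increment. The sketch below fits such a line by least squares. The data points are synthetic placeholders generated for illustration, not TRIBE v2 results, and `fit_log_linear` is a hypothetical helper, not part of any released code.

```python
import math

def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * ln(hours). Returns (a, b)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic, made-up points (NOT TRIBE v2 results), generated from
# score = 0.05 + 0.03 * ln(hours) so the fit can be checked by eye.
hours = [10, 20, 40, 80, 160, 320]
scores = [0.05 + 0.03 * math.log(h) for h in hours]

a, b = fit_log_linear(hours, scores)

# Under a log-linear law, every doubling of data adds the same amount.
gain_per_doubling = b * math.log(2)
```

On real measurements the points would scatter around the line, and the slope `b` would summarize how much accuracy each additional order of magnitude of data buys.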

Why This Matters

Traditional neuroscience has been limited by fragmented models that map specific cognitive functions to isolated brain regions and lack a unified framework for multisensory integration. TRIBE v2 addresses this gap by aligning the latent representations of state-of-the-art AI architectures with human brain activity, enabling high-resolution fMRI predictions across diverse naturalistic conditions. By overcoming data scarcity through ‘deep’ training sets and building on frozen foundation models, the architecture moves neuroimaging from narrow experimental paradigms toward a scalable, ‘in-silico’ foundation for brain encoding.

Key Insights

  • TRIBE v2 uses LLaMA 3.2-3B for text, V-JEPA2-Giant for video, and Wav2Vec-BERT 2.0 for audio, extracting embeddings on a 2 Hz grid.
  • A temporal Transformer with 8 layers and 8 attention heads aggregates multi-modal embeddings across a 100-second window.
  • Encoding accuracy follows a log-linear scaling law, rising steadily as training data grows; evaluation spans 1,117.7 hours of data from 720 subjects.
  • Zero-shot generalization on the HCP 7T dataset achieved a group correlation near 0.4, roughly double what the median individual subject’s recording achieves.
  • The model recovers five functional networks (primary auditory, language, motion, default mode, and visual) purely from its emergent internal representations.
  • Fine-tuning on just one hour of new subject data provides a two- to four-fold improvement over traditional linear models trained from scratch.
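The 2 Hz grid and 100-second window mentioned above imply 200 time steps per window. As a rough illustration of what placing features on such a grid involves, here is a minimal sketch that linearly interpolates a one-dimensional feature series onto a uniform timeline. The article does not say how TRIBE v2 resamples its embeddings, so linear interpolation is an assumption, and `resample_to_grid` is a hypothetical helper. Real embeddings are vectors; a scalar series is used here for brevity.

```python
RATE_HZ = 2      # grid rate stated in the article
WINDOW_S = 100   # window length stated in the article
STEPS_PER_WINDOW = int(WINDOW_S * RATE_HZ)  # 200 time steps

def resample_to_grid(times, values, rate_hz, duration_s):
    """Linearly interpolate (times, values) onto a uniform grid.

    `times` must be strictly ascending. Grid points outside the
    sampled range are clamped to the nearest sample value.
    """
    n_steps = int(duration_s * rate_hz)
    grid = [i / rate_hz for i in range(n_steps)]
    out = []
    j = 0  # index of the sample interval containing the current grid point
    for t in grid:
        if t <= times[0]:
            out.append(values[0])
            continue
        if t >= times[-1]:
            out.append(values[-1])
            continue
        while times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(values[j] * (1 - w) + values[j + 1] * w)
    return grid, out

# Toy feature sampled at 1 Hz, resampled onto the 2 Hz grid for 3 seconds.
grid, resampled = resample_to_grid([0, 1, 2, 3], [0, 10, 20, 30], RATE_HZ, 3)
```

Once every modality sits on the same timeline, the per-step features can be stacked and passed to a sequence model such as the temporal Transformer described above.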

Practical Applications

  • In-silico neuroimaging: Researchers can pilot or pre-screen experimental designs by running virtual tests on the Individual Brain Charting (IBC) dataset to localize landmarks like the fusiform face area.
  • Efficient subject adaptation: Clinical or research environments can achieve high-resolution brain mapping for new participants with minimal data collection using TRIBE v2’s one-epoch fine-tuning.
  • Multi-modal stimulus analysis: Systems can predict human cognitive responses to complex media containing simultaneous text, audio, and video inputs to better understand integrated sensory processing.
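Any of these applications depends on scoring predicted responses against recorded ones, and the article reports such scores as correlations. The sketch below computes a per-voxel Pearson correlation between predicted and measured time series. The article does not specify the exact metric or aggregation used, so Pearson correlation per voxel is an assumption, and `pearson_r` and `encoding_scores` are hypothetical helpers with toy data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def encoding_scores(predicted, measured):
    """One correlation per voxel; each argument is a list of per-voxel time series."""
    return [pearson_r(p, m) for p, m in zip(predicted, measured)]

# Toy data: voxel 0 is predicted perfectly (up to scale), voxel 1 is anti-correlated.
predicted = [[1, 2, 3, 4], [1, 2, 3, 4]]
measured = [[2, 4, 6, 8], [4, 3, 2, 1]]
scores = encoding_scores(predicted, measured)
```

Because Pearson correlation ignores scale and offset, it rewards getting the shape of the response right, which is why the first toy voxel scores 1 even though its amplitude is doubled.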
