Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval
These articles are AI-generated summaries. Please check the original sources for full details.
Meta AI has released Perception Encoder Audiovisual (PE-AV), a new encoder family for joint audio and video understanding, trained on about 100 million audio-video pairs with text captions. The release extends Meta’s Perception Encoder (PE) stack and is reported to outperform earlier encoders such as SigLIP2 and InternVideo2.
PE-AV addresses the challenge of creating a unified embedding space for audio, video, and text, moving beyond specialized models for each modality. Current multimodal models often struggle with generalization and require extensive task-specific fine-tuning, leading to significant development costs and limited scalability.
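To make the idea of a unified embedding space concrete, here is a minimal sketch of cross-modal retrieval by cosine similarity. It assumes embeddings have already been computed; the vectors, file names, and the `retrieve` helper are made up for illustration and are not part of the PE-AV release.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, candidates, top_k=1):
    """Return the names of the top_k candidates most similar to the query (hypothetical helper)."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in candidates.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy 3-dimensional embeddings; a real encoder would produce much larger vectors.
text_query = [0.9, 0.1, 0.0]  # stands in for the caption "a dog barking"
audio_clips = {
    "dog_bark.wav": [0.8, 0.2, 0.1],
    "rain.wav": [0.0, 0.1, 0.9],
    "piano.wav": [0.1, 0.9, 0.2],
}

best = retrieve(text_query, audio_clips, top_k=2)
print(best)
```

Because text and audio live in the same space in this sketch, a text query can rank audio clips directly, with no task-specific model in between.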
Key Insights
- 100M Audio-Video Pairs: PE-AV was pre-trained on a large-scale dataset of 100 million audio-video pairs with text captions.
- DAC VAE for Audio: The model utilizes a DAC VAE codec to convert raw waveforms into discrete audio tokens, enabling efficient processing.
- SAM Audio Integration: PE-AV serves as the core perception engine for Meta’s SAM Audio model, enabling prompt-based audio separation and sound event localization.
Working Example
# PE-AV is trained with a contrastive loss applied across ten modality pairs.
# Conceptual sketch only (the actual implementation is within the framework):
# loss = sum(contrastive_loss(a, b) for (a, b) in modality_pairs)
# Each term pulls the embeddings of matching items closer together
# and pushes the embeddings of non-matching items further apart.
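A runnable toy version of this idea is sketched below. It assumes a symmetric InfoNCE-style loss summed over every pair of modalities in a batch. The summary only says "contrastive loss", so the exact loss form, the temperature value, and the function names are assumptions, and the sketch uses three toy modalities (three pairs) rather than the ten pairs mentioned above.

```python
import math
from itertools import combinations


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(xs, ys, temperature=0.1):
    """One-directional InfoNCE: item i in xs should match item i in ys."""
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    total = 0.0
    for i, x in enumerate(xs):
        logits = [sum(a * b for a, b in zip(x, y)) / temperature for y in ys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(xs)


def pairwise_contrastive_loss(batch, temperature=0.1):
    """Sum a symmetric InfoNCE term over every pair of modalities (assumed loss form)."""
    loss = 0.0
    for a, b in combinations(sorted(batch), 2):
        forward = info_nce(batch[a], batch[b], temperature)
        backward = info_nce(batch[b], batch[a], temperature)
        loss += 0.5 * (forward + backward)
    return loss


# Two items per modality; row i in each modality describes the same clip.
aligned = {
    "audio": [[1.0, 0.0], [0.0, 1.0]],
    "video": [[0.9, 0.1], [0.1, 0.9]],
    "text": [[1.0, 0.1], [0.0, 1.0]],
}
# Reversing the video rows breaks the correspondence with audio and text.
shuffled = dict(aligned, video=aligned["video"][::-1])

aligned_loss = pairwise_contrastive_loss(aligned)
shuffled_loss = pairwise_contrastive_loss(shuffled)
print(aligned_loss < shuffled_loss)
```

The loss is lower when corresponding rows across modalities point in similar directions, which is the behavior the conceptual comments describe.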
Practical Applications
- SAM Audio: Meta’s SAM Audio uses PE-AV embeddings to separate sound sources in complex audio mixtures.
- Pitfall: Relying solely on unimodal models (e.g., audio-only or video-only) can lead to inaccurate or incomplete understanding of the scene, especially in noisy or ambiguous environments.