Netflix AI Open-Sources VOID: Physics-Aware Video Object Removal
These articles are AI-generated summaries. Please check the original sources for full details.
Netflix AI Team Just Open-Sourced VOID: an AI Model That Erases Objects From Videos — Physics and All
Netflix AI and INSAIT have released VOID (Video Object and Interaction Deletion), an open-source model that accounts for physical causality in video editing. Beyond erasing an object, it regenerates the physical consequences of that object's absence: for example, a guitar falls naturally once the person holding it is removed from the scene.
Why This Matters
Standard video inpainting models function as sophisticated background painters but cannot reason about physical causality. When an object is removed, they often leave secondary props defying gravity or ignore scene changes that should follow, which can mean weeks of manual VFX work to fix. VOID addresses this gap by moving beyond pixel-filling to scene understanding. By incorporating physical interactions into the diffusion process, it is designed to remove the need for manual correction of the shadows, reflections, and gravity-driven movements that follow an object's deletion.
Key Insights
- VOID is built on CogVideoX-Fun-V1.5-5b-InP, a 5-billion parameter 3D Transformer-based video generation model from Alibaba PAI.
- The system introduces a 4-value ‘quadmask’ (0, 63, 127, 255) to encode the primary object, overlap regions, interaction-affected areas, and background for structured scene understanding.
- The model employs a two-pass inference pipeline where Pass 2 uses optical flow-warped latents to correct ‘object morphing’ artifacts common in video diffusion.
- The model was trained on synthetic counterfactual data from HUMOTO (human-object interactions) and Google’s Kubric, where physics simulations were re-run after object removal to provide ground truth.
- Technical specifications include a default resolution of 384x672, support for up to 197 frames, and the use of BF16 with FP8 quantization for memory efficiency.
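To make the quadmask idea concrete, here is a minimal sketch of how two ordinary binary masks could be merged into a single four-value mask. The mapping of values to regions (0 for the primary object, 63 for overlap, 127 for interaction-affected areas, 255 for background) is an assumption read off the order in which they are listed above, and `build_quadmask` is a hypothetical helper, not part of the VOID codebase.

```python
import numpy as np

# Assumed mapping, taken from the listing order in the summary above.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255


def build_quadmask(object_mask: np.ndarray, affected_mask: np.ndarray) -> np.ndarray:
    """Merge two same-shaped binary masks into one uint8 quadmask.

    object_mask:   True where the object to remove is visible.
    affected_mask: True where the scene should change after removal.
    """
    object_mask = object_mask.astype(bool)
    affected_mask = affected_mask.astype(bool)
    if object_mask.shape != affected_mask.shape:
        raise ValueError("masks must share a shape")

    # Start from "keep everything", then mark progressively more specific regions.
    quad = np.full(object_mask.shape, BACKGROUND, dtype=np.uint8)
    quad[affected_mask] = AFFECTED
    quad[object_mask] = PRIMARY
    quad[object_mask & affected_mask] = OVERLAP
    return quad


# One row of four pixels: object only, both, affected only, neither.
obj = np.array([[1, 1, 0, 0]], dtype=bool)
aff = np.array([[0, 1, 1, 0]], dtype=bool)
quadmask = build_quadmask(obj, aff)
```

For video, the same function would be applied per frame, or to a stacked array with a leading frame axis, since the operations are element-wise.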
Practical Applications
- VFX Production: Automating the removal of actors or props while maintaining physically consistent interactions, such as objects falling or surfaces reacting to weight changes.
- Pitfall: Relying solely on Pass 1 for complex trajectories may cause the object to deform; Pass 2 is required to anchor shape stability via flow-warped noise initialization.
- Synthetic Data Generation: Using the HUMOTO Blender re-simulation method to create high-fidelity, physically correct paired video datasets for training other vision models.
- Pitfall: Using standard binary masks rather than quadmasks prevents the model from identifying which specific scene elements should be physically modified vs. kept static.
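The flow-warped initialization mentioned in the first pitfall can be illustrated with a toy example. The sketch below is not VOID's implementation: it assumes a dense flow field is already available (here it is written by hand rather than estimated), uses nearest-neighbor sampling for simplicity, and `warp_nearest` is a hypothetical helper. It shows only the core idea that noise for a later frame is sampled from the earlier frame along the motion, so the same noise pattern follows a moving object.

```python
import numpy as np


def warp_nearest(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp a (H, W) or (H, W, C) array with a (H, W, 2) flow field.

    flow[y, x] = (dx, dy) means output pixel (x, y) samples the input at
    (x + dx, y + dy), rounded to the nearest pixel and clamped to the frame.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]


rng = np.random.default_rng(0)
noise_t0 = rng.standard_normal((4, 6))  # initial noise for frame 0

# Hand-written flow: every output pixel samples from one pixel to its right.
flow = np.zeros((4, 6, 2))
flow[..., 0] = 1.0

# Noise for frame 1 reuses frame 0's noise, shifted along the flow.
noise_t1 = warp_nearest(noise_t0, flow)
```

A production pipeline would more plausibly warp latents with an estimated optical flow and interpolated sampling; the nearest-neighbor version is chosen here only because it keeps the example short and easy to verify.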