Netflix AI Open-Sources VOID: Physics-Aware Video Object Removal

Netflix AI Team Just Open-Sourced VOID: an AI Model That Erases Objects From Videos — Physics and All

Netflix AI and INSAIT have released VOID (Video Object and Interaction Deletion), an open-source model that accounts for physical causality in video editing. Rather than only filling in the pixels behind a deleted object, it generates the physical consequences of the removal: for example, when the person holding a guitar is removed from the scene, the guitar falls naturally.

Why This Matters

Standard video inpainting models act as sophisticated background painters: they fill in pixels but cannot reason about physical causality. When an object is removed, existing models often leave secondary props defying gravity or ignore the scene changes that should follow, which can require weeks of manual VFX work to fix. VOID addresses this gap by moving beyond pixel filling to scene understanding. By incorporating physical interactions into the diffusion process, it is designed to remove the need for manual correction of the shadows, reflections, and gravity-driven movements that follow an object's deletion.

Key Insights

VOID is built on CogVideoX-Fun-V1.5-5b-InP, a 5-billion-parameter, 3D Transformer-based video generation model from Alibaba PAI.
The system introduces a 4-value ‘quadmask’ (0, 63, 127, 255) to encode the primary object, overlap regions, interaction-affected areas, and background for structured scene understanding.
The model employs a two-pass inference pipeline where Pass 2 uses optical flow-warped latents to correct ‘object morphing’ artifacts common in video diffusion.
The model was trained on synthetic counterfactual data from HUMOTO (human-object interactions) and Google’s Kubric: physics simulations were re-run after object removal to provide ground truth for what the scene should look like without the object.
Technical specifications include a default resolution of 384x672, support for up to 197 frames, and the use of BF16 with FP8 quantization for memory efficiency.
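
The quadmask encoding described above can be illustrated with a short NumPy sketch. This is not VOID's own preprocessing code: the function name build_quadmask is hypothetical, and the mapping of values to regions (0 = primary object, 63 = overlap, 127 = interaction-affected area, 255 = background) is an assumption taken from the order in which the regions are listed. The sketch combines two boolean masks for a single frame into one 4-value mask.

```python
import numpy as np

# Assumed value assignment, following the order in which the regions are listed.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255


def build_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two boolean (H, W) masks into one uint8 quadmask (hypothetical helper)."""
    if primary.shape != affected.shape:
        raise ValueError("masks must have the same shape")
    primary = primary.astype(bool)
    affected = affected.astype(bool)

    # Start with everything marked as background, then overwrite each region.
    quad = np.full(primary.shape, BACKGROUND, dtype=np.uint8)
    quad[affected] = AFFECTED
    quad[primary] = PRIMARY
    # Pixels claimed by both masks get the dedicated overlap value.
    quad[primary & affected] = OVERLAP
    return quad


# Toy 4x6 frame: the object to remove covers columns 0-2,
# the region it physically influences covers columns 2-4.
primary = np.zeros((4, 6), dtype=bool)
primary[1:3, 0:3] = True
affected = np.zeros((4, 6), dtype=bool)
affected[1:3, 2:5] = True

quad = build_quadmask(primary, affected)
print(quad)
```

In a real pipeline the two input masks would come from a segmentation step and would be built per frame; how VOID derives them is not covered here.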

Practical Applications

VFX Production: Automating the removal of actors or props while maintaining physically consistent interactions, such as objects falling or surfaces reacting to weight changes.
Pitfall: Relying solely on Pass 1 for complex trajectories can cause objects to deform; Pass 2 is needed to stabilize their shape via flow-warped noise initialization.
Synthetic Data Generation: Using the HUMOTO Blender re-simulation method to create high-fidelity, physically correct paired video datasets for training other vision models.
Pitfall: Using standard binary masks rather than quadmasks prevents the model from identifying which specific scene elements should be physically modified vs. kept static.
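
The idea behind flow-warped noise initialization can be sketched in a few lines of NumPy. This is a simplified stand-in, not VOID's Pass 2 implementation: the helper names warp_backward and propagate_noise are hypothetical, the flow convention (each output pixel stores the offset to its source pixel in the previous frame) is an assumption, and nearest-neighbor sampling is used purely for brevity. The point it shows is that noise carried along the motion field stays attached to the same moving content from frame to frame, instead of being resampled independently.

```python
import numpy as np


def warp_backward(src: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Nearest-neighbor backward warp (hypothetical helper).

    src:  (H, W, C) array for the previous frame.
    flow: (H, W, 2) array of (dx, dy), assumed to mean
          out[y, x] = src[y + dy, x + dx], clamped at the borders.
    """
    h, w = src.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return src[src_y, src_x]


def propagate_noise(first_noise: np.ndarray, flows: list) -> np.ndarray:
    """Build per-frame noise by repeatedly warping the previous frame's noise."""
    frames = [first_noise]
    for flow in flows:
        frames.append(warp_backward(frames[-1], flow))
    return np.stack(frames)  # (T, H, W, C)


# Toy example: content moves one pixel to the right in each frame,
# so every output pixel samples from its left neighbor (dx = -1).
rng = np.random.default_rng(0)
first = rng.standard_normal((3, 4, 2))
shift_right = np.zeros((3, 4, 2))
shift_right[..., 0] = -1.0

noise = propagate_noise(first, [shift_right, shift_right])
print(noise.shape)
```

Nearest-neighbor warping duplicates values at the borders and wherever several pixels map to the same source, so the result is no longer independent Gaussian noise everywhere; a production pipeline would need a more careful warping scheme.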

References:

https://www.marktechpost.com/2026/04/04/netflix-ai-team-just-open-sourced-void-an-ai-model-that-erases-objects-from-videos-physics-and-all/
