Learn-to-Steer: NVIDIA’s 2025 Spatial Fix for Text-to-Image Diffusion

Why Spatial Reasoning Fails in Text-to-Image Diffusion

NVIDIA’s Learn-to-Steer, accepted to WACV 2026, addresses a critical weakness in text-to-image diffusion models: their inability to reliably place objects in scenes. Current models excel at what to generate but struggle with where, resulting in incorrect placements, missing entities, or object merging.

Traditional solutions like fine-tuning or handcrafted losses are either computationally expensive or brittle and prone to overfitting, failing to generalize to complex layouts. Learn-to-Steer offers a data-driven approach that steers diffusion at inference time without modifying model weights.

Key Insights

Spatial reasoning failures: Diffusion models often misplace objects or fail to render them at all.
Relation Leakage: A problem where linguistic cues in prompts influence relation classification, hindering accurate spatial understanding.
Cross-Attention as Signal: Leveraging cross-attention maps to understand how models connect text tokens to image regions.

Working Example

# Illustrative example - not runnable without a diffusion model and trained classifier
def steer_image(latent, subject_attention, object_attention, relation_classifier, desired_relation):
    """
    Steers a latent representation towards a desired spatial relation.

    Args:
        latent: The current latent representation.
        subject_attention: Cross-attention maps for the subject.
        object_attention: Cross-attention maps for the object.
        relation_classifier: Trained classifier for spatial relations.
        desired_relation: The target spatial relation (e.g., "left_of").

    Returns:
        A steered latent representation.
    """
    # Predict the current relation
    predicted_relation = relation_classifier(subject_attention, object_attention)

    # Calculate the loss
    loss = cross_entropy(predicted_relation, desired_relation)

    # Compute gradients and update the latent
    gradients = autograd.grad(loss, latent)
    steered_latent = latent - learning_rate * gradients

    return steered_latent

Practical Applications

Robotics: Generating scenes for robot training, ensuring objects are in reachable positions.
Pitfall: Relying on simple prompts without spatial qualifiers can lead to unpredictable object arrangements.

References:

https://dev.to/davidevans/what-is-learn-to-steer-nvidias-2025-spatial-fix-for-text-to-image-diffusion-4eoe

On This Page

Why Spatial Reasoning Fails in Text-to-Image Diffusion

Key Insights

Working Example

Practical Applications

Continue reading

Related Content

Training Text-to-Image Models: Key Takeaways from Ablations

Spatial Supersensing as the Core Capability for Multimodal AI Systems

Baidu Releases ERNIE-4.5-VL-28B-A3B-Thinking: An Open-Source and Compact Multimodal Reasoning Model Under the ERNIE-4.5 Family