Mind-the-Glitch

Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation

NeurIPS 2025 (Spotlight)

TL;DR

We introduce the first method to both quantify and spatially localize visual inconsistencies in subject-driven image generation by disentangling visual and semantic features from diffusion model backbones.

Abstract

We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence.

While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets.

To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types.

Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision-language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions.

To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task.

Dataset Generation Pipeline

We developed an automated pipeline to create high-quality image pairs with controlled visual inconsistencies for training our correspondence model.

Foundation Dataset

Starts with consistent subject pairs sourced from the Subjects200K dataset

Subject Segmentation

Uses Grounded-SAM to precisely segment subjects and CleanDIFT to compute semantic correspondences

Correspondence Extraction

Samples high-confidence correspondences and extracts localized regions via SAM for precise matching
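A minimal sketch of the sampling part of this step, assuming per-location descriptors (e.g., from CleanDIFT) have already been extracted and reshaped to [N, C]; the mutual-nearest-neighbour check and top-k selection below are a simplified stand-in, and the SAM region extraction is not shown.

import torch
import torch.nn.functional as F

def sample_correspondences(feat_a, feat_b, top_k=32):
    # feat_a, feat_b: [N, C] per-location descriptors of the two images
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = a @ b.t()                          # [N, N] cosine similarities
    best_ab = sim.argmax(dim=1)              # best match in B for every location in A
    best_ba = sim.argmax(dim=0)              # best match in A for every location in B
    idx = torch.arange(sim.size(0))
    mutual = best_ba[best_ab] == idx         # keep mutual nearest neighbours only
    conf = sim[idx, best_ab]                 # confidence = similarity of the match
    keep = torch.where(mutual)[0]
    keep = keep[conf[keep].argsort(descending=True)][:top_k]
    return keep, best_ab[keep], conf[keep]   # indices in A, indices in B, confidences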

Quality Filtering

Detects and filters ambiguous matches using skewness of similarity distributions to ensure data quality
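A minimal sketch of the skewness criterion (the threshold is a placeholder, not the value used in practice): a distinctive location yields a strongly right-skewed similarity distribution over candidate matches, while an ambiguous one yields a flat, low-skew distribution.

import numpy as np
from scipy.stats import skew

def is_unambiguous(similarities: np.ndarray, min_skew: float = 2.0) -> bool:
    # similarities: 1-D array of similarities from one query location to all targets
    return bool(skew(similarities) >= min_skew)   # placeholder threshold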

Controlled Inconsistencies

Applies diffusion-based inpainting (SDXL) to introduce controlled visual inconsistencies while preserving semantic content
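An illustrative call to diffusers' SDXL inpainting pipeline for this step; the checkpoint, prompt, file names, and strength below are placeholders rather than the exact settings used to build the dataset.

import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("subject.png").convert("RGB")
mask = Image.open("region_mask.png").convert("L")   # white = region to modify

edited = pipe(
    prompt="a different texture on the highlighted part",
    image=image,
    mask_image=mask,
    strength=0.9,   # how strongly the masked region is re-synthesised
).images[0]
edited.save("subject_inconsistent.png")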

Final Validation

Ensures quality via LPIPS filtering and size/aspect ratio constraints for consistent dataset standards
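A minimal sketch of the validation logic, assuming the edited region has been cropped from both images; all thresholds are placeholders.

import lpips
import torch

loss_fn = lpips.LPIPS(net="alex")   # perceptual distance model

def passes_validation(crop_before, crop_after, box_w, box_h,
                      min_lpips=0.1, min_side=64, max_aspect=3.0):
    # crop_before, crop_after: [1, 3, H, W] tensors scaled to [-1, 1]
    with torch.no_grad():
        dist = loss_fn(crop_before, crop_after).item()
    aspect = max(box_w, box_h) / max(min(box_w, box_h), 1)
    return dist >= min_lpips and min(box_w, box_h) >= min_side and aspect <= max_aspect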

Architecture

Our architecture leverages a frozen diffusion backbone with dual-branch processing to disentangle semantic and visual features for precise inconsistency detection.

Frozen Diffusion Backbone

Built on Stable Diffusion 2.1 with frozen weights to preserve pre-trained knowledge while extracting rich multi-layer decoder features
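A rough sketch of how such decoder features can be captured with forward hooks in diffusers; the timestep, resolution, prompt, and image path are illustrative, and the exact layers and noise level used are a separate design choice.

import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
unet = pipe.unet.requires_grad_(False)   # keep the backbone frozen

features = []
hooks = [blk.register_forward_hook(lambda m, i, o: features.append(o))
         for blk in unet.up_blocks]      # one hook per decoder (up) block

image = load_image("subject.png").resize((768, 768))
with torch.no_grad():
    latents = pipe.vae.encode(
        pipe.image_processor.preprocess(image).to("cuda", torch.float16)
    ).latent_dist.mean * pipe.vae.config.scaling_factor
    t = torch.tensor([261], device="cuda")          # illustrative timestep
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    prompt_embeds = pipe.encode_prompt("", "cuda", 1, False)[0]
    unet(noisy, t, encoder_hidden_states=prompt_embeds)

for h in hooks:
    h.remove()
# `features` now holds one tensor per decoder block, ready for aggregation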

Dual-Branch Architecture

Features are processed through two separate aggregation networks: one for semantic and one for visual features

Feature Aggregation

Each branch processes every backbone layer with its own ResNet blocks and combines the per-layer outputs using trainable scalar weights
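A minimal PyTorch sketch of one aggregation branch; the channel widths, block design, output resolution, and softmax over the scalar weights are our simplifications, not the exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)
    def forward(self, x):
        h = F.relu(self.conv1(x))
        return F.relu(self.conv2(h) + self.skip(x))

class AggregationBranch(nn.Module):
    def __init__(self, layer_channels, out_ch=256, out_size=64):
        super().__init__()
        self.blocks = nn.ModuleList(ResBlock(c, out_ch) for c in layer_channels)
        self.weights = nn.Parameter(torch.ones(len(layer_channels)))  # one scalar per layer
        self.out_size = out_size
    def forward(self, layer_feats):          # list of [B, C_l, H_l, W_l] tensors
        w = torch.softmax(self.weights, dim=0)
        maps = [F.interpolate(blk(f), size=self.out_size,
                              mode="bilinear", align_corners=False)
                for blk, f in zip(self.blocks, layer_feats)]
        return sum(wi * m for wi, m in zip(w, maps))   # [B, out_ch, out_size, out_size]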

Similarity Computation

Computes semantic and visual similarity matrices via dot products for correspondence matching
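In sketch form, assuming aggregated feature maps of shape [B, C, H, W] from each branch (we show normalised dot products, i.e., cosine similarities); applying this to the semantic and visual branches gives the two matrices.

import torch
import torch.nn.functional as F

def similarity_matrix(feat_a, feat_b):
    # feat_a, feat_b: [B, C, H, W] -> [B, H*W, H*W] pairwise similarities
    a = F.normalize(feat_a.flatten(2), dim=1)   # [B, C, H*W]
    b = F.normalize(feat_b.flatten(2), dim=1)
    return torch.einsum("bcm,bcn->bmn", a, b)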

Contrastive Training

Semantic loss encourages alignment everywhere, while visual loss enforces similarity outside inconsistent regions and dissimilarity inside
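An illustrative formulation of this objective only (the exact losses may be weighted or margin-based differently), assuming cosine similarities of matched location pairs and a binary mask marking pairs inside the edited region.

import torch

def contrastive_losses(sem_sim, vis_sim, inconsistent):
    # sem_sim, vis_sim: [M] similarities of the M matched location pairs
    # inconsistent:     [M] bool mask, True where the pair lies in an edited region
    sem_loss = (1 - sem_sim).mean()                   # align everywhere
    pull = (1 - vis_sim)[~inconsistent]               # align consistent pairs
    push = torch.clamp(vis_sim[inconsistent], min=0)  # repel inconsistent pairs
    vis_loss = torch.cat([pull, push]).mean()
    return sem_loss, vis_loss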

Results

We visualize the disentangled semantic and visual features to demonstrate the effectiveness of our approach.

Evaluating Subject-Driven Image Generation Approaches

To demonstrate the effectiveness of our approach at localizing and quantifying inconsistencies in real subject-driven image generation, we evaluate three recent approaches and show that our proposed correspondence-based metric, Visual Semantic Matching (VSM), aligns better with the ground-truth oracle than CLIP, DINO, and GPT-4o.
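In spirit, the comparison looks like the sketch below (not the exact VSM formula): a global metric reduces each image to a single embedding, while a correspondence-based score averages visual similarity at semantically matched locations and therefore also indicates where the inconsistency lies.

import torch
import torch.nn.functional as F

def global_score(emb_ref, emb_gen):
    # emb_ref, emb_gen: [D] image-level embeddings (e.g., CLIP or DINO)
    return F.cosine_similarity(emb_ref, emb_gen, dim=0)

def correspondence_score(vis_ref, vis_gen, matches):
    # vis_ref, vis_gen: [C, N] per-location visual features of the two images
    # matches:          [M, 2] index pairs (i in reference, j in generation) of semantic matches
    a = F.normalize(vis_ref[:, matches[:, 0]], dim=0)
    b = F.normalize(vis_gen[:, matches[:, 1]], dim=0)
    per_match = (a * b).sum(dim=0)          # [M] visual similarity per matched pair
    return per_match.mean(), per_match      # scalar score + map usable for localisation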

Reference

@inproceedings{eldesokey2025mindtheglitch,
  title={Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation},
  author={Eldesokey, Abdelrahman and Cvejic, Aleksandar and Ghanem, Bernard and Wonka, Peter},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}