We developed an automated pipeline to create high-quality image pairs with controlled visual inconsistencies for training our correspondence model.
Foundation Dataset
Starts from consistent subject pairs sourced from the Subjects200K dataset
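As a minimal sketch of this step, assuming the dataset is hosted on the Hugging Face hub as Yuanshi/Subjects200K and that each sample stores the two subject views side by side in one composite image (both the repo id and the field name are assumptions, not confirmed above):

```python
# Sketch: source consistent subject pairs from Subjects200K.
# ASSUMPTIONS: hub id "Yuanshi/Subjects200K" and an "image" field holding
# a side-by-side composite; adapt to the actual dataset schema.
from datasets import load_dataset

dataset = load_dataset("Yuanshi/Subjects200K", split="train")

def split_pair(sample):
    """Split a side-by-side composite into its (left, right) subject views."""
    image = sample["image"]
    w, h = image.size
    return image.crop((0, 0, w // 2, h)), image.crop((w // 2, 0, w, h))

img_a, img_b = split_pair(dataset[0])
```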
Subject Segmentation
Uses Grounded-SAM to segment subjects precisely and CleanDIFT to compute dense semantic correspondences between the paired views
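A sketch of the segmentation half of this step using the Hugging Face ports of Grounding DINO and SAM; the checkpoint names, prompt, and thresholds are assumptions, and CleanDIFT feature extraction is left abstract since its API is not specified here (its descriptors feed the correspondence sampling below).

```python
# Sketch: text-prompted detection (Grounding DINO) followed by box-prompted
# segmentation (SAM), i.e. the Grounded-SAM recipe.
# ASSUMPTIONS: checkpoints and thresholds below are illustrative.
import torch
from PIL import Image
from transformers import (AutoProcessor, GroundingDinoForObjectDetection,
                          SamModel, SamProcessor)

det_proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
detector = GroundingDinoForObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base")
sam_proc = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam = SamModel.from_pretrained("facebook/sam-vit-base")

def segment_subject(image: Image.Image, prompt: str) -> torch.Tensor:
    # Grounding DINO expects a lowercase prompt ending in a period, e.g. "a mug."
    inputs = det_proc(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        det_out = detector(**inputs)
    result = det_proc.post_process_grounded_object_detection(
        det_out, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
        target_sizes=[image.size[::-1]])[0]
    box = result["boxes"][result["scores"].argmax()].tolist()

    # Feed the best box to SAM as a prompt and keep its highest-IoU mask.
    sam_inputs = sam_proc(image, input_boxes=[[box]], return_tensors="pt")
    with torch.no_grad():
        sam_out = sam(**sam_inputs)
    masks = sam_proc.image_processor.post_process_masks(
        sam_out.pred_masks, sam_inputs["original_sizes"],
        sam_inputs["reshaped_input_sizes"])
    return masks[0][0, sam_out.iou_scores[0, 0].argmax()]
```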
Correspondence Extraction
Samples high-confidence correspondences and uses SAM to extract a localized region around each match for precise matching
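The sampling itself can be sketched as confidence-ranked nearest-neighbor matching over dense descriptors (e.g., the CleanDIFT features from the previous step); the feature shapes and top_k value are assumptions.

```python
# Sketch: rank candidate correspondences between two dense feature maps by
# cosine similarity and keep the most confident ones.
# ASSUMPTIONS: feats_a/feats_b are (C, H, W) descriptors (e.g., CleanDIFT);
# top_k is illustrative.
import torch
import torch.nn.functional as F

def sample_correspondences(feats_a, feats_b, top_k=32):
    """Return the top_k matches as ((ya, xa), (yb, xb), score)."""
    C, Ha, Wa = feats_a.shape
    _, Hb, Wb = feats_b.shape
    fa = F.normalize(feats_a.reshape(C, -1), dim=0)  # (C, Ha*Wa)
    fb = F.normalize(feats_b.reshape(C, -1), dim=0)  # (C, Hb*Wb)
    sim = fa.T @ fb                                  # all-pairs cosine sim
    best, idx_b = sim.max(dim=1)                     # best target per source
    order = best.argsort(descending=True)[:top_k]    # most confident first
    return [((ia // Wa, ia % Wa),
             (idx_b[ia].item() // Wb, idx_b[ia].item() % Wb),
             best[ia].item())
            for ia in order.tolist()]

# Each matched coordinate can then be upscaled to pixel space and used as a
# SAM point prompt to cut out the localized region around the match.
```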
Quality Filtering
Detects and filters ambiguous matches using the skewness of each match's similarity distribution, so only distinctive correspondences enter the dataset
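The intuition is that a distinctive match yields a similarity distribution with one sharp peak over a low background (strong positive skew), while repeated textures yield flatter, low-skew distributions. A minimal sketch; the threshold value is an assumption.

```python
# Sketch: reject ambiguous matches whose similarity distribution is not
# sufficiently right-skewed. ASSUMPTION: min_skewness=1.0 is illustrative.
from scipy.stats import skew

def is_unambiguous(sim_row, min_skewness=1.0):
    """sim_row: 1-D array of similarities from one source point to every
    target position (one row of the sim matrix above)."""
    return skew(sim_row) >= min_skewness
```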
Controlled Inconsistencies
Applies diffusion-based inpainting (SDXL) to introduce controlled visual inconsistencies while preserving semantic content
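A sketch of this step with the diffusers SDXL inpainting pipeline; the checkpoint id, prompt handling, and strength are assumptions rather than the settings used above.

```python
# Sketch: re-synthesize only the masked region so the pair stays
# semantically aligned but diverges visually inside it.
# ASSUMPTIONS: checkpoint id and sampling settings are illustrative.
import torch
from diffusers import StableDiffusionXLInpaintPipeline

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16).to("cuda")

def inject_inconsistency(image, region_mask, prompt):
    """image/region_mask: PIL images of the same size; white mask pixels
    mark the region to repaint."""
    return pipe(prompt=prompt, image=image, mask_image=region_mask,
                strength=0.9, guidance_scale=7.5).images[0]
```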
Final Validation
Enforces consistent dataset standards via LPIPS filtering and size/aspect-ratio constraints
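A sketch of the validation gate using the lpips package; the LPIPS band and the size/aspect-ratio limits are assumptions (the point is only that near-duplicates and destroyed content are both rejected).

```python
# Sketch: final quality gate combining perceptual distance and geometry checks.
# ASSUMPTIONS: all numeric limits below are illustrative placeholders.
import lpips
import torch

loss_fn = lpips.LPIPS(net="alex")  # AlexNet-backed perceptual metric

def passes_validation(t0, t1, size, min_side=256, max_ratio=2.0,
                      lpips_min=0.05, lpips_max=0.6):
    """t0/t1: (1, 3, H, W) tensors scaled to [-1, 1]; size: (W, H) pixels."""
    w, h = size
    if min(w, h) < min_side or max(w, h) / min(w, h) > max_ratio:
        return False
    d = loss_fn(t0, t1).item()
    # Too low: the inpainting changed nothing; too high: content destroyed.
    return lpips_min <= d <= lpips_max
```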