Stop Thinking, Just Do!

Sungsoo Kim's Blog

Relational Visual Similarity


16 December 2025


Relational Visual Similarity (Dec 2025)

Abstract

Summary:

This paper introduces “relational visual similarity,” a novel framework for measuring image similarity based on underlying logical structures, functions, or interactions rather than surface-level visual attributes like color or shape. While existing models (e.g., CLIP, LPIPS) excel at perceptual similarity, they fail to capture abstract analogies (e.g., recognizing that the layers of a peach correspond to the layers of the Earth). To address this, the authors curate a dataset of 114k images paired with “anonymous captions”—descriptions that abstract away specific object identities to focus on relational patterns. They finetune a Vision-Language Model (Qwen2.5-VL) on this dataset to create a new metric, relsim. The study demonstrates that relsim significantly outperforms existing baselines in capturing relational logic for tasks such as image retrieval and analogical image generation.
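Retrieval with a similarity score like relsim reduces to ranking a gallery by the score against the query. As a minimal sketch (not the paper's implementation), the following uses cosine similarity over toy stand-in vectors in place of the fine-tuned model's relational embeddings; all names and values are hypothetical:

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, gallery):
    # gallery: list of (name, embedding); rank by similarity, descending
    return sorted(gallery, key=lambda item: cosine(query_vec, item[1]),
                  reverse=True)

# Toy vectors standing in for relational features: a "peach cross-section"
# query should match "earth_layers" (shared concentric-layer structure)
# rather than the perceptually closer "red_apple".
query = [1.0, 0.0, 1.0]
gallery = [
    ("earth_layers", [0.9, 0.1, 0.95]),
    ("red_apple",    [0.0, 1.0, 0.0]),
]
ranked = retrieve(query, gallery)
print(ranked[0][0])  # the relationally similar image ranks first
```

In the paper's setting, the embeddings would come from the Qwen2.5-VL model fine-tuned on anonymous captions, so that proximity reflects shared relational structure rather than shared appearance.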

Key Topics:

  • Relational Visual Similarity
  • Analogical Reasoning
  • Vision-Language Models
  • Anonymous Captioning
  • Image Retrieval
  • Visual Abstraction
  • Synthetic Dataset Generation