Measuring Image Distances via Embedding in a Semantic Manifold
In this work we introduce novel image metrics that can be used with distance-based classifiers or directly to decide whether two input images belong to the same class. While most prior image distances rely purely on comparisons of low-level features extracted from the inputs, our metrics use a large database of labeled photos as auxiliary data to draw semantic relationships between the two images, beyond those computable from simple visual features. In a preprocessing stage our approach derives a semantic image graph from the labeled dataset, where the nodes are the labeled images and the edges connect pictures with related labels. The graph can be viewed as modeling a semantic image manifold, and it enables the use of graph distances to approximate semantic distances. Thus, we reformulate the task of measuring the semantic distance between two unlabeled pictures as the problem of embedding the two input images in the semantic graph. We propose and evaluate several embedding schemes and graph distance metrics. Our results on Caltech101, Caltech256 and ImageNet show that our distances consistently match or outperform the state-of-the-art in this field.
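The graph idea in the abstract can be illustrated with a toy sketch. It is not the construction used in the paper: the function names, the `related` label-pair set, the k-nearest linking rule and the use of Euclidean edge weights are all assumptions made for illustration. Shortest-path length stands in here for a graph distance.

```python
import heapq
import math

def build_semantic_graph(features, labels, related, k=1):
    """Link each image to its k visually nearest images with related labels.

    `related` is a hypothetical set of label pairs considered semantically
    related; identical labels are always treated as related.
    """
    n = len(features)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        candidates = []
        for j in range(n):
            if j == i:
                continue
            a, b = labels[i], labels[j]
            if a == b or (a, b) in related or (b, a) in related:
                candidates.append((math.dist(features[i], features[j]), j))
        for d, j in sorted(candidates)[:k]:
            graph[i][j] = d  # undirected edge weighted by visual distance
            graph[j][i] = d
    return graph

def graph_distance(graph, src, dst):
    """Shortest-path length between two nodes (Dijkstra); inf if unreachable."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

# Toy database: 2-D "descriptors" with made-up labels.
features = [(0, 0), (1, 0), (2, 0), (10, 0)]
labels = ["cat", "cat", "dog", "car"]
related = {("cat", "dog")}
graph = build_semantic_graph(features, labels, related, k=1)
```

In this toy database the "car" image has no related labels and therefore stays disconnected, so its graph distance to every other node is infinite even though a visual distance to it is always defined.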
Chen Fang, Lorenzo Torresani. Measuring Image Distances via Embedding in a Semantic Manifold. ECCV, 2012.
Fig.1. Conceptual illustration of our embedding method. During an offline stage a semantic image graph is constructed using a labeled database: links are created between images that satisfy the joint condition of being visually close and having related class labels. At test time, the unlabeled photo is embedded in the graph via a two-step process: first, visual neighbors are found; then, the position of the test image in the graph is computed by finding visual neighbors that are semantically coherent.
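The two-step embedding described in the caption can be sketched roughly as follows. This is only a crude stand-in: the function `embed_in_graph` is hypothetical, and "semantic coherence" is approximated by agreement with the majority label among the visual neighbors, which is an assumption of this sketch and much simpler than the embedding schemes evaluated in the paper.

```python
import math
from collections import Counter

def embed_in_graph(test_feature, db_features, db_labels, k=3):
    """Return the indices of database images used as anchors for a test image."""
    # Step 1: find the k visually nearest database images.
    ranked = sorted(
        range(len(db_features)),
        key=lambda i: math.dist(test_feature, db_features[i]),
    )
    neighbors = ranked[:k]
    # Step 2: keep only neighbors that agree with the dominant label
    # (a simplistic proxy for "semantically coherent").
    dominant, _ = Counter(db_labels[i] for i in neighbors).most_common(1)[0]
    return [i for i in neighbors if db_labels[i] == dominant]

# Toy database: 2-D "descriptors" with made-up labels.
db_features = [(0, 0), (1, 0), (0, 1), (5, 5)]
db_labels = ["cat", "cat", "dog", "car"]
anchors = embed_in_graph((0.2, 0.1), db_features, db_labels, k=3)
```

Here the three visual neighbors carry the labels cat, cat and dog, so the dog image is discarded and the test image is attached to the two cat nodes.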
Fig.2. Caltech25 multiclass recognition using a NN classifier based on different image metrics, computed from (a) GIST and (b) classeme descriptors (used for both the unlabeled test images and the auxiliary dataset images). Our metrics based on embedding in the graph are SEO and RW. CH is the ImageNet metric proposed by Deselaers and Ferrari. Our RW metric consistently gives the best results: it even outperforms the LMNN metric, which in this experiment has the advantage of being trained on the test categories. L2 is the Euclidean distance between visual features.
Fig.3. Caltech256 performance of nonlinear SVMs trained with different kernels using (a) GIST and (b) classeme features: expRW and expSEO denote kernels constructed from our RW and SEO distances; expCH is the kernel induced by the CH distance of Deselaers and Ferrari; exp-L2 and Gaussian are the exponential and Gaussian kernels computed from the L2 visual distances; linear indicates the linear SVM learned using the dot-product kernel. The training set consists of 15 examples per class.
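One common way to turn a distance into a kernel of the "exp" kind mentioned in the caption is K(i, j) = exp(-gamma * d(i, j)). The sketch below assumes this form and a mean-distance heuristic for gamma; neither is confirmed by this page, and the function name is hypothetical.

```python
import math

def exponential_kernel(dist, gamma=None):
    """Build K[i][j] = exp(-gamma * dist[i][j]) from a square distance matrix.

    If gamma is not given, it defaults to 1 / (mean off-diagonal distance),
    a common heuristic assumed here for illustration.
    """
    n = len(dist)
    if gamma is None:
        off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
        gamma = 1.0 / (sum(off_diag) / len(off_diag))
    return [[math.exp(-gamma * dist[i][j]) for j in range(n)] for i in range(n)]

# Toy symmetric distance matrix for three images.
D = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
K = exponential_kernel(D, gamma=1.0)
```

A matrix built this way can be passed to an SVM implementation that accepts precomputed kernels. Note that for a general graph distance the resulting matrix is not guaranteed to be positive semidefinite, so it is worth checking in practice.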
Code and Data:
and semantic image graphs (6.5 GB)
We are grateful to S. Nowozin and C. Rother for useful discussion on strategies to optimize our SEO energy and to T. Deselaers and V. Ferrari for sharing data. Thanks to A. Bergamo for help with the experiments.
Torresani, L., Szummer, M., Fitzgibbon, A.: Efficient object category recognition using classemes. ECCV (2010)
Deselaers, T., Ferrari, V.: Visual and semantic similarity in ImageNet. CVPR (2011)
Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (2009)