Most current image categorization methods require large collections of manually annotated training examples to learn accurate visual recognition models. The time-consuming human labeling effort effectively limits these approaches to recognition problems involving a small number of object classes. To address this shortcoming, several authors have recently proposed learning object classifiers from weakly-labeled Internet images, such as photos retrieved by keyword-based image search engines. While this strategy eliminates the need for human supervision, the recognition accuracies of these methods are considerably lower than those obtained with fully-supervised approaches, because the labels associated with Web data are noisy.
In this paper we investigate and compare methods that learn image classifiers by combining very few manually annotated examples (e.g., 1-10 images per class) and a large number of weakly-labeled Web photos retrieved using keyword-based image search. We cast this as a domain adaptation problem: given a few strongly-labeled examples in a target domain (the manually annotated examples) and many source domain examples (the weakly-labeled Web photos), learn classifiers yielding small generalization error on the target domain. Our experiments demonstrate that, for the same number of strongly-labeled examples, our domain adaptation approach produces significant recognition rate improvements over the best published results (e.g., 65% better when using 5 labeled training examples per class) and that our classifiers are one order of magnitude faster to learn and to evaluate than the best competing method, despite our use of large weakly-labeled data sets.
The figure above shows the categorization accuracy of different methods on the Caltech256 benchmark.
SVMt, SVMs, and SVMsUt are three algorithms that are not based on domain adaptation; we use them as comparative baselines. SVMt is a linear SVM learned exclusively from the target examples. SVMs denotes an SVM learned from the source examples, assuming no outliers are present in the image search results. SVMsUt is a linear SVM trained on the union of the target and source examples.
MIXSVM is the classifier obtained by a convex combination of the two SVM hypotheses learned independently from the source and target data.
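The convex combination described for MIXSVM can be sketched as follows. This is a minimal illustration, not the authors' code: the mixing weight `beta`, the `(weights, bias)` model representation, and the function names are all assumptions introduced here.

```python
from typing import Sequence, Tuple

Model = Tuple[Sequence[float], float]  # hypothetical (weights, bias) pair


def linear_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Decision value w.x + b of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mix_score(x: Sequence[float], source_model: Model,
              target_model: Model, beta: float) -> float:
    """Convex combination of the two independently learned hypotheses.

    beta is an assumed mixing weight in [0, 1]:
    beta = 1 uses only the target model, beta = 0 only the source model.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    ws, bs = source_model
    wt, bt = target_model
    return (beta * linear_score(wt, bt, x)
            + (1.0 - beta) * linear_score(ws, bs, x))
```

In a setup like this, `beta` would typically be chosen by validation on held-out target examples; that choice is also an assumption, not something stated on this page.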
DWSVM is the classifier trained on both the source and the target examples, weighting the two domains differently in the learning objective to encode their relative importance.
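One way to picture a domain-weighted objective is a standard hinge-loss SVM objective in which the source and target loss terms carry different weights. The sketch below assumes the form 0.5·||w||² + C·[(1 − beta)·(source loss) + beta·(target loss)]; the exact parameterization used for DWSVM is not given on this page, so `C`, `beta`, and the function names are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, label in {-1, +1})


def hinge(y: int, score: float) -> float:
    """Standard hinge loss max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y * score)


def domain_weighted_objective(w: Sequence[float], b: float,
                              source: List[Example], target: List[Example],
                              C: float, beta: float) -> float:
    """Assumed objective: regularizer plus per-domain weighted hinge losses.

    beta in [0, 1] is a hypothetical knob: larger values give more weight
    to the (few) strongly-labeled target examples.
    """
    def score(x: Sequence[float]) -> float:
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    reg = 0.5 * sum(wi * wi for wi in w)
    loss_s = sum(hinge(y, score(x)) for x, y in source)
    loss_t = sum(hinge(y, score(x)) for x, y in target)
    return reg + C * ((1.0 - beta) * loss_s + beta * loss_t)
```

Minimizing this quantity over `w` and `b` with any SVM solver that supports per-example weights would give a classifier of this kind; the function here only evaluates the objective.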
AUGSVM is a linear SVM trained on the union of augmented versions of the source and target examples.
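The page does not spell out the augmentation. A well-known scheme from the domain adaptation literature replicates each feature vector into a shared block and a domain-specific block; the sketch below assumes that scheme, which may differ from the mapping actually used for AUGSVM. The function names are hypothetical.

```python
from typing import List, Sequence


def augment(x: Sequence[float], domain: str) -> List[float]:
    """Assumed feature-replication mapping (not confirmed by this page).

    source: x -> [x, x, 0]   (shared block, source block, empty target block)
    target: x -> [x, 0, x]   (shared block, empty source block, target block)
    """
    x = list(x)
    zeros = [0.0] * len(x)
    if domain == "source":
        return x + x + zeros
    if domain == "target":
        return x + zeros + x
    raise ValueError("domain must be 'source' or 'target'")


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))
```

Under this mapping, two examples from the same domain have twice the inner product of their original features, while a cross-domain pair keeps the original inner product, so a linear SVM trained on the augmented data treats same-domain examples as more similar.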
TSVM: transductive learning of the labels of the source data during training. The key idea is to exploit the strongly-labeled target training data to determine the correct labels of the source training examples and, at the same time, to use this labeling information to improve the classifier.
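The label-inference half of this idea can be illustrated with the textbook transductive hinge loss, where each source example is charged the smaller of the two hinge losses it would incur under either label. This is a generic sketch under that assumption; the actual TSVM objective and any constraints it places on the source labels are not described on this page, and the function names are hypothetical.

```python
from typing import List, Sequence


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Decision value w.x + b of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def infer_source_labels(w: Sequence[float], b: float,
                        source_xs: List[Sequence[float]]) -> List[int]:
    """Label minimizing the hinge loss under the current classifier,
    i.e. the sign of the decision value (ties go to +1)."""
    return [1 if score(w, b, x) >= 0.0 else -1 for x in source_xs]


def transductive_loss(w: Sequence[float], b: float,
                      source_xs: List[Sequence[float]]) -> float:
    """Sum over source examples of min over y in {-1, +1} of hinge(y, score),
    which equals max(0, 1 - |score|)."""
    return sum(max(0.0, 1.0 - abs(score(w, b, x))) for x in source_xs)
```

A full training procedure would alternate between inferring source labels like this and re-fitting the classifier on the target examples plus the relabeled source examples; that outer loop is omitted here.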
Manual annotation saving: the plot shows, for a varying number of labeled examples given to TSVM, the number of additional labeled images that SVMt would need to achieve the same accuracy.
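A quantity of this kind can be computed from two accuracy curves by interpolation. The sketch below assumes the SVMt curve is available as sorted `(num_labeled, accuracy)` points with non-decreasing accuracy and uses linear interpolation between them; both the function names and the interpolation choice are assumptions, and the numbers in the usage lines are made-up toy values, not results from the experiments.

```python
from typing import List, Optional, Tuple


def examples_needed(curve: List[Tuple[float, float]],
                    target_acc: float) -> Optional[float]:
    """Interpolated number of labeled examples at which the curve first
    reaches target_acc, or None if it never does.

    curve: (num_labeled, accuracy) points sorted by num_labeled.
    """
    prev_n, prev_a = curve[0]
    if prev_a >= target_acc:
        return float(prev_n)
    for n, a in curve[1:]:
        if a >= target_acc:
            # Here prev_a < target_acc <= a, so the denominator is positive.
            frac = (target_acc - prev_a) / (a - prev_a)
            return prev_n + frac * (n - prev_n)
        prev_n, prev_a = n, a
    return None


def annotation_saving(baseline_curve: List[Tuple[float, float]],
                      n_labeled: float, accuracy: float) -> Optional[float]:
    """Additional labeled examples the baseline would need to match
    `accuracy`, which the other method reached with `n_labeled` examples."""
    needed = examples_needed(baseline_curve, accuracy)
    return None if needed is None else needed - n_labeled


# Toy illustration only (hypothetical values, not measured accuracies).
toy_baseline = [(5, 0.10), (10, 0.20), (20, 0.30)]
toy_saving = annotation_saving(toy_baseline, n_labeled=5, accuracy=0.25)
```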
Both the features in Matlab format and the original JPEG images are available for download. Please read the README file.
The following software has been used to perform the experiments:
This material is based upon work supported by the National Science Foundation under CAREER award IIS-0952943. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).