Abstract

This paper introduces EXMOVES, learned exemplar-based features for efficient recognition of actions in videos. The entries in our descriptor are produced by evaluating a set of movement classifiers over spatio-temporal volumes of the input sequence. Each movement classifier is a simple exemplar-SVM trained on low-level features, i.e., an SVM learned using a single annotated positive space-time volume and a large number of unannotated videos.
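
The following is a minimal sketch, not the released implementation, of how one entry of such a descriptor could be computed: a linear exemplar-SVM (w, b) is scored on low-level features pooled from each spatio-temporal volume, and the responses are max-pooled into a single value. The function names, the feature layout, and the toy data are all hypothetical placeholders.

import numpy as np

def exmove_entry(volume_features, w, b):
    # volume_features: (n_volumes, d) array, one low-level descriptor
    # (e.g., a bag of quantized HOG-HOF STIPs) per space-time volume.
    # w, b: weights and bias of one exemplar-SVM.
    scores = volume_features @ w + b   # classifier score for every volume
    return scores.max()                # keep the strongest response

def exmoves_descriptor(volume_features, exemplar_svms):
    # Stack the pooled responses of all exemplar-SVMs into one vector:
    # one descriptor entry per learned "exemplar movement".
    return np.array([exmove_entry(volume_features, w, b)
                     for (w, b) in exemplar_svms])

# Toy usage: 50 volumes with 128-dim features, 10 exemplar movements.
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 128))
svms = [(rng.normal(size=128), 0.0) for _ in range(10)]
print(exmoves_descriptor(feats, svms).shape)   # -> (10,)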

Our representation offers two main advantages. First, since our mid-level features are learned from individual video exemplars, they require minimal supervision. Second, we show that simple linear classification models trained on our global video descriptor yield action recognition accuracy approaching the state-of-the-art but at orders of magnitude lower cost, since at test time no sliding window is necessary and linear models are efficient to train and test. This enables scalable action recognition, i.e., efficient classification of a large number of actions even in massive video databases. We show the generality of our approach by building our mid-level descriptors from two different low-level feature vectors. The accuracy and efficiency of the approach are demonstrated on several large-scale action recognition benchmarks.
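
A hedged sketch of this recognition stage, under the assumption that each video has already been summarized by its global EXMOVES descriptor: a plain linear SVM suffices for classification. The toy data is fabricated for illustration, and scikit-learn's LinearSVC is one possible linear model, not necessarily the one used in the paper.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))    # EXMOVES descriptors (toy data)
y_train = rng.integers(0, 5, size=200)  # action labels for 5 classes
X_test = rng.normal(size=(20, 10))

clf = LinearSVC(C=1.0).fit(X_train, y_train)  # linear models train fast
pred = clf.predict(X_test)                    # no sliding window at test time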

Results

Action classification accuracy of EXMOVES built on different low-level features, such as HOG-HOF STIPs [1] and Dense Trajectories [2], compared with other mid-level representations: Action Bank [3] and Discriminative Patches [4].

Paper

Du Tran, Lorenzo Torresani, EXMOVES: Classifier-based Features for Scalable Action Recognition, International Conference on Learning Representations (ICLR), 2014 [pdf].

Software

Software to extract EXMOVES is available here.

Acknowledgment

This material is based upon work supported by the National Science Foundation (NSF) under CAREER award IIS-0952943 and NSF award CNS-1205521. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

References

[1] I. Laptev, On space-time interest points, International Journal of Computer Vision, 2005.
[2] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, Dense trajectories and motion boundary descriptors for action recognition, International Journal of Computer Vision, 2013.
[3] S. Sadanand and J. Corso, Action bank: A high-level representation of activity in video, IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[4] A. Jain, A. Gupta, M. Rodriguez, and L. Davis, Representing videos using mid-level discriminative patches, IEEE Conference on Computer Vision and Pattern Recognition, 2013.

* EXMOVES stands for "Exemplar Movements".