There is a natural correlation between the visual and auditory elements of a video. We leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
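For concreteness, here is a minimal PyTorch sketch of the kind of contrastive synchronization objective and negative-sampling curriculum described above. The function names, the margin value, and the exact curriculum schedule (epoch threshold and hard-negative fraction) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb, audio_emb, is_synced, margin=0.99):
    """Contrastive loss over (video, audio) embedding pairs.

    is_synced: float tensor of shape (batch,), 1.0 for in-sync pairs and
    0.0 for negatives. The margin value is illustrative, not the tuned
    hyper-parameter from the paper.
    """
    dist = F.pairwise_distance(video_emb, audio_emb)                 # (batch,)
    pos_term = is_synced * dist.pow(2)                               # pull positives together
    neg_term = (1.0 - is_synced) * F.relu(margin - dist).pow(2)      # push negatives apart
    return (pos_term + neg_term).mean()

def pick_negative(num_clips, clip_idx, epoch, hard_from_epoch=50):
    """Curriculum for negatives (schedule values are assumptions):
    early epochs use only 'easy' negatives (audio taken from a different
    video); later epochs mix in 'hard' negatives (audio from the same
    video, temporally shifted so it is out of sync)."""
    if epoch >= hard_from_epoch and torch.rand(()) < 0.25:
        return ("hard", clip_idx)                 # same clip, out-of-sync audio
    other = torch.randint(num_clips, ()).item()
    return ("easy", other)                        # audio from another video
```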
Accuracy on standard action recognition benchmarks. Using AVTS as a pretraining mechanism yields substantial gains over training from scratch: +19.9% on UCF101 and +17.7% on HMDB51.

| Model | Pretraining dataset | Pretraining supervision | UCF101 | HMDB51 |
|-------|---------------------|-------------------------|--------|--------|
| MC3 | none | none (trained from scratch) | 69.1 | 43.9 |
| MC3 | Audioset | self-supervised (AVTS) | 89.0 | 61.6 |
| MC2 | Kinetics | fully supervised (action labels) | 87.9 | 62.0 |
| MC3 | Kinetics | fully supervised (action labels) | 90.5 | 66.8 |
| I3D-RGB* | Imagenet | fully supervised (object labels) | 84.5 | 49.8 |
| I3D-RGB* | Kinetics | fully supervised (action labels) | 95.1 | 74.3 |
| I3D-RGB* | Kinetics + Imagenet | fully supervised (object+action labels) | 95.6 | 74.8 |

\* Results based on the reference implementation; the evaluation protocol may vary.
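To make the pretraining-versus-scratch comparison concrete, here is a minimal fine-tuning sketch. It assumes torchvision's `mc3_18` as a stand-in for the MC3 backbone and uses a dummy batch; it is not the repo's training script.

```python
import torch
import torch.nn as nn
from torchvision.models.video import mc3_18

# Stand-in MC3 backbone; pretrained AVTS weights would be loaded here
# (see the checkpoint-loading sketch further below) instead of training
# the network from scratch.
model = mc3_18()
model.fc = nn.Linear(model.fc.in_features, 101)  # new head for the 101 UCF101 classes

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Dummy batch of clips: (N, C, T, H, W); real input would be sampled frames.
clips = torch.randn(2, 3, 16, 112, 112)
labels = torch.randint(0, 101, (2,))

loss = criterion(model(clips), labels)  # one fine-tuning step
loss.backward()
optimizer.step()
```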
An SVM trained on our AVTS audio features achieves accuracy close to or above the state of the art on standard audio scene classification benchmarks.
| Model | Training dataset | ESC-50 | DCASE2014 |
|-------|------------------|--------|-----------|
| L3-Net | SoundNet | 79.3% | 93% |
| AVTS (ours) | Audioset | 82.3% | 94% |
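The evaluation protocol behind this table is a linear SVM on frozen AVTS audio features. A minimal scikit-learn sketch follows; the feature dimensionality and the random placeholder data are assumptions made only to keep the snippet self-contained.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Placeholder data: replace with activations extracted from the AVTS audio
# subnet on ESC-50 (50 classes) or DCASE2014 clips. Shapes are assumptions.
rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((400, 512)), rng.integers(0, 50, 400)
X_test, y_test = rng.standard_normal((100, 512)), rng.integers(0, 50, 100)

# Standardize features, then fit a linear SVM and report test accuracy.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X_train, y_train)
print(f"linear-SVM accuracy: {clf.score(X_test, y_test):.3f}")
```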
The trained AVTS models can be downloaded from the following links:
MC3-a_mvgg: AVTS model based on MC3/a_vgg architectures. [Download]
MC2-a_mvgg: AVTS model based on MC2/a_vgg architectures. [Download]
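A minimal sketch for loading one of these checkpoints into a video backbone, assuming the files are standard PyTorch checkpoints; the file name, the checkpoint dictionary layout, and the use of torchvision's `mc3_18` as the MC3 backbone are all assumptions and may need adapting to the released architecture.

```python
import torch
from torchvision.models.video import mc3_18

# Stand-in MC3 backbone (the released AVTS video subnet may use different
# layer names); num_classes=101 targets UCF101 fine-tuning.
model = mc3_18(num_classes=101)

# File name and dict layout are assumptions about the downloaded checkpoint.
ckpt = torch.load("MC3-a_mvgg.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)   # handle {'state_dict': ...} or a bare state dict

# strict=False tolerates head/audio-subnet keys that don't match the backbone.
missing, unexpected = model.load_state_dict(state, strict=False)
print(f"loaded with {len(missing)} missing / {len(unexpected)} unexpected keys")
```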
 "A Closer Look at Spatiotemporal Convolutions for Action Recognition" - Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri
 "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" - Joao Carreira, Andrew Zisserman
 "Look, Listen and Learn" - Relja Arandjelovic, Andrew Zisserman
This work was funded in part by NSF award CNS-120552. We gratefully acknowledge NVIDIA and Facebook for the donation of GPUs used for portions of this work.