Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization


Bruno Korbar, Du Tran, Lorenzo Torresani

Department of Computer Science, Dartmouth College

Facebook AI


There is a natural correlation between the visual and audio elements of a video. We leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients for obtaining powerful multi-sensory representations from models optimized to discern the temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
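The synchronization objective mentioned above can be illustrated with a small sketch: in-sync audio-video pairs are pulled together in embedding space, while out-of-sync pairs are pushed apart by a margin. This minimal NumPy version is an assumption-laden illustration, not the paper's exact formulation (the margin value `eta` and the squared-distance form are assumed):

```python
import numpy as np

def contrastive_loss(v_emb, a_emb, y, eta=0.99):
    """Contrastive synchronization loss over a batch of audio-video pairs.

    v_emb, a_emb: (N, D) embeddings from the video and audio subnets.
    y: (N,) labels, 1 for in-sync (positive) pairs, 0 for out-of-sync
       (negative) pairs. eta: margin for negatives (assumed value).
    """
    d = np.linalg.norm(v_emb - a_emb, axis=1)       # per-pair Euclidean distance
    pos = y * d ** 2                                # pull synced pairs together
    neg = (1 - y) * np.maximum(eta - d, 0.0) ** 2   # push unsynced pairs past the margin
    return float(np.mean(pos + neg))

# Identical embeddings on a positive pair incur zero loss.
v = np.array([[1.0, 0.0]])
a = np.array([[1.0, 0.0]])
print(contrastive_loss(v, a, np.array([1.0])))  # → 0.0
```

A hard negative in this setup would be audio taken from the same video but at a shifted time, which the abstract's "careful choice of negative examples" refers to.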



Action classification

Accuracy on standard action-recognition datasets. We achieve impressive gains (+19.9% on UCF101 and +17.7% on HMDB51) by using AVTS as a pretraining mechanism.

Video Network   Pretraining Dataset   Pretraining Supervision                   UCF101   HMDB51
MC2 [1]         none                  N/A                                       67.2     41.2
                Kinetics              self-supervised (AVTS)                    83.6     54.3
                Kinetics              fully supervised (action labels)          87.9     62.0
MC3 [1]         none                  N/A                                       69.1     43.9
                Kinetics              self-supervised (AVTS)                    85.8     56.9
                Audioset              self-supervised (AVTS)                    89.0     61.6
                Kinetics              fully supervised (action labels)          90.5     66.8
I3D-RGB [2]     none                  N/A                                       57.1     40.0
                Kinetics              self-supervised (AVTS)                    83.7     53.0
I3D-RGB*        Imagenet              fully supervised (object labels)          84.5     49.8
                Kinetics              fully supervised (action labels)          95.1     74.3
                Kinetics + Imagenet   fully supervised (object+action labels)   95.6     74.8

* Results based on the reference implementation; the evaluation method may vary.

Audio scene classification

An SVM trained on our AVTS features achieves close-to or state-of-the-art performance on benchmark audio scene classification tasks.

Method              Pretraining Dataset   ESC-50   DCASE2014
L3 Net [3]          SoundNet              79.3     93
AVTS (best model)   SoundNet              82.3     94
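The SVM evaluation described above can be sketched as follows. The 512-dimensional features, the random stand-in data, and scikit-learn's LinearSVC are illustrative assumptions, not the paper's exact protocol; in practice each audio clip would be encoded by the pretrained AVTS audio subnet:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in for AVTS audio features (assumed 512-d); real features come
# from the frozen audio subnet, one vector per clip.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 512))
y_train = rng.integers(0, 50, size=200)   # e.g. the 50 ESC-50 classes

# Linear SVM on fixed features: no finetuning of the audio network.
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X_train, y_train)
preds = clf.predict(X_train[:5])
```

The key point of this protocol is that the representation itself is never updated: all task-specific learning happens in the linear classifier.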


The trained AVTS models can be downloaded from the following links:

MC3-a_mvgg: AVTS model based on MC3/a_vgg architectures. [Download]

MC2-a_mvgg: AVTS model based on MC2/a_vgg architectures. [Download]


[1] "A Closer Look at Spatiotemporal Convolutions for Action Recognition" - Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri

[2] "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" - Joao Carreira, Andrew Zisserman

[3] "Look, Listen and Learn" - Relja Arandjelovic, Andrew Zisserman


This work was funded in part by NSF award CNS-120552. We gratefully acknowledge NVIDIA and Facebook for the donation of GPUs used for portions of this work.