Audio-visual self-supervised baby learning

Andrew Zisserman (Oxford University)
Understanding Lower-Level Intelligence from AI, Psychology, and Neuroscience Perspectives

Abstract

Lesson 1 from the classic paper “The Development of Embodied Cognition: Six Lessons from Babies” is ‘Be Multimodal’. This talk explores how recent work in the computer vision literature on audio-visual self-supervised learning addresses this challenge. The aim is to learn audio and visual representations and capabilities directly from the audio-visual data stream of a video, without any manual supervision of the data, much as an infant could learn from the correspondence and synchronization between what they see and hear. The talk shows that a neural network that simply learns to synchronize audio and visual streams can localize the faces that are speaking (active speaker detection) and objects that sound.
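
To make the synchronization idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method presented in the talk: the embeddings are hand-set toy vectors standing in for the outputs of trained audio and visual networks, and the names `cosine`, `sync_loss`, and `localize` are hypothetical. The first function scores how well an audio clip matches several candidate visual clips, so that the in-sync clip is rewarded. The second compares one audio embedding with every spatial location of a visual feature map, which is one common way a synchronization-trained model can be read out as a localizer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sync_loss(audio_emb, visual_embs, positive_index, temperature=0.1):
    """Softmax cross-entropy over candidate visual clips.

    The loss is low when the visual clip at `positive_index` (assumed to be
    the one in sync with the audio) has the highest similarity.
    """
    logits = [cosine(audio_emb, v) / temperature for v in visual_embs]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[positive_index]


def localize(audio_emb, feature_map):
    """Return the (row, col) whose visual feature best matches the audio,
    together with the full similarity heatmap."""
    heat = [[cosine(audio_emb, cell) for cell in row] for row in feature_map]
    cells = [(r, c) for r in range(len(heat)) for c in range(len(heat[0]))]
    best = max(cells, key=lambda rc: heat[rc[0]][rc[1]])
    return best, heat


# Toy data (assumed, not from any real model).
audio = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # index 0 is "in sync"
feature_map = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],  # location (1, 0) matches the audio
]

loss_in_sync = sync_loss(audio, candidates, positive_index=0)
loss_out_of_sync = sync_loss(audio, candidates, positive_index=1)
best_cell, heatmap = localize(audio, feature_map)
```

In this toy setting the loss is smaller when the in-sync candidate is treated as the positive, and the best-matching location is the cell whose feature points in the same direction as the audio embedding. No labels are involved: the only training signal in such a setup would be which audio and visual clips come from the same moment of the same video.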
