
Audio-visual self-supervised baby learning


20 June 2024


Article Source


Audio-visual self-supervised baby learning

  • Andrew Zisserman (Oxford University)
  • Understanding Lower-Level Intelligence from AI, Psychology, and Neuroscience Perspectives

Abstract

Lesson 1 from the classic paper "The Development of Embodied Cognition: Six Lessons from Babies" is 'Be Multimodal'. This talk explores how recent work in the computer vision literature on audio-visual self-supervised learning addresses this challenge. The aim is to learn audio and visual representations and capabilities directly from the audio-visual stream of a video, without any manual supervision, much as an infant could learn from the correspondence and synchronization between what they see and hear. It is shown that a neural network that simply learns to synchronize the audio and visual streams is able to localize the faces that are speaking (active speaker detection) and the objects that sound.
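
As a rough illustration of the synchronization idea described in the abstract, the sketch below trains two tiny PyTorch encoders so that a video clip and its own audio score higher than audio taken from other clips, using an InfoNCE-style contrastive loss. This is not the talk's actual architecture or training setup; the encoder sizes, loss, and synthetic data are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' implementation) of learning
# audio-visual synchronization without manual labels: aligned (video, audio)
# pairs from the same clip are positives, other clips in the batch are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Tiny 3D-conv encoder: (B, 3, T, H, W) frames -> (B, D) unit embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class AudioEncoder(nn.Module):
    """Tiny 2D-conv encoder over log-mel spectrograms: (B, 1, F, T) -> (B, D)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def sync_contrastive_loss(v, a, temperature=0.07):
    """InfoNCE-style loss: the diagonal of the similarity matrix holds the
    aligned pairs (positives); off-diagonal entries are mismatched negatives."""
    logits = v @ a.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    video_enc, audio_enc = VideoEncoder(), AudioEncoder()
    opt = torch.optim.Adam(list(video_enc.parameters()) +
                           list(audio_enc.parameters()), lr=1e-4)

    # Synthetic stand-ins for temporally aligned clips: 8 videos of 16 frames
    # at 64x64, paired with 8 log-mel spectrograms (80 mel bins x 100 frames).
    frames = torch.randn(8, 3, 16, 64, 64)
    spectrograms = torch.randn(8, 1, 80, 100)

    loss = sync_contrastive_loss(video_enc(frames), audio_enc(spectrograms))
    loss.backward()
    opt.step()
    print(f"synchronization loss: {loss.item():.3f}")
```

In practice, localization (which face is speaking, which object is sounding) can be read off such a model by comparing the audio embedding against spatial visual features rather than a single pooled vector, so that the regions most correlated with the sound stand out; the sketch above pools everything for brevity.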

