Reflections from the Turing Award Winners

Speaker: Profs. Yann LeCun and Yoshua Bengio / Facebook AI + NYU / MILA

Abstract

The Future is Self-Supervised

Yann LeCun (The Future is Self-Supervised): Humans and animals learn enormous amount of background knowledge about the world in the early months of life with little supervision and almost no interactions. How can we reproduce this learning paradigm in machines? One proposal for doing so is Self-Supervised Learning (SSL) in which a system is trained to predict a part of the input from the rest of the input. SSL, in the form of denoising auto-encoder, has been astonishingly successful for learning task-independent representations of text. But the success has not been translated to images and videos. The main obstacle is how to represent uncertainty in high-dimensional continuous spaces in which probability densities are generally intractable. We propose to use Energy-Based Models (EBM) to represent data manifolds or level-sets of distributions on the variables to be predicted. There are two classes of methods to train EBMs: (1) contrastive methods that push down on the energy of data points and push up elsewhere; (2) architectural and regularizing methods that limit or minimize the volume of space that can take low energies by regularizing the information capacity of a latent variable. While contrastive methods have been somewhat successful to learn image features, they are very expensive computationally. I will propose that the future of self-supervised representation learning lies in regularized latent-variable energy-based models.

Deep Learning Priors Associated with Conscious Processing

Yoshua Bengio (Deep Learning Priors Associated with Conscious Processing): Some of the aspects of the world around us are captured in natural language and refer to semantic high-level variables, which often have a causal role (referring to agents, objects, and actions or intentions). These high-level variables also seem to satisfy very peculiar characteristics which low-level data (like images or sounds) do not share, and this work is about characterizing these characteristics in the form of priors which can guide the design of machine learning systems benefitting from these priors. Since these priors are not just about their joint distribution (e.g. it has a sparse factor graph) but also about how the distribution changes (typically by causal interventions), this analysis may also help to build machine learning systems which can generalize better out-of-distribution. There are fascinating connections between these priors and what is hypothesized about conscious processing in the brain, with conscious processing allowing us to reason (i.e., perform chains of inferences about the past and the future, as well as credit assignment) at the level of these high-level variables. This involves attention mechanisms and short-term memory to form a bottleneck of information being broadcast around the brain between different parts of it, as we focus on different high-level variables and some of their interactions. The presentation summarizes a few recent results using some of these ideas for discovering causal structure and modularizing recurrent neural networks with attention mechanisms in order to obtain better out-of-distribution generalization.

Stop Thinking, Just Do!