Portrait of Yoshua Bengio

Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research Department
Research Topics
Causality
Computational Neuroscience
Deep Learning
Generative Models
Graph Neural Networks
Machine Learning Theory
Medical Machine Learning
Molecular Modeling
Natural Language Processing
Probabilistic Models
Reasoning
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Biography

*For media requests, please write to medias@mila.quebec.

For more information please contact Cassidy MacNeil, Senior Assistant and Operation Lead at cassidy.macneil@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is most known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize and in 2022, he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

Collaborating Alumni - McGill University
Collaborating researcher - Cambridge University
Principal supervisor :
PhD - Université de Montréal
Collaborating researcher - N/A
Principal supervisor :
PhD - Université de Montréal
Collaborating researcher - KAIST
PhD - Université de Montréal
Independent visiting researcher
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating Alumni - Université de Montréal
Postdoctorate - Université de Montréal
Principal supervisor :
Collaborating researcher - Université de Montréal
Postdoctorate - Université de Montréal
Principal supervisor :
Collaborating Alumni
Collaborating researcher - s.o.
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Independent visiting researcher - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Ying Wu Coll of Computing
Collaborating researcher - University of Waterloo
Principal supervisor :
PhD - Université de Montréal
Postdoctorate - Université de Montréal
Postdoctorate - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Postdoctorate
Co-supervisor :
Collaborating Alumni - Université de Montréal
Co-supervisor :
Collaborating researcher
Principal supervisor :
Collaborating Alumni - Université de Montréal
Collaborating Alumni - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Université de Montréal
Collaborating researcher
Collaborating researcher - Université de Montréal
PhD - Université de Montréal
PhD - McGill University
Principal supervisor :
Collaborating Alumni - McGill University
Principal supervisor :

Publications

Efficient Regression-Based Training of Normalizing Flows for Boltzmann Generators
Oscar Davis
Michael Bronstein
Avishek Joey Bose
Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to large… (see more)-scale diffusion and flow matching models. However, such modern generative models suffer from expensive inference, inhibiting their use in numerous scientific applications like Boltzmann Generators (BGs) for molecular conformations that require fast likelihood evaluation. In this paper, we revisit classical normalizing flows in the context of BGs that offer efficient sampling and likelihoods, but whose training via maximum likelihood is often unstable and computationally challenging. We propose Regression Training of Normalizing Flows (RegFlow), a novel and scalable regression-based training objective that bypasses the numerical instability and computational challenge of conventional maximum likelihood training in favour of a simple
Fast Monte Carlo Tree Diffusion: 100x Speedup via Parallel Sparse Planning
Jaesik Yoon
Hyeonseo Cho
Diffusion models have recently emerged as a powerful approach for trajectory planning. However, their inherently non-sequential nature limit… (see more)s their effectiveness in long-horizon reasoning tasks at test time. The recently proposed Monte Carlo Tree Diffusion (MCTD) offers a promising solution by combining diffusion with tree-based search, achieving state-of-the-art performance on complex planning problems. Despite its strengths, our analysis shows that MCTD incurs substantial computational overhead due to the sequential nature of tree search and the cost of iterative denoising. To address this, we propose Fast-MCTD, a more efficient variant that preserves the strengths of MCTD while significantly improving its speed and scalability. Fast-MCTD integrates two techniques: Parallel MCTD, which enables parallel rollouts via delayed tree updates and redundancy-aware selection; and Sparse MCTD, which reduces rollout length through trajectory coarsening. Experiments show that Fast-MCTD achieves up to 100x speedup over standard MCTD while maintaining or improving planning performance. Remarkably, it even outperforms Diffuser in inference speed on some tasks, despite Diffuser requiring no search and yielding weaker solutions. These results position Fast-MCTD as a practical and scalable solution for diffusion-based inference-time reasoning.
Torsional-GFN: a conditional conformation generator for small molecules
Structure-Aligned Protein Language Model
Can Chen
David Heurtel-Depeiges
Robert M. Vernon
Christopher J. Langmead
Adaptive Inference-Time Scaling via Cyclic Diffusion Search
Gyubin Lee
Truong Nhat Nguyen Bao
Jaesik Yoon
Dongwoo Lee
Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. Ho… (see more)wever, most inference-time scaling methods rely on fixed denoising schedules, limiting their ability to allocate computation based on instance difficulty or task-specific demands adaptively. We introduce the challenge of adaptive inference-time scaling-dynamically adjusting computational effort during inference-and propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination. It comprises three components: Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time. Experiments show that ABCD improves performance across diverse tasks while maintaining computational efficiency.
Self-Evolving Curriculum for LLM Reasoning
Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
Jean-Pierre R. Falet
Oliver E. Richardson
Moksh J. Jain
Sungsoo Ahn
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can cont… (see more)ain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
Can a Bayesian Oracle Prevent Harm from an Agent?
Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the … (see more)long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the iid case and in the non-iid case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.
AI for Global Climate Cooperation: Modeling Global Climate Negotiations, Agreements, and Long-Term Cooperation in RICE-N
Phillip Wozny
Kai-Hendrik Cohrs
Koen Ponse
Soham Phade
Sunil Srinivasa
Yang Zhang
Prateek Gupta
Erman Acar
Stephan Zheng
Comprehensive global cooperation is essential to limit global temperature increases while continuing economic development, e.g., reducing se… (see more)vere inequality or achieving long-term economic growth. Achieving long-term cooperation on climate change mitigation with n strategic agents poses a complex game-theoretic problem. For example, agents may negotiate and reach climate agreements, but there is no central authority to enforce adherence to those agreements. Hence, it is critical to design negotiation and agreement frameworks that foster cooperation, allow all agents to meet their individual policy objectives, and incentivize long-term adherence. This is an interdisciplinary challenge that calls for collaboration between researchers in machine learning, economics, climate science, law, policy, ethics, and other fields. In particular, we argue that machine learning is a critical tool to address the complexity of this domain. To facilitate this research, here we introduce RICE-N, a multi-region integrated assessment model that simulates the global climate and economy, and which can be used to design and evaluate the strategic outcomes for different negotiation and agreement frameworks. We also describe how to use multi-agent reinforcement learning to train rational agents using RICE-N. This framework underpinsAI for Global Climate Cooperation, a working group collaboration and competition on climate negotiation and agreement design. Here, we invite the scientific community to design and evaluate their solutions using RICE-N, machine learning, economic intuition, and other domain knowledge. More information can be found on www.ai4climatecoop.org.
Outsourced Diffusion Sampling: Efficient Posterior Inference in Latent Spaces of Generative Models
Any well-behaved generative model over a variable …
Adaptive Teachers for Amortized Samplers
Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnorma… (see more)lized density where exact sampling is intractable. When sampling is implemented as a sequential decision-making process, reinforcement learning (RL) methods, such as generative flow networks, can be used to train the sampling policy. Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration. We propose to use an adaptive training distribution (the \teacher) to guide the training of the primary amortized sampler (the \student). The \teacher, an auxiliary behavior model, is trained to sample high-loss regions of the \student and can generalize across unexplored modes, thereby enhancing mode coverage by providing an efficient training curriculum. We validate the effectiveness of this approach in a synthetic environment designed to present an exploration challenge, two diffusion-based sampling tasks, and four biochemical discovery tasks demonstrating its ability to improve sample efficiency and mode coverage. Source code is available at https://github.com/alstn12088/adaptive-teacher.
Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction
Zhangzhi Peng
Zachary Quinn
Michael Bronstein
Pranam Chatterjee
Avishek Joey Bose
Generative modeling of discrete data underlies important applications spanning text-based agents like ChatGPT to the design of the very buil… (see more)ding blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process - typically via RLHF - to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework leads to a family of three novel objectives that are all simulation-free, and thus scalable while applying to general non-differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text-based rewards, and finetuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.