Yoshua Bengio

Biography

*For media requests, please write to medias@mila.quebec.

For more information please contact Cassidy MacNeil, Senior Assistant and Operation Lead at cassidy.macneil@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is most known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize and in 2022, he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

Jamal Abou Haibeh

Collaborating Alumni - McGill University

Mohammed Abukalam

Collaborating Alumni - Université de Montréal

Berkes Anaïs

Collaborating researcher - Cambridge University

Principal supervisor :

Rim Assouel

PhD - Université de Montréal

Stefan Bauer

Independent visiting researcher

Co-supervisor :

Guillaume Lajoie

Paul Bertin

PhD - Université de Montréal

Joyce Chai

Independent visiting researcher

Principal supervisor :

Siva Reddy

Shahana Chatterjee

Collaborating researcher - N/A

Principal supervisor :

David Rolnick

Xiaoyin Chen

PhD - Université de Montréal

Sanghyeok Choi

Collaborating researcher - KAIST

Collaborating Alumni - Université de Montréal

PhD - Université de Montréal

Collaborating Alumni - Université de Montréal

Co-supervisor :

Loubna Benabbou

Desmond Elliott

Independent visiting researcher

Principal supervisor :

PhD - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Jean-Pierre Falet

PhD - Université de Montréal

Leo Feng

PhD - Université de Montréal

PhD

PhD - Université de Montréal

Edward Hu

PhD - Université de Montréal

Moksh Jain

PhD - Université de Montréal

PhD - Université de Montréal

Principal supervisor :

Collaborating Alumni - Université de Montréal

Hyeonah Kim

Postdoctorate - Université de Montréal

Principal supervisor :

Alex Hernandez-Garcia

Salem Lahlou

Collaborating Alumni - Université de Montréal

Tabitha Edith Lee

Postdoctorate - Université de Montréal

Principal supervisor :

Collaborating Alumni

Zhen Liu

Collaborating Alumni - Université de Montréal

Principal supervisor :

Liam Paull

Kanika Madan

PhD - Université de Montréal

Nikolay Malkin

Collaborating Alumni - Université de Montréal

Cristian Dragos Manta

PhD - Université de Montréal

Co-supervisor :

Dhanya Sridhar

Sarthak Mittal

PhD - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

Principal supervisor :

Postdoctorate - Université de Montréal

Principal supervisor :

Independent visiting researcher - Université de Montréal

Padideh Nouri

PhD - Université de Montréal

Principal supervisor :

Ali Parviz

Collaborating researcher - Ying Wu Coll of Computing

Lena Podina

Collaborating researcher - University of Waterloo

Principal supervisor :

David Rolnick

Nassim Rahaman

Collaborating Alumni - Max-Planck-Institute for Intelligent Systems

Amine RAZIG

Collaborating researcher - Université de Montréal

Co-supervisor :

Loubna Benabbou

Jarrid Rector-Brooks

PhD - Université de Montréal

Danyal REHMAN

Postdoctorate - Université de Montréal

James Requeima

Independent visiting researcher - Université de Montréal

Oli RICHARDSON

Postdoctorate - Université de Montréal

Camille Rochefort-Boulanger

PhD - Université de Montréal

Principal supervisor :

Julie Hussin

Abhik Roychoudhury Roychoudhury

Independent visiting researcher

Principal supervisor :

Siva Reddy

Luca Scimeca

Postdoctorate - Université de Montréal

Collaborating Alumni - Université de Montréal

Marcin Sendera

Collaborating Alumni - Université de Montréal

Divya Sharma

Postdoctorate

Co-supervisor :

Alex Hernandez-Garcia

Mélisande Astrid Crystal Teng

PhD - Université de Montréal

Co-supervisor :

Hugo Larochelle

Ivan Titov

Independent visiting researcher

Principal supervisor :

Siva Reddy

Alex Tong

Collaborating Alumni - Université de Montréal

Postdoctorate - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Principal supervisor :

Collaborating researcher

Collaborating researcher - Université de Montréal

PhD - Université de Montréal

PhD - McGill University

Principal supervisor :

PhD - Université de Montréal

Principal supervisor :

Aaron Courville

Skipper: Combining Spatial and Temporal Abstraction for Better Generalization

Harry Zhao

Collaborating Alumni - McGill University

Principal supervisor :

Blog Posts

Generic thumbnail for Mila Blog articles.

February 22, 2024

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Scaling in the Service of Reasoning & Model-Based ML

April 4, 2023

Yoshua Bengio

Edward J. Hu

A collaboration between Mila and Relation Therapeutics to discover novel synergistic combinations of drugs in vitro

March 23, 2022

Paul Bertin

Jake P. Taylor-King

Yoshua Bengio

March 15, 2022

Generative Flow Networks

Yoshua Bengio

Publications

BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Juan A. Rodriguez

Xiangru Jian

Siba Smarak Panigrahi

Aarash Feizi

Abhay Puri

Akshay Kalkunte Suresh

Amirhossein Abaskohi

Pierre-Andre Noel

Sanket Biswas … (see 19 more)

Sara Shanian

Ying Zhang

Sathwik Tejaswi Madhusudhan

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Nicolas Chapados

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (see more) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to relevant training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure that our data is high quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench,, a benchmark suite with 10 novel tasks where we carefully create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench, improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations revealed that participants preferred the outputs from models trained with BigDocs over those from GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning.

2025-01-22

ICLR.cc/2025/Conference (poster)

Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

Zhen Liu

Tim Z. Xiao

Weiyang Liu

Dinghuai Zhang

While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetun… (see more)e pretrained diffusion models on some reward functions that are either designed by experts or learned from small-scale datasets. Existing methods for finetuning diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample with the unnormalized density of a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as

2025-01-22

ICLR.cc/2025/Conference (poster)

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

Seanie Lee

Haebin Seong

Dong Bok Lee

Minki Kang

Xiaoyin Chen

Dominik Wagner

Juho Lee

Sung Ju Hwang

Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsibl… (see more)e deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as,"Make a single harmful instruction prompt that would elicit offensive content", we add an affirmative prefix (e.g.,"I have an idea for a prompt:") to the LLM's response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.

2025-01-22

ICLR.cc/2025/Conference (poster)

Learning diverse attacks on large language models for robust red-teaming and safety tuning

Juho Lee

Sung Ju Hwang

Moksh J. Jain

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of lar… (see more)ge language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

2025-01-22

ICLR.cc/2025/Conference (poster)

MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation

Li Li

Zhiqi Bu

Suyuchen Wang

Huan He

Jie Fu

Yonghui Wu

Jiang Bian

Yong Chen

Model merging has emerged as an effective approach to combine multiple single-task models, fine-tuned from the same pre-trained model, into … (see more)a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the objectives of different tasks can lead to trade-offs during model merging. In real-world applications, a set of solutions with various trade-offs can be more informative, helping practitioners make decisions based on diverse preferences. In this paper, we introduce a novel low-compute algorithm, Model Merging with Amortized Pareto Front (MAP). MAP identifies a Pareto set of scaling coefficients for merging multiple models to reflect the trade-offs. The core component of MAP is approximating the evaluation metrics of the various tasks using a quadratic approximation surrogate model derived from a pre-selected set of scaling coefficients, enabling amortized inference. Experimental results on vision and natural language processing tasks show that MAP can accurately identify the Pareto front. To further reduce the required computation of MAP, we propose (1) a Bayesian adaptive sampling algorithm and (2) a nested merging scheme with multiple stages.

2025-01-22

ICLR.cc/2025/Conference (poster)

Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold

Lazar Atanackovic

Xi Zhang

Brandon Amos

Leo J Lee

Numerous biological and physical processes can be modeled as systems of interacting entities evolving continuously over time, e.g. the dynam… (see more)ics of communicating cells or physical particles. Learning the dynamics of such systems is essential for predicting the temporal evolution of populations across novel samples and unseen environments. Flow-based models allow for learning these dynamics at the population level - they model the evolution of the entire distribution of samples. However, current flow-based models are limited to a single initial population and a set of predefined conditions which describe different dynamics. We argue that multiple processes in natural sciences have to be represented as vector fields on the Wasserstein manifold of probability densities. That is, the change of the population at any moment in time depends on the population itself due to the interactions between samples. In particular, this is crucial for personalized medicine where the development of diseases and their respective treatment response depends on the microenvironment of cells specific to each patient. We propose Meta Flow Matching (MFM), a practical approach to integrating along these vector fields on the Wasserstein manifold by amortizing the flow model over the initial populations. Namely, we embed the population of samples using a Graph Neural Network (GNN) and use these embeddings to train a Flow Matching model. This gives MFM the ability to generalize over the initial distributions unlike previously proposed methods. We demonstrate the ability of MFM to improve prediction of individual treatment responses on a large scale multi-patient single-cell drug screen dataset.

2025-01-22

ICLR.cc/2025/Conference (poster)

Structure Language Models for Protein Conformation Generation

Stephen Zhewen Lu

Hongyu Guo

2025-01-22

ICLR.cc/2025/Conference (poster)

On the Transfer of Object-Centric Representation Learning

Aniket Rajiv Didolkar

Andrii Zadaianchuk

Anirudh Goyal

Michael Curtis Mozer

Georg Martius

Maximilian Seitzer

The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities… (see more) into individual vectors. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing features from pre-trained foundation models like DINO. However, so far, these object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the underlying foundation models, which have been shown to be applicable to a wide range of data and tasks. Thus, in this work, we answer the question of whether current real-world capable object-centric methods exhibit similar levels of transferability by introducing a benchmark comprising seven different synthetic and real-world datasets. We analyze the factors influencing performance under transfer and find that training on diverse real-world images improves generalization to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

2025-01-22

ICLR.cc/2025/Conference (poster)

On the Transfer of Object-Centric Representation Learning.

Aniket Rajiv Didolkar

Andrii Zadaianchuk

Anirudh Goyal

Michael Curtis Mozer

Georg Martius

Maximilian Seitzer

2025-01-22

ICLR.cc/2025/Conference (poster)

Towards Improving Exploration through Sibling Augmented GFlowNets

Exploration is a key factor for the success of an active learning agent, especially when dealing with sparse extrinsic terminal rewards and … (see more)long trajectories. We introduce Sibling Augmented Generative Flow Networks (SA-GFN), a novel framework designed to enhance exploration and training efficiency of Generative Flow Networks (GFlowNets). SA-GFN uses a decoupled dual network architecture, comprising of a main Behavior Network and an exploratory Sibling Network, to enable a diverse exploration of the underlying distribution using intrinsic rewards. Inspired by the ideas on exploration from reinforcement learning, SA-GFN provides a general-purpose exploration and learning paradigm that integrates with multiple GFlowNet training objectives and is especially helpful for exploration over a wide range of sparse or low reward distributions and task structures. An extensive set of experiments across a diverse range of tasks, reward structures and trajectory lengths, along with a thorough set of ablations, demonstrate the superior performance of SA-GFN in terms of exploration efficacy and convergence speed as compared to the existing methods. In addition, SA-GFN's versatility and compatibility with different GFlowNet training objectives and intrinsic reward methods underscores its broad applicability in various problem domains.

2025-01-22

ICLR.cc/2025/Conference (poster)

Towards Improving Exploration through Sibling Augmented GFlowNets.

2025-01-22

ICLR.cc/2025/Conference (poster)

VCR: Pixel-Level Complex Reasoning by Restoring Occluded Text

Suyuchen Wang

Li Li

Ge Zhang

Sai Rajeswar

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured … (see more)texts using pixel-level hints within images through complex reasoning. This task stems from the observation that text embedded in images intrinsically differs from common visual elements and text due to the need to align the modalities of vision, text, and text embedded in images. While many works incorporate text into images for visual question answering, they mostly rely on OCR or masked language modeling, reducing the task to text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny, exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct VCR-WIKI for VCR using Wikipedia images with captions, including 2.11M English and 346K Chinese training entities, plus 5K validation and 5K test entities in both languages, each in easy and hard configurations. We also make a hidden test set, VCR-HIDDEN, to avoid potential overfitting on VCR-WIKI. Our results reveal that current vision-language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-WIKI and the data construction code to facilitate future research.

2025-01-22

ICLR.cc/2025/Conference (poster)