
Marc Gendron-Bellemare

Core Industry Member
Canada CIFAR AI Chair
Associate Professor, McGill University, School of Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Chief Scientific Officer, Reliant AI
Research Topics
Large Language Models (LLM)
Reinforcement Learning
Representation Learning

Biography

I am Chief Scientific Officer at Reliant AI, an adjunct professor at the School of Computer Science at McGill University, and an adjunct professor at the Department of Computer Science and Operations Research (DIRO) at Université de Montréal.

Previously, I was a research scientist at Google Brain in Montréal, where my research focused on reinforcement learning. From 2013 to 2017, I worked at DeepMind in the U.K. I received my PhD from the University of Alberta under the supervision of Michael Bowling and Joel Veness.

My research lies at the intersection of reinforcement learning and probabilistic prediction. I am also interested in deep learning, generative modelling, online learning and information theory.

Current Students

PhD - McGill University (co-supervisor)
PhD - McGill University (co-supervisor)
PhD - Université de Montréal (principal supervisor)
PhD - McGill University (principal supervisor)

Publications

Biomechanical finite element simulation of the pelvic organs under dynamic loading and validation against experimental data from magnetic resonance imaging.
Camille Lafond
Louise Hohnadel
Thomas Brunel
Nicolas Pirró
Dominique Chamoret
Sébastien Roth
Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning
Yash Jhaveri
Patrick Shafto
In the pursuit of finding an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies apart from their expected return. Thus, even when successful, it is difficult to characterize which policies will be learned and what they will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy, via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes and ensures the convergence of policy-derived objects: value functions and return distributions. In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging our temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated to its interpretable, diversity-preserving optimal policy.
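
To make the limiting behaviour concrete, here is a minimal NumPy sketch (an illustration of the described limit, not the paper's algorithm): as the temperature of a softmax (entropy-regularized) policy over action values vanishes, the policy converges to one that is uniform over the set of optimal actions rather than an arbitrary single maximizer.

import numpy as np

def softmax_policy(q_values, temperature):
    # Entropy-regularized policy: softmax of action values at a given temperature.
    logits = (q_values - q_values.max()) / temperature  # subtract max for stability
    weights = np.exp(logits)
    return weights / weights.sum()

q = np.array([1.0, 2.0, 2.0, 0.5])  # two optimal actions: indices 1 and 2
for tau in (1.0, 0.1, 0.01, 0.001):
    print(f"tau={tau:5.3f} ->", np.round(softmax_policy(q, tau), 4))
# As tau -> 0 the distribution approaches [0, 0.5, 0.5, 0]: uniform over
# the maximizers, matching the diversity-preserving limit described above.
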
Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs
Jonathan Lebensoldt
Joshua Greaves
Alex Fréchette
Sándor Toth
Sam Work
We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains for training both a model for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the "wasted inference" that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
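
The following is a rough PyTorch sketch of the general idea of an asymmetric, tapered importance weight in an off-policy REINFORCE-style loss. The function name and the specific taper (clamp the ratio at 1 for positive examples, drop negative examples whose ratio exceeds 1) are illustrative assumptions, not the exact TOPR estimator from the paper.

import torch

def tapered_reinforce_loss(logp_new, logp_old, returns):
    # Importance ratio between the current policy and the (off-policy)
    # behaviour policy that generated the data.
    ratio = torch.exp(logp_new - logp_old)
    # Asymmetric taper (assumed form): positive examples are never
    # up-weighted beyond 1; negative examples whose ratio has drifted
    # above 1 are dropped rather than amplified.
    pos_w = torch.clamp(ratio, max=1.0)
    neg_w = torch.where(ratio <= 1.0, ratio, torch.zeros_like(ratio))
    weights = torch.where(returns >= 0, pos_w, neg_w).detach()
    # REINFORCE-style objective: gradient flows only through logp_new.
    return -(weights * returns * logp_new).mean()

# Toy usage: three samples, two with positive return and one negative.
logp_new = torch.tensor([-1.2, -0.7, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.0, -1.5, -0.3])
returns = torch.tensor([1.0, 1.0, -1.0])
tapered_reinforce_loss(logp_new, logp_old, returns).backward()
print(logp_new.grad)

Because the weights are detached and bounded, no single stale sample can dominate the gradient, which is one way to read the stability claim above.
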
MSR37 Improve Analyst Accuracy in Systematic Literature Reviews Using Reliant Tabular and LLM-Based Relevance Scoring
Christoph R. Schlegel
Sam Work
Learning and Controlling Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy
Joshua Greaves
Ekin Dogus Cubuk
Sergei Kalinin
Igor Mordatch
Kevin M Roccapriore
We introduce a machine learning approach to determine the transition dynamics of silicon atoms on a single layer of carbon atoms, when stimulated by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition probabilities. These learned transition dynamics are then leveraged to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.
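
A minimal, self-contained Python sketch of the pipeline the abstract outlines; the dataset shapes, the softmax regression standing in for the neural network, and the guide_step heuristic are all assumptions for illustration, not the authors' code.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symbolic dataset: each row encodes the local lattice
# neighbourhood; each label records which of 3 neighbouring sites the
# silicon atom transitioned to after a beam stimulation.
X = rng.normal(size=(512, 8))
y = rng.integers(0, 3, size=512)

# Softmax regression standing in for the paper's neural network.
W = np.zeros((8, 3))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0        # gradient of mean cross-entropy
    W -= 0.1 * (X.T @ p) / len(y)

def transition_probs(state):
    # Predicted probability of the atom moving to each neighbouring site.
    logits = state @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def guide_step(state, distance_reduction):
    # Greedy one-step control: stimulate so that the most probable
    # transition is also the one that most reduces distance to the target.
    return int(np.argmax(transition_probs(state) * distance_reduction))

print(guide_step(rng.normal(size=8), np.array([0.0, 1.0, 0.5])))
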
Action Gaps and Advantages in Continuous-Time Distributional Reinforcement Learning
Patrick Shafto
Yash Jhaveri