Marc Gendron-Bellemare

Core Industry Member
Canada CIFAR AI Chair
Adjunct Professor, McGill University, School of Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Chief Scientific Officer, Reliant AI
Research Topics
Representation Learning
Reinforcement Learning
Large Language Models (LLM)

Biography

I am currently Chief Scientific Officer at Reliant AI. I am also an adjunct professor at McGill University's School of Computer Science and an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal.

Previously, I worked at Google Brain in Montréal, where I focused on reinforcement learning. From 2013 to 2017, I worked at DeepMind in the United Kingdom. I received my PhD from the University of Alberta, working with Michael Bowling and Joel Veness.

My research lies at the intersection of reinforcement learning and probabilistic prediction. I am also interested in deep learning, generative modelling, online learning, and information theory.

Current Students

PhD - McGill
Co-supervisor:
PhD - McGill
Co-supervisor:
PhD - UdeM
Principal supervisor:
PhD - McGill
Principal supervisor:

Publications

Biomechanical finite element simulation of the pelvic organs under dynamic loading and validation against experimental data from magnetic resonance imaging.
Camille Lafond
Louise Hohnadel
Thomas Brunel
Nicolas Pirró
Dominique Chamoret
Sébastien Roth
Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning
Yash Jhaveri
Patrick Shafto
In the pursuit of finding an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies apart from their expected return. Thus, even when successful, it is difficult to characterize which policies will be learned and what they will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy, via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes and ensures the convergence of policy-derived objects (value functions and return distributions). In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging our temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated to its interpretable, diversity-preserving optimal policy.
Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs
Jonathan Lebensoldt
Joshua Greaves
Alex Fréchette
Sándor Toth
Sam Work
We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains when training a model both for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the "wasted inference" that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
MSR37 Improve Analyst Accuracy in Systematic Literature Reviews Using Reliant Tabular and LLM-Based Relevance Scoring
Christoph R. Schlegel
Sam Work
Learning and Controlling Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy
Joshua Greaves
Ekin Dogus Cubuk
Sergei Kalinin
Igor Mordatch
Kevin M Roccapriore
We introduce a machine learning approach to determine the transition dynamics of silicon atoms on a single layer of carbon atoms, when stimulated by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition probabilities. These learned transition dynamics are then leveraged to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.
Action Gaps and Advantages in Continuous-Time Distributional Reinforcement Learning
Patrick Shafto
Yash Jhaveri