Doina Precup

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, McGill University, School of Computer Science
Research Team Lead, Google DeepMind

Biography

Doina Precup teaches at McGill University while conducting fundamental research on reinforcement learning, including applications of AI in areas with social impact, such as health care. She is interested in automated decision-making under high uncertainty.

She is a member of the Canadian Institute for Advanced Research (CIFAR) and of the Association for the Advancement of Artificial Intelligence (AAAI), and she heads the Montreal office of DeepMind.

Her areas of expertise are artificial intelligence, machine learning, reinforcement learning, reasoning and planning under uncertainty, and applications.

Current Students

Research Master's - McGill University
Co-supervisor:
PhD - McGill University
Research Master's - McGill University
Postdoctorate - McGill University
Research Master's - McGill University
PhD - McGill University
Research Intern - McGill University
PhD - McGill University
Postdoctorate - Université de Montréal
Principal supervisor:
PhD - McGill University
PhD - McGill University
Principal supervisor:
Research Master's - McGill University
Principal supervisor:
Research Intern - McGill University
PhD - McGill University
Principal supervisor:
Research Master's - McGill University
Co-supervisor:
PhD - McGill University
Co-supervisor:
PhD - McGill University
PhD - McGill University
Co-supervisor:
Research Intern - McGill University
PhD - McGill University
Principal supervisor:
Research Collaborator - McGill University
Research Master's - McGill University
Research Master's - Université de Montréal
PhD - McGill University
Co-supervisor:
PhD - McGill University
PhD - McGill University
Co-supervisor:
Research Collaborator - McGill University
Principal supervisor:
PhD - McGill University
Undergraduate - McGill University
PhD - McGill University
Co-supervisor:
Research Master's - Université de Montréal
Principal supervisor:
PhD - McGill University
PhD - McGill University
Principal supervisor:
PhD - McGill University
Principal supervisor:

Publications

QGFN: Controllable Greediness with Action Values
Elaine Lau
Stephen Zhewen Lu
Ling Pan
Emmanuel Bengio
Generative Flow Networks (GFlowNets; GFNs) are a family of reward/energy-based generative methods for combinatorial objects, capable of generating diverse and high-utility samples. However, biasing GFNs towards producing high-utility samples is non-trivial. In this work, we leverage connections between GFNs and reinforcement learning (RL) and propose to combine the GFN policy with an action-value estimate,
Effective Protein-Protein Interaction Exploration with PPIretrieval
Chenqing Hua
Connor Coley
Shuangjia Zheng
Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning
Harry Zhao
Mingde Zhao
Safa Alver
Harm van Seijen
Romain Laroche
Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning framework utilizing spatio-temporal abstractions to generalize better in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and thus enables sparse decision-making and focused computation on the relevant parts of the environment. The decomposition relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper’s significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods.
Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo
Haque Ishfaq
Qingfeng Lan
Pan Xu
A. Rupam Mahmood
Animashree Anandkumar
Kamyar Azizzadenesheli
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of
Connecting Weighted Automata, Tensor Networks and Recurrent Neural Networks through Spectral Learning
Policy Gradient Methods in the Presence of Symmetries and State Abstractions
Sahand Rezaei-Shoshtari
Rosie Zhao
Nash Learning from Human Feedback
Rémi Munos
Michal Valko
Daniele Calandriello
M. G. Azar
Mark Rowland
Zhaohan Daniel Guo
Yunhao Tang
Matthieu Geist
Thomas Mesnard
Andrea Michi
Marco Selvi
Sertan Girgin
Nikola Momchev
Olivier Bachem
Daniel J Mankowitz
Bilal Piot
Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
Learning domain-invariant classifiers for infant cry sounds
Charles Onu
Hemanth K. Sheetha
Arsenii Gorin
MUDiff: Unified Diffusion for Complete Molecule Generation
Chenqing Hua
Sitao Luan
Minkai Xu
Rex Ying
Zhitao Ying
Jie Fu
Stefano Ermon
DGFN: Double Generative Flow Networks
Elaine Lau
Nikhil Murali Vemgal
Emmanuel Bengio
Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search
Abbas Mehrabian
Ankit Anand
Hyunjik Kim
Nicolas Sonnerat
Matej Balog
Gheorghe Comanici
Tudor Berariu
Andrew Lee
Anian Ruoss
Anna Bulanova
Daniel Toyama
Sam Blackwell
Bernardino Romera Paredes
Petar Veličković
Laurent Orseau
Joonkyung Lee
Anurag Murty Naredla
Adam Zsolt Wagner
Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels
Thomas Jiralerspong
Flemming Kondrup
The ability to plan at many different levels of abstraction enables agents to envision the long-term repercussions of their decisions and thus enables sample-efficient learning. This becomes particularly beneficial in complex environments from high-dimensional state space such as pixels, where the goal is distant and the reward sparse. We introduce Forecaster, a deep hierarchical reinforcement learning approach which plans over high-level goals leveraging a temporally abstract world model. Forecaster learns an abstract model of its environment by modelling the transitions dynamics at an abstract level and training a world model on such transition. It then uses this world model to choose optimal high-level goals through a tree-search planning procedure. It additionally trains a low-level policy that learns to reach those goals. Our method not only captures building world models with longer horizons, but also, planning with such models in downstream tasks. We empirically demonstrate Forecaster's potential in both single-task learning and generalization to new tasks in the AntMaze domain.