Portrait of Glen Berseth

Glen Berseth

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Reinforcement Learning
Deep Learning
Robotics

Biography

Glen Berseth is an associate professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal, a core academic member of Mila – Quebec Artificial Intelligence Institute, a Canada CIFAR AI Chair holder, and co-director of the Montreal Robotics and Embodied AI Lab (REAL). He was a postdoctoral researcher at Berkeley Artificial Intelligence Research (BAIR), where he worked with Sergey Levine. His research focuses on solving sequential decision-making (planning) problems for real-world autonomous learning systems (robots). It has spanned human-robot collaboration, reinforcement learning, and continual, multi-agent, hierarchical, and meta-learning. Glen Berseth has published in top venues in robotics, machine learning, and computer animation. He also teaches a course on robot learning at Université de Montréal and Mila, covering the most recent research on machine learning techniques for creating generalist robots.

Current Students

Master's Research - UdeM
PhD - McGill
Principal supervisor:
PhD - UdeM
Co-supervisor:
PhD - UdeM
Principal supervisor:
PhD
Principal supervisor:
Research Collaborator - UdeM
Master's Research - UdeM
PhD - UdeM
Co-supervisor:
Postdoctorate - UdeM
Co-supervisor:
Master's Research - UdeM
Research Intern - UdeM
PhD - UdeM
Co-supervisor:
Research Collaborator
PhD - UdeM
Co-supervisor:
PhD - UdeM

Publications

SLowRL: Safe Low-Rank Adaptation for Bridging the Sim-to-Real Gap in Legged Locomotion
Shafeef Omar
Majid Khadiv
A simulator is, at best, a coarse low-fidelity model of the real world the agent eventually has to act in. Closing this residual gap on hardware is a canonical instance of operating in a big world: the real environment exposes contact dynamics, latencies, and disturbances that the agent was never given the capacity (parameters or data) to model during pretraining. Naive on-hardware fine-tuning is risky, since the policy can damage the robot before it improves, and full-parameter updates require prohibitive interaction time. We propose SLowRL, a continual fine-tuning framework that confronts this big-world adaptation problem with two complementary forms of capacity limitation: (i) a rank-1 LoRA adapter applied per layer to both actor and critic, restricting each layer's update to a single direction in its image space (
AI Agent Safety is a Reinforcement Learning Problem
Reginald McLean
Montaser Mohammedalamen
Kevin Roice
Patrick M. Pilarski
Marlos C. Machado
Alyssa Lefaivre Škopac
Benjamin Rosman
With the rapid advancement and deployment of agentic AI, our scientific understanding of capabilities and limitations has not kept pace, leading to cases where AI agents cause harm. We argue that many of these safety limitations are not novel problems. Instead, the safety challenges currently facing AI agents can be seen as instances of problems the reinforcement learning (RL) community has studied rigorously for decades. The core of this argument concerns the problem formulation of AI agents. AI agents are designed to solve sequential decision-making problems: problems with long-term objectives in which actions have delayed consequences. To model these types of problems, the problem is set up such that the agent receives observations and feedback on its progress, and then takes actions. This is precisely the formulation of the RL problem. In this paper, we formalize the problem equivalence, which we then leverage to argue that AI agent safety is a reinforcement learning problem: the failure modes currently observed in deployed AI agents are structural instances of problems RL has formalized for decades, and the RL safety literature provides principled tools to diagnose and address them. We conclude with a call for deliberate collaboration between the RL and AI agent research communities: AI agent researchers gain access to principled frameworks, while RL researchers gain a class of real-world problems that could expose fundamental gaps in current RL benchmarks and theory.
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Roger Creus Castanyer
AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
Toward Hardware-Agnostic Quadrupedal World Models via Morphology Conditioning
World models promise a paradigm shift in robotics, where an agent learns the underlying physics of its environment once to enable efficient planning and behavior learning. However, current world models are often hardware-locked specialists: a model trained on a Boston Dynamics Spot robot fails catastrophically on a Unitree Go1 due to the mismatch in kinematic and dynamic properties, as the model overfits to specific embodiment constraints rather than capturing the universal locomotion dynamics. Consequently, a slight change in actuator dynamics or limb length necessitates training a new model from scratch. In this work, we take a step towards a framework for training a generalizable Quadrupedal World Model (QWM) that disentangles environmental dynamics from robot morphology. We address the limitations of implicit system identification, where treating static physical properties (like mass or limb length) as latent variables to be inferred from motion history creates an adaptation lag that can compromise zero-shot safety and efficiency. Instead, we explicitly condition the generative dynamics on the robot's engineering specifications. By integrating a physical morphology encoder and a reward normalizer, we enable the model to serve as a neural simulator capable of generalizing across morphologies. This capability unlocks zero-shot control across a range of embodiments. We introduce, for the first time, a world model that enables zero-shot generalization to new morphologies for locomotion. We carefully study the limitations of our method: QWM operates as a distribution-bounded interpolator within the quadrupedal morphology family rather than a universal physics engine. Even so, this work represents a significant step toward morphology-conditioned world models for legged locomotion.
SLowRL: Safe Low-Rank Adaptation Reinforcement Learning for Locomotion
Shafeef Omar
Majid Khadiv
Sim-to-real transfer of locomotion policies often leads to performance degradation due to the inevitable sim-to-real gap. Naively fine-tuning these policies directly on hardware is problematic, as it poses risks of mechanical failure and suffers from high sample inefficiency. In this paper, we address the challenge of safely and efficiently fine-tuning reinforcement learning (RL) policies for dynamic locomotion tasks. Specifically, we focus on fine-tuning policies learned in simulation directly on hardware, while explicitly enforcing safety constraints. In doing so, we introduce SLowRL, a framework that combines Low-Rank Adaptation (LoRA) with training-time safety enforcement via a recovery policy. We evaluate our method both in simulation and on a real Unitree Go2 quadruped robot for jump and trot tasks. Experimental results show that our method achieves a
Toward Self-Driven Microscopy Exploration for the Characterization of Functional Materials
Claudia M. Bazán
Ramzi Zidani
Maxime Goulet
Jean-Nicolas Deraspe
Jeanine Looman
Delphine Bouilly
Audrey Laventure
Generalization in Online Reinforcement Learning for Mobile Agents
Zihuan Jiang
Zhixiang Chi
Huan Liu
Ziqiang Wang
Yuanhao Yu
Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language-model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld-Generalization, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test-time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure (https://github.com/zihuanjiang/AndroidWorld-Generalization).
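The "group relative" idea behind GRPO can be illustrated with a short sketch. It assumes the commonly described formulation in which each rollout's reward is standardized against the other rollouts sampled for the same task. The function name, the `eps` constant, and the example rewards are hypothetical and are not taken from the paper's training system.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group.

    `rewards` holds one scalar reward per rollout, all sampled for the
    same task instance. `eps` (an assumed constant) avoids division by
    zero when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts of one task: two succeed, two fail.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group with identical rewards carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Successful rollouts receive positive advantages and failed ones negative advantages, without a learned value function; a group with identical rewards yields all-zero advantages.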
Align and Filter: Improving Performance in Asynchronous On-Policy RL
Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: policy lag, which is the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use the findings to propose total Variation-based Advantage aligned Constrained policy Optimization as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag in classic RL tasks and a modern RL for LLM math reasoning task.
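Policy lag, as described above, is a mismatch between two action distributions. One simple way to quantify it at a single state, assuming discrete actions, is the total variation distance between the behavior policy and the learning policy. The sketch below is only an illustration of that quantity; the function name and the example probabilities are hypothetical, and this is not the paper's algorithm.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    `p` and `q` are action-probability lists over the same action set,
    e.g. the behavior policy and the learning policy at one state.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same actions")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    behavior = [0.5, 0.5]  # stale policy that collected the data
    learner = [0.9, 0.1]   # policy after several gradient updates
    print(total_variation(behavior, learner))
    print(total_variation(behavior, behavior))  # no lag: distance is zero
```

The distance is zero when the two policies agree and grows toward one as they place their probability mass on different actions.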
Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards
Catherine Ji
Benjamin Eysenbach
Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.
ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
Roger Creus Castanyer
Cyrus Neary
Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs), an automata-based formalism for reward specification, are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automaton state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
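To make the reward machine formalism mentioned above concrete, here is a minimal sketch of a reward machine as a finite-state automaton whose transitions emit rewards. It is written by hand for a hypothetical "get the key, then open the door" task; the class, state names, events, and reward values are all assumptions for illustration, and the foundation-model generation step described in the abstract is not shown.

```python
class RewardMachine:
    """Finite-state automaton that maps high-level events to rewards."""

    def __init__(self, initial, transitions, terminal):
        self.initial = initial
        # transitions: {(state, event): (next_state, reward)}
        self.transitions = transitions
        self.terminal = set(terminal)
        self.state = initial

    def reset(self):
        self.state = self.initial
        return self.state

    def step(self, event):
        """Advance on `event`; unknown events leave the state unchanged."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return next_state, reward, next_state in self.terminal


def make_key_door_machine():
    # Hypothetical task decomposition: u0 -(key)-> u1 -(door)-> u2 (done).
    transitions = {
        ("u0", "key"): ("u1", 0.1),
        ("u1", "door"): ("u2", 1.0),
    }
    return RewardMachine("u0", transitions, terminal={"u2"})


if __name__ == "__main__":
    rm = make_key_door_machine()
    for event in ["door", "key", "door"]:
        print(event, rm.step(event))
```

Reaching the door before the key yields no reward, because the machine is still in its initial state; the same event is rewarded once the key sub-task is complete, which is how the automaton encodes the task decomposition.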
Discovering Diverse Behaviors via Temporal Contrastive Learning
Catherine Ji
Benjamin Eysenbach
Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.
Position: Collusion Risks Among AI Reasoning Agents Justify Certification Requirements for Making Market Decisions
This position paper argues that AI agents with chain-of-thought reasoning capabilities are predisposed to exhibit collusive behavior and should be required to obtain behavioral certification before making decisions that affect economic markets. This is because integrating these agents into society could collapse the legal evidentiary distinction between competition and collusion among independent firms without eroding the economic harm distinction. Experiments with DeepSeek-R1 agents in the Bertrand oligopoly pricing domain reveal a tendency towards tacit collusion that persists even when humans prompt the agents not to collude. We further show that the chain-of-thought of these agents can be steered toward either extremely collusive or highly competitive behavior in a way that is not semantically detectable by another LLM analyzing the reasoning traces. As a result, deploying reasoning agents for market decisions leads to collusive economic outcomes without any evidence of conspiracy or intent. Thus, certification based on observed behavior in representative situations is necessary to prevent collusion. We provide preliminary evidence that such agents can be steered in a generalizable way toward efficient competitive equilibria. However, developing a comprehensive behavioral certification will be required before these models can be deployed in real-world markets while ensuring their stability and efficiency.
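For readers unfamiliar with the Bertrand pricing setting mentioned above, the following sketch shows why collusion is profitable and why it is normally unstable. It assumes the textbook homogeneous-good version with linear demand, where the lowest-priced firm serves the whole market and ties are split equally. The demand intercept, marginal cost, and prices are hypothetical, and the paper's actual experimental setup may differ.

```python
def bertrand_profits(prices, cost=1.0, demand_intercept=10.0):
    """Per-firm profit in a homogeneous-good Bertrand market.

    Assumed model: demand is max(0, demand_intercept - lowest price),
    served entirely by the lowest-priced firm(s), split equally on ties.
    """
    low = min(prices)
    winners = [i for i, p in enumerate(prices) if p == low]
    quantity = max(0.0, demand_intercept - low)
    share = quantity / len(winners)
    return [
        (low - cost) * share if i in winners else 0.0
        for i in range(len(prices))
    ]


if __name__ == "__main__":
    monopoly_price = (10.0 + 1.0) / 2  # maximizes (p - cost) * (10 - p)
    print("collusive:", bertrand_profits([monopoly_price, monopoly_price]))
    print("undercut: ", bertrand_profits([5.4, monopoly_price]))
    print("competitive:", bertrand_profits([1.0, 1.0]))
```

Under these assumptions, both firms earn 10.125 at the shared monopoly price, a firm that undercuts slightly captures roughly double that while its rival earns nothing, and pricing at marginal cost leaves both with zero profit. Sustained prices well above cost without explicit agreement are what is usually meant by tacit collusion.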