Glen Berseth

Biography

Glen Berseth is an assistant professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal and a core academic member of Mila – Quebec Artificial Intelligence Institute.

He is a Canada CIFAR AI Chair and co-directs the Robotics and Embodied AI Lab (REAL). He was formerly a postdoctoral researcher at Berkeley Artificial Intelligence Research (BAIR), working with Sergey Levine.

Berseth’s previous and current research has focused on solving sequential decision-making problems (planning) for real-world autonomous learning systems (robots). More specifically, his research has focused on human-robot collaboration, reinforcement, and continual-, meta-, multi-agent and hierarchical learning.

He has published in the top venues in robotics, machine learning and computer animation. He teaches a course on robot learning at Université de Montréal and at Mila, in which he covers the most recent research on machine learning techniques for creating generalist robots.

Current Students

Özgür Aslan

PhD - Université de Montréal

PhD - Université de Montréal

Master's Research - Université de Montréal

Website

Roger Creus-Castanyer

PhD - Université de Montréal

Co-supervisor :

PhD - McGill University

Principal supervisor :

Hsiu-Chin Lin

Léa Demeule

PhD - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Principal supervisor :

Liam Paull

Kumaraditya Gupta

PhD

Principal supervisor :

Collaborating researcher - Université de Montréal

Adriana Knatchbull-Hugessen

PhD - Université de Montréal

Artur Kuramshin

Master's Research - Université de Montréal

Website

Daniel Lawson

PhD - Université de Montréal

Co-supervisor :

Postdoctorate - Université de Montréal

Co-supervisor :

Master's Research - Université de Montréal

Postdoctorate - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Co-supervisor :

Marc Gendron-Bellemare

Collaborating researcher

Real-time Reinforcement Learning

Siddarth Venkatraman

PhD - Université de Montréal

Co-supervisor :

Blog Posts

Deux robots dans une cuisine, en train de préparer le dîner. L'un coupe les légumes et l'autre fait une omelette.

June 20, 2025

Ivan Anokhin

Matthew Riemer

Rishav Rishav

Gopeshh Subbaraj

Glen Berseth

Read the article

February 15, 2022

Fully Autonomous Real-World Reinforcement Learning with Applications to Mobile Manipulation Primary tabs View Edit(active tab) Delete Revisions

Jędrzej Orbik

Charles Sun

Coline Devin

Glen Berseth

Read the article

Publications

Adaptive Resolution Residual Networks — Generalizing Across Resolutions Easily and Efficiently

Léa Demeule

Mahtab Sandhu

The majority of signal data captured in the real world uses numerous sensors with different resolutions. In practice, most deep learning arc… (see more)hitectures are fixed-resolution; they consider a single resolution at training and inference time. This is convenient to implement but fails to fully take advantage of the diverse signal data that exists. In contrast, other deep learning architectures are adaptive-resolution; they directly allow various resolutions to be processed at training and inference time. This provides computational adaptivity but either sacrifices robustness or compatibility with mainstream layers, which hinders their use. In this work, we introduce Adaptive Resolution Residual Networks (ARRNs) to surpass this tradeoff. We construct ARRNs from Laplacian residuals, which serve as generic adaptive-resolution adapters for fixed-resolution layers. We use smoothing filters within Laplacian residuals to linearly separate input signals over a series of resolution steps. We can thereby skip Laplacian residuals to cast high-resolution ARRNs into low-resolution ARRNs that are computationally cheaper yet numerically identical over low-resolution signals. We guarantee this result when Laplacian residuals are implemented with perfect smoothing kernels. We complement this novel component with Laplacian dropout, which randomly omits Laplacian residuals during training. This regularizes for robustness to a distribution of lower resolutions. This also regularizes for numerical errors that may occur when Laplacian residuals are implemented with approximate smoothing kernels. We provide a solid grounding for the advantageous properties of ARRNs through a theoretical analysis based on neural operators, and empirically show that ARRNs embrace the challenge posed by diverse resolutions with computational adaptivity, robustness, and compatibility with mainstream layers.

2025-07-17

TMLR (accepted)

Curiosity-Driven Exploration via Temporal \\ Contrastive Learning

Faisal Mohamed

Catherine Ji

Benjamin Eysenbach

Exploration remains a key challenge in reinforcement learning (RL), especially in long-horizon tasks and environments with high-dimensional … (see more)observations. A common strategy for effective exploration is to promote state coverage or novelty, which often involves estimating the agent's state visitation distribution. In this paper, we propose \textbf{C}uriosity-Driven Exploration via \textbf{Te}mporal \textbf{C}ontrastive Learning (\methodName), an exploration method based on temporal contrastive learning that rewards agents for reaching states with unexpected futures. This incentivizes uncovering meaningful less-visited states. \methodName is simple and does not require explicit density or uncertainty estimation, while learning representations aligned with the RL objective. It consistently outperforms standard baselines in complex mazes using different embodiments (Ant and Humanoid) and robotic manipulation tasks, while also yielding more diverse behaviors in Craftax without requiring task-specific information.

2025-07-01

rl-conference.cc/RLC/2025/Workshop/RLBrew (published)

Curiosity-Driven Exploration via Temporal Contrastive Learning

Faisal Mohamed

Catherine Ji

Benjamin Eysenbach

Effective exploration in reinforcement learning requires keeping track not just of where the agent has been, but also of how the agent think… (see more)s about and represents the world: an agent should explore states that enable it to learn powerful representations. Temporal representations can include the information required to solve any potential task while avoiding the computational cost of reconstruction. In this paper, we propose an exploration method that uses temporal contrastive representations to drive exploration, maximizing coverage as seen through the lens of these temporal representations. We demonstrate complex exploration behaviors in locomotion, manipulation, and embodied-AI tasks, revealing previously unknown capabilities and behaviors once achievable only via extrinsic rewards.

2025-07-01

rl-conference.cc/RLC/2025/Workshop/RLBrew (published)

Is Exploration or Optimization the Problem for Deep Reinforcement Learning?

2025-07-01

rl-conference.cc/RLC/2025/Workshop/Finding_the_Frame (published)

Is Exploration or Optimization the Problem for Deep Reinforcement Learning?

In the era of deep reinforcement learning, making progress is more complex, as the collected experience must be compressed into a deep model… (see more) for future exploitation and sampling. Many papers have shown that training a deep learning policy under the changing state and action distribution leads to sub-optimal performance even collapse. This naturally leads to the concern that even if the community creates improved exploration algorithms or reward objectives, will those improvements fall on the \textit{deaf ears} of optimization difficulties. This work proposes a new \textit{pracitcal} sub-optimality estimator to determine optimization limitations of deep reinforcement learning algorithms. Through experiments acrossenvironments and RL algorithms, it is shown that the difference between the best data generated is

2025-07-01

rl-conference.cc/RLC/2025/Workshop/Finding_the_Frame (published)

Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning

Adriana Hugessen

Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in do… (see more)mains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective,

2025-07-01

rl-conference.cc/RLC/2025/Workshop/RLBrew (published)

doi.org

What Matters for Maximizing Data Reuse In Value-based Deep Reinforcement Learning

Roger Creus Castanyer

Pablo Samuel Castro

A key ingredient for successfully applying deep reinforcement learning to challenging tasks is the effective use of data at scale. Although … (see more)originally deep RL algorithms achieved this by storing past experiences collected from a synchronous actor in an external replay memory [DQN; Mnih et al., 2013], follow-up works scaled training by collecting data asynchronously through distributed actors [R2D2; Kapturowski et al., 2018], and more recently by GPU-optimized parallelization [PQN; Gallici et al., 2024]. We argue that DQN, PQN, and R2D2 constitute a group of value-based methods for parallel training and study them to shed light on the dynamics induced by varying data collection schemes. We conduct a thorough empirical study to better understand these dynamics, and propose the Data Replay Ratio as a novel metric for quantifying data reuse. Our findings suggest that maximizing data reuse involves directly addressing the deadly triad: Q-lambda rollouts for reducing the bias from bootstrapping, the use of LayerNorm for stabilizing function approximation, and parallelized data collection for mitigating off-policy divergence.

2025-07-01

rl-conference.cc/RLC/2025/Workshop/Finding_the_Frame (published)

What Matters for Maximizing Data Reuse In Value-based Deep Reinforcement Learning

Roger Creus Castanyer

Pablo Samuel Castro

2025-07-01

rl-conference.cc/RLC/2025/Workshop/Finding_the_Frame (published)

Zero-Shot Constraint Satisfaction with Forward- Backward Representations

Adriana Hugessen

Harley Wiltzer

Cyrus Neary

Amy Zhang

Traditionally, constrained policy optimization with Reinforcement Learning (RL) requires learning a new policy from scratch for any new envi… (see more)ronment, goal or cost function, with limited generalization to new tasks and constraints. Given the sample inefficiency of many common deep RL methods, this procedure can be impractical for many real-world scenarios, particularly when constraints or tasks are changing. As an alternative, in the unconstrained setting, various works have sought to pre-train representations from offline datasets to accelerate policy optimization upon specification of a reward. Such methods can permit faster adaptation to new tasks in a given environment, dramatically improving sample efficiency. Recently, zero-shot policy optimization has been explored by leveraging a particular

2025-07-01

rl-conference.cc/RLC/2025/Workshop/RLBrew (published)

Training PPO-Clip with Parallelized Data Generation: A Case of Fixed-Point Convergence

Homayoun Honari

Roger Creus Castanyer

Pablo Samuel Castro

In recent years, with the increase in the compute power of GPUs, parallelized data collection has become the dominant approach for training … (see more)reinforcement learning (RL) agents. Proximal Policy Optimization (PPO) is one of the widely-used on-policy methods for training RL agents. In this paper, we focus on the training behavior of PPO-Clip with the increase in the number of parallel environments. In particular, we show that as we increase the amount of data used to train PPO-Clip, the optimized policy would converge to a fixed distribution. We use the results to study the behavior of PPO-Clip in two case studies: the effect of change in the minibatch size and the effect of increase in the number of parallel environments versus the increase in the rollout lengths. The experiments show that settings with high-return PPO runs result in slower convergence to the fixed-distribution and higher consecutive KL divergence changes. Our results aim to offer a better understanding for the prediction of the performance of PPO with the scaling of the parallel environments.

2025-06-22

rl-conference.cc/RLC/2025/Workshop/IBRL (published)