Portrait de Pablo Samuel Castro

Pablo Samuel Castro

Membre industriel principal
Professeur associé, Université de Montréal, Département d'informatique et de recherche opérationnelle
Chercheur scientifique, Google DeepMind
Sujets de recherche
Apprentissage par renforcement

Biographie

Pablo Samuel Castro est né et a grandi à Quito, en Équateur, et a déménagé à Montréal après l'école secondaire pour étudier à l’Université McGill. Il y a obtenu un doctorat en se concentrant sur l'apprentissage par renforcement, sous la supervision de Doina Precup et Prakash Panangaden. Il est chercheur scientifique à Google DeepMind à Montréal. Il s’intéresse particulièrement à la recherche fondamentale sur l'apprentissage par renforcement et plaide régulièrement en faveur d'une augmentation de la représentation des personnes d’origine latino-américaine dans la communauté de recherche. Il est également professeur adjoint au Département d'informatique et de recherche opérationnelle (DIRO) de l'Université de Montréal. Outre son intérêt pour le codage, l'intelligence artificielle et les mathématiques, Pablo Samuel est un musicien actif.

Étudiants actuels

Doctorat - UdeM
Superviseur⋅e principal⋅e :
Visiteur de recherche indépendant - RWTH Aachen University
Maîtrise recherche - UdeM
Doctorat - UdeM
Collaborateur·rice de recherche
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Doctorat - McGill
Superviseur⋅e principal⋅e :
Doctorat - McGill
Superviseur⋅e principal⋅e :
Doctorat - UdeM

Publications

Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning
Jiashun Liu
Johan S. Obando-Ceron
Han Lu
Yancheng He
Weixun Wang
Wenbo Su
Bo Zheng
Ling Pan
Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely prag… (voir plus)matic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critics role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.
A Geometric Lens on RL Environment Complexity Based on Ricci Curvature
We introduce Ollivier-Ricci Curvature (ORC) as an information-geometric tool for analyzing the local structure of reinforcement learning (RL… (voir plus)) environments. We establish a novel connection between ORC and the Successor Representation (SR), enabling a geometric interpretation of environment dynamics decoupled from reward signals. Our analysis shows that states with positive and negative ORC values correspond to regions where random walks converge and diverge respectively, which are often critical for effective exploration. ORC is highly correlated with established environment complexity metrics, yet integrates naturally with standard RL frameworks based on SR and provides both global and local complexity measures. Leveraging this property, we propose an ORC-based intrinsic reward that guides agents toward divergent regions and away from convergent traps. Empirical results demonstrate that our curvature-driven reward substantially improves exploration performance across diverse environments, outperforming both random and count-based intrinsic baselines.
A Geometric Lens on RL Environment Complexity Based on Ricci Curvature
We introduce Ollivier-Ricci Curvature (ORC) as an information-geometric tool for analyzing the local structure of reinforcement learning (RL… (voir plus)) environments. We establish a novel connection between ORC and the Successor Representation (SR), enabling a geometric interpretation of environment dynamics decoupled from reward signals. Our analysis shows that states with positive and negative ORC values correspond to regions where random walks converge and diverge respectively, which are often critical for effective exploration. ORC is highly correlated with established environment complexity metrics, yet integrates naturally with standard RL frameworks based on SR and provides both global and local complexity measures. Leveraging this property, we propose an ORC-based intrinsic reward that guides agents toward divergent regions and away from convergent traps. Empirical results demonstrate that our curvature-driven reward substantially improves exploration performance across diverse environments, outperforming both random and count-based intrinsic reward baselines.
A Geometric Lens on RL Environment Complexity Based on Ricci Curvature
We introduce Ollivier-Ricci Curvature (ORC) as an information-geometric tool for analyzing the local structure of reinforcement learning (RL… (voir plus)) environments. We establish a novel connection between ORC and the Successor Representation (SR), enabling a geometric interpretation of environment dynamics decoupled from reward signals. Our analysis shows that states with positive and negative ORC values correspond to regions where random walks converge and diverge respectively, which are often critical for effective exploration. ORC is highly correlated with established environment complexity metrics, yet integrates naturally with standard RL frameworks based on SR and provides both global and local complexity measures. Leveraging this property, we propose an ORC-based intrinsic reward that guides agents toward divergent regions and away from convergent traps. Empirical results demonstrate that our curvature-driven reward substantially improves exploration performance across diverse environments, outperforming both random and count-based intrinsic baselines.
What Matters for Maximizing Data Reuse In Value-based Deep Reinforcement Learning
Roger Creus Castanyer
A key ingredient for successfully applying deep reinforcement learning to challenging tasks is the effective use of data at scale. Although … (voir plus)originally deep RL algorithms achieved this by storing past experiences collected from a synchronous actor in an external replay memory [DQN; Mnih et al., 2013], follow-up works scaled training by collecting data asynchronously through distributed actors [R2D2; Kapturowski et al., 2018], and more recently by GPU-optimized parallelization [PQN; Gallici et al., 2024]. We argue that DQN, PQN, and R2D2 constitute a group of value-based methods for parallel training and study them to shed light on the dynamics induced by varying data collection schemes. We conduct a thorough empirical study to better understand these dynamics, and propose the Data Replay Ratio as a novel metric for quantifying data reuse. Our findings suggest that maximizing data reuse involves directly addressing the deadly triad: Q-lambda rollouts for reducing the bias from bootstrapping, the use of LayerNorm for stabilizing function approximation, and parallelized data collection for mitigating off-policy divergence.
What Matters for Maximizing Data Reuse In Value-based Deep Reinforcement Learning
Roger Creus Castanyer
A key ingredient for successfully applying deep reinforcement learning to challenging tasks is the effective use of data at scale. Although … (voir plus)originally deep RL algorithms achieved this by storing past experiences collected from a synchronous actor in an external replay memory [DQN; Mnih et al., 2013], follow-up works scaled training by collecting data asynchronously through distributed actors [R2D2; Kapturowski et al., 2018], and more recently by GPU-optimized parallelization [PQN; Gallici et al., 2024]. We argue that DQN, PQN, and R2D2 constitute a group of value-based methods for parallel training and study them to shed light on the dynamics induced by varying data collection schemes. We conduct a thorough empirical study to better understand these dynamics, and propose the Data Replay Ratio as a novel metric for quantifying data reuse. Our findings suggest that maximizing data reuse involves directly addressing the deadly triad: Q-lambda rollouts for reducing the bias from bootstrapping, the use of LayerNorm for stabilizing function approximation, and parallelized data collection for mitigating off-policy divergence.
A Survey of State Representation Learning for Deep Reinforcement Learning
Representation learning methods are an important tool for addressing the challenges posed by complex observations spaces in sequential decis… (voir plus)ion making problems. Recently, many methods have used a wide variety of types of approaches for learning meaningful state representations in reinforcement learning, allowing better sample efficiency, generalization, and performance. This survey aims to provide a broad categorization of these methods within a model-free online setting, exploring how they tackle the learning of state representations differently. We categorize the methods into six main classes, detailing their mechanisms, benefits, and limitations. Through this taxonomy, our aim is to enhance the understanding of this field and provide a guide for new researchers. We also discuss techniques for assessing the quality of representations, and detail relevant future directions.
Training PPO-Clip with Parallelized Data Generation: A Case of Fixed-Point Convergence
In recent years, with the increase in the compute power of GPUs, parallelized data collection has become the dominant approach for training … (voir plus)reinforcement learning (RL) agents. Proximal Policy Optimization (PPO) is one of the widely-used on-policy methods for training RL agents. In this paper, we focus on the training behavior of PPO-Clip with the increase in the number of parallel environments. In particular, we show that as we increase the amount of data used to train PPO-Clip, the optimized policy would converge to a fixed distribution. We use the results to study the behavior of PPO-Clip in two case studies: the effect of change in the minibatch size and the effect of increase in the number of parallel environments versus the increase in the rollout lengths. The experiments show that settings with high-return PPO runs result in slower convergence to the fixed-distribution and higher consecutive KL divergence changes. Our results aim to offer a better understanding for the prediction of the performance of PPO with the scaling of the parallel environments.
Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning
Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure m… (voir plus)ode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.
The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning
Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning. This can help i… (voir plus)mprove sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it can have the effect of"polluting"the replay buffer with data which can exacerbate optimization challenges in addition to wasting environment interactions due to wasteful sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination based on Q-value and gradient statistics, which helps agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.
Meta-World+: An Improved, Standardized, RL Benchmark
Reginald McLean
Evangelos Chatzaroulas
Luc McCutcheon
Frank Röder
Tianhe Yu
Zhanpeng He
K.R. Zentner
Ryan Julian
J K Terry
Isaac Woungang
Nariman Farsad
Multi-task reinforcement learning challenges agents to master diverse skills simultaneously, and Meta-World emerged as the gold standard ben… (voir plus)chmark for evaluating these algorithms. However, since the introduction of the Meta-World benchmark there have been numerous undocumented changes which inhibit fair comparison of multi-task and meta reinforcement learning algorithms. This work strives to disambiguate these results from the literature, while also producing an open-source version of Meta-World that has full reproducibility of past results.
Continual Learning in Vision-Language Models via Aligned Model Merging
Ghada Sokar
Anurag Arnab
Ahmet Iscen
Cordelia Schmid
Continual learning is conventionally tackled through sequential fine-tuning, a process that, while enabling adaptation, inherently favors pl… (voir plus)asticity over the stability needed to retain prior knowledge. While existing approaches attempt to mitigate catastrophic forgetting, a bias towards recent tasks persists as they build upon this sequential nature. In this work we present a new perspective based on model merging to maintain stability while still retaining plasticity. Rather than just sequentially updating the model weights, we propose merging newly trained task parameters with previously learned ones, promoting a better balance. To maximize the effectiveness of the merging process, we propose a simple mechanism that promotes learning aligned weights with previous ones, thereby avoiding interference when merging. We evaluate this approach on large Vision-Language Models (VLMs), and demonstrate its effectiveness in reducing forgetting, increasing robustness to various task orders and similarities, and improving generalization.