Portrait of Pablo Samuel Castro

Pablo Samuel Castro

Core Industry Member
Adjunct professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Google DeepMind
Research Topics
Reinforcement Learning

Biography

Pablo Samuel Castro was born and raised in Quito, Ecuador, and moved to Montréal after high school to study at McGill University. For his PhD, he studied reinforcement learning with Doina Precup and Prakash Panangaden at McGill. Castro has been working at Google for over eleven years. He is currently a staff research scientist at Google DeepMind in Montreal, where he conducts fundamental reinforcement learning research and is a regular advocate for increasing LatinX representation in the research community.

He is also an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. In addition to his interest in coding, AI and math, Castro is an active musician.

Current Students

PhD - Université de Montréal
Principal supervisor :
Master's Research - Université de Montréal
Collaborating researcher
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
PhD - McGill University
Principal supervisor :
PhD - McGill University
Principal supervisor :
PhD - Université de Montréal

Publications

Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Jesse Farebrother
Jordi Orbay
Quan Vuong
Adrien Ali Taiga
Yevgen Chebotar
Ted Xiao
Alex Irpan
Sergey Levine
Aleksandra Faust
Aviral Kumar
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Jesse Farebrother
Jordi Orbay
Quan Vuong
Adrien Ali Taiga
Yevgen Chebotar
Ted Xiao
Alex Irpan
Sergey Levine
Aleksandra Faust
Aviral Kumar
Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained … (see more)using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Jesse Farebrother
Jordi Orbay
Quan Vuong
Adrien Ali Taiga
Yevgen Chebotar
Ted Xiao
Alex Irpan
Sergey Levine
Aleksandra Faust
Aviral Kumar
Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained … (see more)using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Jesse Farebrother
Jordi Orbay
Quan Ho Vuong
Adrien Ali Taiga
Yevgen Chebotar
Ted Xiao
A. Irpan
Sergey Levine
Aleksandra Faust
Aviral Kumar
Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained … (see more)using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Jesse Farebrother
Jordi Orbay
Quan Vuong
Adrien Ali Taiga
Yevgen Chebotar
Ted Xiao
Alex Irpan
Sergey Levine
Aleksandra Faust
Aviral Kumar
Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained … (see more)using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
A density estimation perspective on learning from pairwise human preferences
Vincent Dumoulin
Daniel D. Johnson
Yann Dauphin
Learning from human feedback (LHF) -- and in particular learning from pairwise preferences -- has recently become a crucial ingredient in tr… (see more)aining large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on"annotator misspecification"-- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models -- suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
Mixtures of Experts Unlock Parameter Scaling for Deep RL
Johan Samir Obando Ceron
Ghada Sokar
Timon Willi
Clare Lyle
Jesse Farebrother
Jakob Nicolaus Foerster
The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance s… (see more)cales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
Mixtures of Experts Unlock Parameter Scaling for Deep RL
Johan Samir Obando Ceron
Ghada Sokar
Timon Willi
Clare Lyle
Jesse Farebrother
Jakob Nicolaus Foerster
Mixtures of Experts Unlock Parameter Scaling for Deep RL
Johan Samir Obando Ceron
Ghada Sokar
Timon Willi
Clare Lyle
Jesse Farebrother
Jakob Nicolaus Foerster
The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance s… (see more)cales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
JaxPruner: A concise library for sparsity research
Joo Hyung Lee
Wonpyo Park
Nicole Elyse Mitchell
Jonathan Pilault
Johan Samir Obando Ceron
Han-Byul Kim
Namhoon Lee
Elias Frantar
Yun Long
Amir Yazdanbakhsh
Shivani Agrawal
Suvinay Subramanian
Xin Wang
Sheng-Chun Kao
Xingyao Zhang
Trevor Gale
Aart J.C. Bik
Woohyun Han
Milen Ferev
Zhonglin Han … (see 5 more)
Hong-Seok Kim
Yann Dauphin
Utku Evci
This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims … (see more)to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX based libraries. We demonstrate this ease of integration by providing examples in four different codebases: Scenic, t5x, Dopamine and FedJAX and provide baseline experiments on popular benchmarks.
In deep reinforcement learning, a pruned network is a good network
Johan Samir Obando Ceron
Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage pri… (see more)or insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks and exhibit a type of"scaling law", using only a small fraction of the full network parameters.
Mixture of Experts in a Mixture of RL settings
Timon Willi
Johan Samir Obando Ceron
Jakob Nicolaus Foerster