Portrait of Pablo Samuel Castro

Pablo Samuel Castro

Core Industry Member
Adjunct professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Google DeepMind
Research Topics
Reinforcement Learning

Biography

Pablo Samuel Castro was born and raised in Quito, Ecuador, and moved to Montréal after high school to study at McGill University. For his PhD, he studied reinforcement learning with Doina Precup and Prakash Panangaden at McGill. Castro has been working at Google for over eleven years. He is currently a staff research scientist at Google DeepMind in Montréal, where he conducts fundamental reinforcement learning research and is a regular advocate for increasing LatinX representation in the research community.

He is also an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. In addition to his interest in coding, AI and math, Castro is an active musician.

Current Students

PhD - Université de Montréal
Principal supervisor:
Independent visiting researcher - RWTH Aachen University
Collaborating Alumni - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
Collaborating researcher
PhD - Université de Montréal
Principal supervisor:
PhD - McGill University
Principal supervisor:
PhD - McGill University
Principal supervisor:
PhD - Université de Montréal

Publications

A density estimation perspective on learning from pairwise human preferences
Learning from human feedback (LHF), and in particular learning from pairwise preferences, has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification" (failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly adapted models), suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
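The abstract above refers to training a reward function on pairwise preferences. A common way to do this, assumed here purely for illustration, is the Bradley-Terry model, under which an annotator prefers item a over item b with probability sigmoid(r(a) - r(b)). The sketch below is not the paper's code: it fits one hypothetical scalar reward per item by stochastic gradient ascent on the Bradley-Terry log-likelihood, and the function names are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of one preference under the (assumed) Bradley-Terry model."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def fit_tabular_rewards(pairs, n_items, lr=0.1, epochs=200):
    """Fit one reward per item from (winner, loser) index pairs.

    The gradient of log sigmoid(r_w - r_l) with respect to r_w is
    1 - sigmoid(r_w - r_l), and its negative with respect to r_l.
    """
    rewards = [0.0] * n_items
    for _ in range(epochs):
        for winner, loser in pairs:
            grad = 1.0 - sigmoid(rewards[winner] - rewards[loser])
            rewards[winner] += lr * grad
            rewards[loser] -= lr * grad
    return rewards
```

For example, if item 0 is preferred over item 1 in three of four comparisons, the fitted rewards give sigmoid(r[0] - r[1]) close to 0.75, i.e. the reward difference recovers the empirical preference probability, which is the kind of correspondence between rewards and an annotator's preference distribution that the abstract discusses.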
Mixtures of Experts Unlock Parameter Scaling for Deep RL
The recent rapid progress in (self-)supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Experts (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
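Soft MoE layers (Puigcerver et al., 2023) replace discrete token-to-expert routing with soft assignments: each expert slot receives a weighted average of all input tokens, and each output token is a weighted average of all slot outputs. The pure-Python sketch below follows that description under simplifying assumptions (no normalization of inputs or slot parameters, experts passed in as plain callables, hypothetical function names); it is not the authors' value-network implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (r x k) list-of-lists matrix by a (k x c) one."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def soft_moe(tokens, phi, experts, slots_per_expert):
    """tokens: m x d; phi: d x (n * p) slot parameters; experts: n callables,
    each mapping a (p x d) block of slots to a (p x d) block of outputs."""
    logits = matmul(tokens, phi)  # m x (n * p)
    # Dispatch weights: softmax over tokens (each column of logits), so every
    # slot is a convex combination of all input tokens.
    dispatch = [softmax(col) for col in transpose(logits)]  # (n * p) x m
    slots = matmul(dispatch, tokens)  # (n * p) x d
    # Each expert processes its own contiguous block of p slots.
    slot_outputs = []
    for i, expert in enumerate(experts):
        block = slots[i * slots_per_expert:(i + 1) * slots_per_expert]
        slot_outputs.extend(expert(block))
    # Combine weights: softmax over slots (each row of logits), so every output
    # token is a convex combination of all slot outputs.
    combine = [softmax(row) for row in logits]  # m x (n * p)
    return matmul(combine, slot_outputs)  # m x d
```

Because both the dispatch and combine weights are dense softmaxes, every token contributes to every expert and the layer stays fully differentiable, with no load-balancing heuristics needed for discrete routing.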
Mixtures of Experts Unlock Parameter Scaling for Deep RL
The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance s… (see more)cales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
Mixtures of Experts Unlock Parameter Scaling for Deep RL
JaxPruner: A concise library for sparsity research
Joo Hyung Lee
Wonpyo Park
Nicole Elyse Mitchell
Han-Byul Kim
Namhoon Lee
Elias Frantar
Yun Long
Amir Yazdanbakhsh
Shivani Agrawal
Suvinay Subramanian
Sheng-Chun Kao
Xingyao Zhang
Trevor Gale
Aart J.C. Bik
Woohyun Han
Milen Ferev
Zhonglin Han … (5 more authors)
Hong-Seok Kim
Utku Evci
This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX-based libraries. We demonstrate this ease of integration by providing examples in four different codebases (Scenic, t5x, Dopamine and FedJAX), and we provide baseline experiments on popular benchmarks.
In deep reinforcement learning, a pruned network is a good network
Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks and exhibit a type of "scaling law", using only a small fraction of the full network parameters.
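Gradual magnitude pruning raises a network's sparsity over the course of training and, at each pruning step, zeroes out the weights with the smallest absolute values. The sketch below assumes the cubic sparsity schedule of Zhu and Gupta (2017), a common choice for this technique; the function names and the flat list of weights are hypothetical simplifications, not the paper's or JaxPruner's API.

```python
def target_sparsity(step, start, end, final_sparsity, initial_sparsity=0.0):
    """Cubic schedule (assumed): sparsity ramps from initial to final between
    `start` and `end`, rising quickly at first and flattening out near `end`."""
    if step <= start:
        return initial_sparsity
    if step >= end:
        return final_sparsity
    frac = (step - start) / (end - start)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - frac) ** 3


def magnitude_mask(weights, sparsity):
    """Binary mask that zeroes the `sparsity` fraction of smallest-|w| weights."""
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def prune(weights, step, start, end, final_sparsity):
    """Apply the mask for the current step's target sparsity."""
    mask = magnitude_mask(weights, target_sparsity(step, start, end, final_sparsity))
    return [w * m for w, m in zip(weights, mask)]
```

Halfway through a schedule that targets 90% sparsity, this sketch already prunes 78.75% of the weights (0.9 - 0.9 * 0.5**3), leaving the remaining training steps to adapt to the last, more gradual increments.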
Mixture of Experts in a Mixture of RL settings
On the consistency of hyper-parameter selection in value-based deep reinforcement learning
Deep reinforcement learning (deep RL) has achieved tremendous success in various domains through a combination of algorithmic design and careful selection of hyper-parameters. Algorithmic improvements are often the result of iterative enhancements built upon prior approaches, while hyper-parameter choices are typically inherited from previous methods or fine-tuned specifically for the proposed technique. Despite their crucial impact on performance, hyper-parameter choices are frequently overshadowed by algorithmic advancements. This paper conducts an extensive empirical study focusing on the reliability of hyper-parameter selection for value-based deep reinforcement learning agents, including the introduction of a new score to quantify the consistency and reliability of various hyper-parameters. Our findings not only help establish which hyper-parameters are most critical to tune, but also help clarify which tunings remain consistent across different training regimes.
Learning Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy
Joshua Greaves
Kevin Roccapriore
Ekin Dogus Cubuk
Sergei Kalinin
Igor Mordatch
We introduce a machine learning approach to determine the transition rates of silicon atoms on a single layer of carbon atoms, when stimulated by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition rates. These rates are then applied to guide a single silicon atom throughout the lattice to predetermined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.
Learning Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy
Joshua Greaves
Kevin Roccapriore
Ekin Dogus Cubuk
Sergei Kalinin
Igor Mordatch
We introduce a machine learning approach to determine the transition rates of silicon atoms on a single layer of carbon atoms, when stimulat… (see more)ed by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition rates. These rates are then applied to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.
Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks
Maxime Chevalier-Boisvert
Bolun Dai
Mark Towers
Rodrigo De Lazcano Perez-Vicente
Suman Pal
J K Terry
We present the Minigrid and Miniworld libraries, which provide a suite of goal-oriented 2D and 3D environments. The libraries were explicitly created with a minimalistic design paradigm to allow users to rapidly develop new environments for a wide range of research-specific needs. As a result, both have been widely adopted by the RL community, facilitating research in a wide range of areas. In this paper, we outline the design philosophy, environment details, and their world generation API. We also showcase the additional capabilities brought by the unified API between Minigrid and Miniworld through case studies on transfer learning (for both RL agents and humans) between the different observation spaces. The source code of Minigrid and Miniworld can be found at https://github.com/Farama-Foundation/Minigrid and https://github.com/Farama-Foundation/Miniworld along with their documentation at https://minigrid.farama.org/ and https://miniworld.farama.org/.
Small batch deep reinforcement learning
In value-based deep reinforcement learning with replay memories, the batch size parameter specifies how many transitions to sample for each gradient update. Although critical to the learning process, this value is typically not adjusted when proposing new algorithms. In this work we present a broad empirical study that suggests reducing the batch size can result in a number of significant performance gains; this is surprising, as the general tendency when training neural networks is towards larger batch sizes for improved performance. We complement our experimental findings with a set of empirical analyses towards better understanding this phenomenon.
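The batch size studied here is the number of transitions drawn from the replay memory for each gradient update. A minimal sketch of that sampling step, assuming a uniform replay buffer that samples with replacement; the class and its methods are hypothetical and are not Dopamine's replay implementation:

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class UniformReplayBuffer:
    """Hypothetical fixed-capacity replay memory with uniform sampling."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        """Draw `batch_size` transitions uniformly, with replacement; the result
        is the minibatch used for a single gradient update."""
        if not self.memory:
            raise ValueError("cannot sample from an empty buffer")
        return self.rng.choices(self.memory, k=batch_size)
```

In this sketch, moving to a smaller batch changes only the `batch_size` passed to `sample`: each gradient update sees fewer transitions, while the number of updates per environment step is left unchanged.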