
Pablo Samuel Castro

Core Industry Member
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Google DeepMind
Research Topics
Reinforcement Learning

Biography

Pablo Samuel Castro was born and raised in Quito, Ecuador, and moved to Montréal after high school to study at McGill University. There he obtained his PhD with a focus on reinforcement learning, under the supervision of Doina Precup and Prakash Panangaden. He is a research scientist at Google DeepMind in Montréal. He is particularly interested in fundamental reinforcement learning research and regularly advocates for increasing the representation of people of Latin American origin in the research community. He is also an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at the Université de Montréal. Beyond his interest in coding, artificial intelligence, and mathematics, Pablo Samuel is an active musician.

Current Students

PhD - UdeM
Principal supervisor:
Research Master's - UdeM
PhD - UdeM
Research collaborator
PhD - UdeM
Principal supervisor:
PhD - McGill
Principal supervisor:
PhD - McGill
Principal supervisor:
PhD - UdeM

Publications

Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn
Hongyao Tang
Johan Samir Obando Ceron
Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this paper, we study the loss of plasticity in deep continual RL from the lens of churn: network output variability for out-of-batch data induced by mini-batch training. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (C-CHAIN) and demonstrate it improves learning performance and outperforms baselines in a diverse range of continual learning environments on OpenAI Gym Control, ProcGen, DeepMind Control Suite, and MinAtar benchmarks.
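
To make the churn metric concrete, here is a minimal sketch of how one might measure it around a single update, assuming a PyTorch Q-network; `q_net`, `update_fn`, and `reference_states` are illustrative names, not the paper's code.

```python
# Hedged sketch of churn: the change in network outputs on held-out
# ("out-of-batch") states caused by one mini-batch gradient update.
import torch

def measure_churn(q_net, update_fn, batch, reference_states):
    """Mean absolute change in Q(reference_states) across a single update."""
    with torch.no_grad():
        q_before = q_net(reference_states).clone()
    update_fn(q_net, batch)  # one mini-batch gradient step on training data
    with torch.no_grad():
        q_after = q_net(reference_states)
    return (q_after - q_before).abs().mean().item()
```
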
The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning
Jiashun Liu
Johan Samir Obando Ceron
Ling Pan
Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning. This can help improve sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it can have the effect of "polluting" the replay buffer with data that exacerbates optimization challenges, in addition to wasting environment interactions on unproductive sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination based on Q-value and gradient statistics, which helps agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.
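
As a rough illustration of strategic early termination, the sketch below truncates an episode when the current Q-value falls into the low tail of recent estimates; the statistic and threshold are stand-ins, not LEAST's exact rule.

```python
# Hypothetical early-termination rule in the spirit of LEAST: stop an episode
# when value estimates suggest it has become unproductive. Illustrative only.
import numpy as np

class EarlyStopper:
    def __init__(self, quantile=0.1, warmup=1000):
        self.history, self.quantile, self.warmup = [], quantile, warmup

    def should_stop(self, q_value: float) -> bool:
        self.history.append(q_value)
        if len(self.history) < self.warmup:
            return False  # collect statistics before making decisions
        threshold = np.quantile(self.history[-self.warmup:], self.quantile)
        return q_value < threshold
```
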
The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks
Walter Mayor
Johan Samir Obando Ceron
The use of parallel actors for data collection has been an effective technique in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine the impact of these choices across network architectures, as well as hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.
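
The trade-off being analyzed shows up directly in how a rollout is configured: the per-iteration batch is num_envs × rollout_length, so a fixed budget can be spent on more parallel environments or on longer rollouts. A hedged sketch using gymnasium's vector API; the hyper-parameter values are arbitrary examples.

```python
# Two ways to spend the same 2048-step data budget per PPO iteration.
import gymnasium as gym

num_envs, rollout_length = 32, 64      # many environments, short rollouts
# num_envs, rollout_length = 8, 256    # same budget, longer rollouts

envs = gym.make_vec("CartPole-v1", num_envs=num_envs)
obs, _ = envs.reset(seed=0)
for _ in range(rollout_length):
    actions = envs.action_space.sample()  # stand-in for the policy
    obs, rewards, terms, truncs, infos = envs.step(actions)
```
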
Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL
Ghada Sokar
Johan Samir Obando Ceron
The use of deep neural networks in reinforcement learning (RL) often suffers from performance degradation as model size increases. While soft mixtures of experts (SoftMoEs) have recently shown promise in mitigating this issue for online RL, the reasons behind their effectiveness remain largely unknown. In this work we provide an in-depth analysis identifying the key factors driving this performance gain. We discover the surprising result that tokenizing the encoder output, rather than the use of multiple experts, is what is behind the efficacy of SoftMoEs. Indeed, we demonstrate that even with an appropriately scaled single expert, we are able to maintain the performance gains, largely thanks to tokenization.
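
The distinction between flattening and tokenizing is easy to state in code. In this illustrative sketch (shapes are assumptions), each spatial position of a convolutional feature map becomes a token rather than being folded into one long vector.

```python
# Minimal sketch of "tokenize, don't flatten" on a conv encoder's output.
import torch

features = torch.randn(32, 64, 7, 7)          # (batch, channels, H, W)

flattened = features.reshape(32, -1)          # (32, 3136): one vector per sample
tokens = features.flatten(2).transpose(1, 2)  # (32, 49, 64): H*W tokens of dim C

# `tokens` is what a SoftMoE-style layer (or a single scaled expert) consumes;
# the paper attributes the gains to this tokenization, not to the expert count.
```
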
Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning
Samuel Garcin
Trevor McInroe
Christopher G. Lucas
David Abel
Stefano V Albrecht
Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents. Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we here explore the principles that underlie effective representations for an actor and for a critic. We focus our study on understanding whether an actor and a critic will benefit from a decoupled, rather than shared, representation. Our primary finding is that when decoupled, the representations for the actor and critic systematically specialise in extracting different types of information from the environment: the actor's representation tends to focus on action-relevant information, while the critic's representation specialises in encoding value and dynamics information. Finally, we demonstrate how these insights help select representation learning objectives that play into the actor's and critic's respective knowledge specialisations, and improve performance in terms of agent returns.
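
A minimal sketch of the shared-versus-decoupled design choice studied here, assuming simple PyTorch MLPs; module names and sizes are illustrative.

```python
import torch.nn as nn

def encoder(obs_dim, dim=256):
    return nn.Sequential(nn.Linear(obs_dim, dim), nn.ReLU())

obs_dim, act_dim = 8, 2

# Shared: one representation feeds both heads.
shared = encoder(obs_dim)
actor_shared = nn.Sequential(shared, nn.Linear(256, act_dim))
critic_shared = nn.Sequential(shared, nn.Linear(256, 1))

# Decoupled: each head learns its own representation, free to specialise
# (action-relevant features for the actor; value/dynamics for the critic).
actor = nn.Sequential(encoder(obs_dim), nn.Linear(256, act_dim))
critic = nn.Sequential(encoder(obs_dim), nn.Linear(256, 1))
```
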
Overcoming State and Action Space Disparities in Multi-Domain, Multi-Task Reinforcement Learning
Reginald McLean
Kai Yuan
Isaac Woungang
Nariman Farsad
Current multi-task reinforcement learning (MTRL) methods have the ability to perform a large number of tasks with a single policy. However, when attempting to interact with a new domain, the MTRL agent would need to be re-trained due to differences in domain dynamics and structure. Because of these limitations, we are forced to train multiple policies even though tasks may share dynamics, which requires more samples and is thus sample inefficient. In this work, we explore the ability of MTRL agents to learn in various domains with various dynamics by simultaneously learning in multiple domains, without the need to fine-tune extra policies. In doing so we find that an MTRL agent trained in multiple domains induces an increase in sample efficiency of up to 70% while maintaining the overall success rate of the MTRL agent.
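
One generic way to reconcile mismatched state and action spaces, sketched below, is to zero-pad observations (and analogously actions) to the largest dimensionality so a single policy network can consume every domain; this is a common technique and not necessarily the paper's exact mechanism.

```python
# Hedged sketch: zero-padding observations across domains of different sizes.
import numpy as np

def pad_to(x, target_dim):
    out = np.zeros(target_dim, dtype=np.float32)
    out[: x.shape[0]] = x
    return out

obs_dims = {"cheetah": 17, "reacher": 6}     # illustrative domain dimensions
max_obs = max(obs_dims.values())

obs = np.random.randn(6).astype(np.float32)  # a "reacher" observation
obs_padded = pad_to(obs, max_obs)            # shape (17,): shared policy input
```
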
CALE: Continuous Arcade Learning Environment
Jesse Farebrother
We introduce the Continuous Arcade Learning Environment (CALE), an extension of the well-known Arcade Learning Environment (ALE) [Bellemare et al., 2013]. The CALE uses the same underlying emulator of the Atari 2600 gaming system (Stella), but adds support for continuous actions. This enables the benchmarking and evaluation of continuous-control agents (such as PPO [Schulman et al., 2017] and SAC [Haarnoja et al., 2018]) and value-based agents (such as DQN [Mnih et al., 2015] and Rainbow [Hessel et al., 2018]) on the same environment suite. We provide a series of open questions and research directions that CALE enables, as well as initial baseline results using Soft Actor-Critic. CALE is available as part of the ALE at https://github.com/Farama-Foundation/Arcade-Learning-Environment.
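
A hedged usage sketch: recent ale-py releases expose CALE through gymnasium via a `continuous` flag; verify the exact API against the repository linked above.

```python
# Assumed usage based on recent ale-py/gymnasium releases; check the docs.
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)                        # registers ALE/* environments
env = gym.make("ALE/Breakout-v5", continuous=True)
obs, _ = env.reset(seed=0)
action = env.action_space.sample()               # now a continuous Box action
obs, reward, terminated, truncated, info = env.step(action)
```
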
Adaptive Accompaniment with ReaLchords
Yusong Wu
Tim Cooijmans
Kyle Kastner
Adam Roberts
Ian Simon
Alexander Scarlatos
Chris Donahue
Cassie Tarakajian
Shayegan Omidshafiei
Natasha Jaques
Jamming requires coordination, anticipation, and collaborative creativity between musicians. Current generative models of music produce expressive output but are not able to generate in an online manner, meaning simultaneously with other musicians (human or otherwise). We propose ReaLchords, an online generative model for improvising chord accompaniment to a user's melody. We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use. The finetuning objective leverages a novel reward model that provides feedback on both harmonic and temporal coherency between melody and chords, and a divergence term that implements a novel type of distillation from a teacher model that can see the future melody. Through quantitative experiments and listening tests, we demonstrate that the resulting model adapts well to unfamiliar input and produces fitting accompaniment. ReaLchords opens the door to live jamming, as well as simultaneous co-creation in other modalities.
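
Schematically, the finetuning objective combines a reward-model term with a distillation term toward the future-aware teacher. The sketch below is illustrative only; all names are assumptions and this is not the ReaLchords training code.

```python
# Schematic finetuning loss: reward-weighted policy gradient plus KL
# distillation from a teacher that sees the future melody. Illustrative only.
import torch.nn.functional as F

def finetune_loss(student_logits, teacher_logits, log_prob_action, reward, beta=0.1):
    # Policy-gradient-style term: favor chords the reward model scores well.
    pg_term = -(reward * log_prob_action).mean()
    # Distillation term: KL from the future-aware teacher to the online student.
    distill = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return pg_term + beta * distill
```
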
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Jesse Farebrother
Jordi Orbay
Quan Vuong
Adrien Ali Taiga
Yevgen Chebotar
Ted Xiao
Alex Irpan
Sergey Levine
Aleksandra Faust
Aviral Kumar
Value functions are an essential component in deep reinforcement learning (RL); they are typically trained via mean squared error regression to match bootstrapped target values. However, scaling value-based RL methods to large networks has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We show that training value functions with categorical cross-entropy significantly enhances performance and scalability across various domains, including single-task RL on Atari 2600 games, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that categorical cross-entropy mitigates issues inherent to value-based RL, such as noisy targets and non-stationarity. We argue that shifting to categorical cross-entropy for training value functions can substantially improve the scalability of deep RL at little-to-no cost.
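
As an illustration of classification-based value training, the sketch below projects a scalar target onto a fixed categorical support with a "two-hot" encoding and minimizes cross-entropy; the paper studies several such constructions (e.g. HL-Gauss), and the support range here is an arbitrary assumption.

```python
# Hedged sketch: value regression recast as classification over a support.
import torch
import torch.nn.functional as F

def two_hot(target, support):
    """Distribute a scalar target over the two nearest support atoms."""
    target = target.clamp(support[0], support[-1])
    idx = int(torch.searchsorted(support, target).clamp(1, len(support) - 1))
    lo, hi = support[idx - 1], support[idx]
    w_hi = (target - lo) / (hi - lo)
    probs = torch.zeros(len(support))
    probs[idx - 1], probs[idx] = 1 - w_hi, w_hi
    return probs

support = torch.linspace(-10.0, 10.0, steps=51)   # value range is an assumption
target_probs = two_hot(torch.tensor(3.7), support)
logits = torch.randn(51)                          # network's categorical output
loss = F.cross_entropy(logits.unsqueeze(0), target_probs.unsqueeze(0))
```
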
In value-based deep reinforcement learning, a pruned network is a good network
Johan Samir Obando Ceron
Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables value-based agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks, using only a small fraction of the full network parameters. Our code is publicly available; see Appendix A for details.
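
For context, gradual magnitude pruning typically follows a polynomial sparsity schedule (Zhu & Gupta, 2017): sparsity ramps from zero to a final level during training, and at each step the smallest-magnitude weights are masked. A sketch with illustrative defaults, not the paper's exact settings:

```python
# Polynomial sparsity schedule for gradual magnitude pruning.
def sparsity_at(step, final_sparsity=0.95, start=2_000, end=100_000):
    if step < start:
        return 0.0
    frac = min(1.0, (step - start) / (end - start))
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

# At each update, the smallest-magnitude weights are masked so the network
# keeps only the (1 - sparsity) fraction of largest-magnitude parameters.
```
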