Johan Samir Obando Ceron

Mixtures of Experts Unlock Parameter Scaling for Deep RL

Ghada Sokar

Timon Willi

Clare Lyle

Jakob Nicolaus Foerster

The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.

2024-05-01

ICML.cc/2024/Conference (spotlight)

Mixtures of Experts Unlock Parameter Scaling for Deep RL

Ghada Sokar

Timon Willi

Clare Lyle

Jakob Nicolaus Foerster

2024-02-13

ArXiv (preprint)

JaxPruner: A concise library for sparsity research

Joo Hyung Lee

Wonpyo Park

Nicole Elyse Mitchell

Jonathan Pilault

Han-Byul Kim

Namhoon Lee

Elias Frantar

Yun Long

Amir Yazdanbakhsh

Shivani Agrawal

Suvinay Subramanian

Xin Wang

Sheng-Chun Kao

Xingyao Zhang

Trevor Gale

Aart J.C. Bik

Woohyun Han

Milen Ferev

Zhonglin Han … (5 more authors)

Hong-Seok Kim

Yann Dauphin

Utku Evci

This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX-based libraries. We demonstrate this ease of integration by providing examples in four different codebases: Scenic, t5x, Dopamine and FedJAX, and provide baseline experiments on popular benchmarks.

2024-01-08

Conference on Parsimony and Learning (published)

In deep reinforcement learning, a pruned network is a good network

Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks and exhibit a type of "scaling law", using only a small fraction of the full network parameters.

2024-01-01

ICML (published)

Mixture of Experts in a Mixture of RL settings

Timon Willi

Jakob Nicolaus Foerster

2024-01-01

RLC (published)

On the consistency of hyper-parameter selection in value-based deep reinforcement learning

João Guilherme Madeira Araújo

Deep reinforcement learning (deep RL) has achieved tremendous success on various domains through a combination of algorithmic design and careful selection of hyper-parameters. Algorithmic improvements are often the result of iterative enhancements built upon prior approaches, while hyper-parameter choices are typically inherited from previous methods or fine-tuned specifically for the proposed technique. Despite their crucial impact on performance, hyper-parameter choices are frequently overshadowed by algorithmic advancements. This paper conducts an extensive empirical study focusing on the reliability of hyper-parameter selection for value-based deep reinforcement learning agents, including the introduction of a new score to quantify the consistency and reliability of various hyper-parameters. Our findings not only help establish which hyper-parameters are most critical to tune, but also help clarify which tunings remain consistent across different training regimes.

2024-01-01

RLJ (published)

Small batch deep reinforcement learning

Marc Gendron-Bellemare

In value-based deep reinforcement learning with replay memories, the batch size parameter specifies how many transitions to sample for each gradient update. Although critical to the learning process, this value is typically not adjusted when proposing new algorithms. In this work we present a broad empirical study that suggests reducing the batch size can result in a number of significant performance gains; this is surprising, as the general tendency when training neural networks is towards larger batch sizes for improved performance. We complement our experimental findings with a set of empirical analyses towards better understanding this phenomenon.

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Max Schwarzer

Marc Gendron-Bellemare

Rishabh Agarwal

We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on scaling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Max Schwarzer

Marc Gendron-Bellemare

Rishabh Agarwal

We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on scaling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.

2023-05-30

ArXiv (preprint)