Evgenii Nikishin

Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

2025-02-13

TMLR (accepted)

doi.org

openreview.net

Forgetting Transformer: Softmax Attention with a Forget Gate

Zhixuan Lin

Evgenii Nikishin

Xu He

Aaron Courville

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we… (see more) show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a ``Pro'' block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at [`https://github.com/zhixuan-lin/forgetting-transformer`](https://github.com/zhixuan-lin/forgetting-transformer).

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

Forgetting Transformer: Softmax Attention with a Forget Gate

Zhixuan Lin

Evgenii Nikishin

Xu He

Aaron Courville

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we… (see more) show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at [`https://github.com/zhixuan-lin/forgetting-transformer`](https://github.com/zhixuan-lin/forgetting-transformer).

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

The Curse of Diversity in Ensemble-Based Exploration

We uncover a surprising phenomenon in deep reinforcement learning: training a diverse ensemble of data-sharing agents -- a well-established … (see more)exploration strategy -- can significantly impair the performance of the individual ensemble members when compared to standard single-agent training. Through careful analysis, we attribute the degradation in performance to the low proportion of self-generated data in the shared training data for each ensemble member, as well as the inefficiency of the individual ensemble members to learn from such highly off-policy data. We thus name this phenomenon *the curse of diversity*. We find that several intuitive solutions -- such as a larger replay buffer or a smaller ensemble size -- either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Finally, we demonstrate the potential of representation learning to counteract the curse of diversity with a novel method named Cross-Ensemble Representation Learning (CERL) in both discrete and continuous control domains. Our work offers valuable insights into an unexpected pitfall in ensemble-based exploration and raises important caveats for future applications of similar approaches.

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier

Pierluca D'Oro

Max Schwarzer

Evgenii Nikishin

Pierre-Luc Bacon

Marc Gendron-Bellemare

Aaron Courville

Increasing the replay ratio, the number of updates of an agent's parameters per environment interaction, is an appealing strategy for improv… (see more)ing the sample efficiency of deep reinforcement learning algorithms. In this work, we show that fully or partially resetting the parameters of deep reinforcement learning agents causes better replay ratio scaling capabilities to emerge. We push the limits of the sample efficiency of carefully-modified algorithms by training them using an order of magnitude more updates than usual, significantly improving their performance in the Atari 100k and DeepMind Control Suite benchmarks. We then provide an analysis of the design choices required for favorable replay ratio scaling to be possible and discuss inherent limits and tradeoffs.

2023-02-01

ICLR.cc/2023/Conference (notable)

openreview.net