Aaron Courville

Reza Bayat

PhD - Université de Montréal

Co-supervisor :

Pascal Vincent

Anirudh Buvanesh

PhD - Université de Montréal

Principal supervisor :

Laurent Charlin

anirudb1102@gmail.com

Razvan Ciuca

Master's Research - Université de Montréal

Alexandre Diz Ganito

Master's Research - Université de Montréal

Juan Duque

PhD - Université de Montréal

PhD - Université de Montréal

Arian Hosseini

PhD - Université de Montréal

Uday Kapur

Professional Master's - Université de Montréal

Amr Khalifa

PhD - Université de Montréal

andrei.nicolicioiu@gmail.com

Samuel Lavoie

PhD - Université de Montréal

Zhixuan Lin

PhD - Université de Montréal

PhD - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

PhD - Université de Montréal

Co-supervisor :

Rishabh Agarwal

Andrei Nicolicioiu

PhD - Université de Montréal

Michell Mercedes Payano Perez

Evgenii Nikishin

PhD - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

Co-supervisor :

Johan Samir Obando Ceron

PhD - Université de Montréal

Co-supervisor :

Research Intern - Université de Montréal

Dereck Piché

Master's Research - Université de Montréal

pichedereck@gmail.com

Esra'a Saleh

PhD - Université de Montréal

Principal supervisor :

Master's Research - Université de Montréal

Principal supervisor :

Anna (Cheng-Zhi) Huang

Shawn Tan

PhD - Université de Montréal

PhD - Université de Montréal

Principal supervisor :

(Rex) Devon Hjelm

Yusong Wu

PhD - Université de Montréal

Principal supervisor :

Anna (Cheng-Zhi) Huang

Xiaofeng Zhang

PhD - Université de Montréal

Dinghuai Zhang

PhD - Université de Montréal

Co-supervisor :

Yoshua Bengio

Hattie Zhou

PhD - Université de Montréal

Principal supervisor :

Hugo Larochelle

Publications

Faster, More Efficient RLHF through Off-Policy Asynchronous Learning

Michael Noukhovitch

Shengyi Huang

Sophie Xhonneux

Arian Hosseini

Rishabh Agarwal

To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human feedback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy (synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs), is performant but not efficient. Following prior work in the general deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronous generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.

2025-01-22

ICLR.cc/2025/Conference (poster)

Forgetting Transformer: Softmax Attention with a Forget Gate

Zhixuan Lin

Evgenii Nikishin

Xu He

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at [`https://github.com/zhixuan-lin/forgetting-transformer`](https://github.com/zhixuan-lin/forgetting-transformer).

2025-01-22

ICLR.cc/2025/Conference (poster)
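The Forgetting Transformer abstract above describes down-weighting unnormalized attention scores with a data-dependent forget gate. The following is a minimal sketch of one way to realize that idea, not the authors' implementation: it assumes one scalar gate per position with a value in (0, 1], and it assumes that the score between query `i` and key `j` is multiplied by the product of the gates at positions `j+1` through `i` before normalization. The function name and the pure-Python list layout are hypothetical and chosen only for illustration.

```python
import math

def forgetting_attention_weights(scores, forget_gates):
    """Causal attention weights with a multiplicative forget-gate decay.

    scores[i][j]    : raw logit between query i and key j (only j <= i is used).
    forget_gates[t] : scalar gate in (0, 1] for position t (assumed form).
    Returns one list of normalized weights per query position.
    """
    n = len(scores)
    # Prefix sums of log-gates: prefix[t + 1] = sum of log f_l for l = 0..t.
    prefix = [0.0] * (n + 1)
    for t in range(n):
        prefix[t + 1] = prefix[t] + math.log(forget_gates[t])

    weights = []
    for i in range(n):
        # Adding sum_{l=j+1..i} log f_l to the logit is the same as multiplying
        # the unnormalized score exp(logit) by the product of those gates.
        biased = [scores[i][j] + prefix[i + 1] - prefix[j + 1] for j in range(i + 1)]
        top = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(b - top) for b in biased]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With all gates equal to 1 the log-gates are zero and the function reduces to ordinary causal softmax attention. With gates below 1, older keys receive geometrically less weight, so recency is encoded without positional embeddings in this sketch.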

Neuroplastic Expansion in Deep Reinforcement Learning

Jiashun Liu

Johan Samir Obando Ceron

Ling Pan

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study

Shawn Tan

Songlin Yang

Rameswar Panda

Yikang Shen

The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. However, current methods still face length generalisation challenges. We investigate an alternative attention mechanism based on the stick-breaking process in larger scale settings. The method works as follows: For each token before the current, we determine a break point, which represents the proportion of the stick, the weight of the attention, to allocate to the current token. We repeat this on the remaining stick, until all tokens are allocated a weight, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et al., 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss implementation of numerically stable stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods on length generalisation and downstream tasks. Stick-breaking also performs well at length generalisation, allowing a model trained with

2025-01-22

ICLR.cc/2025/Conference (poster)
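The stick-breaking procedure described in the abstract above can be sketched for a single query as follows. This is an illustrative reading rather than the paper's implementation: it assumes each break point is the sigmoid of a query-key logit, and it assumes the stick is broken starting from the most recent token and moving backwards, which is how the recency bias would arise. The function names are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def stick_breaking_weights(logits):
    """Attention weights for one query over its earlier tokens.

    logits[j] is the query-key logit for the j-th earlier token, oldest first.
    Returns (weights, remaining), where `remaining` is the unallocated stick.
    """
    weights = [0.0] * len(logits)
    remaining = 1.0  # length of stick not yet allocated
    # Assumption: break from the most recent token backwards.
    for j in reversed(range(len(logits))):
        beta = sigmoid(logits[j])      # proportion of the remaining stick to take
        weights[j] = beta * remaining  # attention weight for token j
        remaining *= 1.0 - beta        # what is left for older tokens
    return weights, remaining

# Equal logits of 0 give each break point 0.5, so weights halve with distance.
w, rest = stick_breaking_weights([0.0, 0.0, 0.0])
# w == [0.125, 0.25, 0.5], rest == 0.125
```

Unlike softmax, the weights in this sketch need not sum to one: whatever stick is left after the oldest token stays unallocated. Token order enters only through the order in which the stick is broken, which is why no positional embedding appears.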

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch

Shengyi Huang

Sophie Xhonneux

Arian Hosseini

Rishabh Agarwal

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.

2024-10-23

ArXiv (preprint)

doi.org

arxiv.org
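To make the notion of off-policyness in the Asynchronous RLHF abstract above concrete, here is a toy, deterministic simulation. It is not the paper's training code: it only tracks policy version numbers and assumes a hypothetical setup in which a fixed number of batches is always in flight, the generator produces one new batch with the current weights during each learner step, and each learner step increments the policy version by one.

```python
from collections import deque

def simulate_staleness(num_steps, batches_in_flight=1):
    """Return, for each learner step, how many versions old its batch is.

    Staleness 0 means on-policy; larger values mean more off-policy data.
    """
    if batches_in_flight < 1:
        raise ValueError("batches_in_flight must be at least 1")
    policy_version = 0
    # Each queue entry records the policy version that generated that batch.
    queue = deque([policy_version] * batches_in_flight)
    staleness = []
    for _ in range(num_steps):
        batch_version = queue.popleft()  # learner takes the oldest batch
        queue.append(policy_version)     # generator works concurrently with current weights
        staleness.append(policy_version - batch_version)
        policy_version += 1              # learner finishes its update
    return staleness

print(simulate_staleness(4, batches_in_flight=1))  # [0, 1, 1, 1]
print(simulate_staleness(5, batches_in_flight=2))  # [0, 1, 2, 2, 2]
```

In this toy model, keeping one batch in flight means the learner trains on samples that are one version old after the first step, and a deeper queue increases that lag. The abstract's question of how much off-policyness can be tolerated corresponds to how large this lag can grow before performance suffers.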

Stick-breaking Attention

Shawn Tan

Yikang Shen

Songlin Yang