Aaron Courville

Razvan Ciuca

Research Master's - Université de Montréal

Alexandre Diz Ganito

Research Master's - UdeM

Juan Duque

PhD - UdeM

PhD - UdeM

PhD - UdeM

Uday Kapur

Professional Master's - UdeM

Amr Khalifa

PhD - UdeM

Samuel Lavoie

PhD - UdeM

Zhixuan Lin

PhD - UdeM

PhD - UdeM

Principal supervisor:

PhD - UdeM

PhD - UdeM

Co-supervisor:

andrei.nicolicioiu@gmail.com

Andrei Nicolicioiu

PhD - UdeM

Website

Google Scholar

Evgenii Nikishin

PhD - UdeM

Principal supervisor:

PhD - UdeM

Co-supervisor:

Johan Samir Obando Ceron

PhD - UdeM

Co-supervisor:

Research Master's - UdeM

pichedereck@gmail.com

Website

Esra'a Saleh

PhD - UdeM

Principal supervisor:

Research Master's - UdeM

Principal supervisor:

Anna (Cheng-Zhi) Huang

Shawn Tan

PhD - UdeM

PhD - UdeM

Principal supervisor:

(Rex) Devon Hjelm

Google Scholar

Yusong Wu

PhD - UdeM

Principal supervisor:

Anna (Cheng-Zhi) Huang

Dinghuai Zhang

PhD - UdeM

Co-supervisor:

PhD - UdeM

Hattie Zhou

PhD - UdeM

Principal supervisor:

Hugo Larochelle

Publications

Stick-breaking Attention

Shawn Tan

Yikang Shen

Songlin Yang

Rameswar Panda

2024-10-23

ArXiv (preprint)

Faster, More Efficient RLHF through Off-Policy Asynchronous Learning

Michael Noukhovitch

Shengyi Huang

Sophie Xhonneux

Arian Hosseini

To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human feedback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy (synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs), is performant but not efficient. Following prior work in the general deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronous generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.

2024-10-10

NeurIPS.cc/2024/Workshop/FITML (poster)

Not All LLM Reasoners Are Created Equal

Arian Hosseini

Alessandro Sordoni

Daniel Toyama

2024-10-09

NeurIPS.cc/2024/Workshop/Sys2-Reasoning (poster)

VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning

Amirhossein Kazemnejad

Milad Aghajohari

Large language models (LLMs) are increasingly required to solve complex reasoning tasks, like mathematical problems, that involve multiple reasoning steps before feedback is received. Effectively identifying and prioritizing key steps by accurately assigning credit to these intermediate steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm for finetuning LLMs, addresses the credit assignment problem by employing value networks to predict the expected cumulative rewards of intermediate states. In this work, we identify significant limitations with this value estimation method. To address this, we propose VinePPO, which leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates of the intermediate values. VinePPO consistently outperforms standard PPO, doing so more efficiently and with lower divergence from the reference model. Our findings underscore the critical importance of accurate credit assignment in LLM post-training and present a simple, yet effective solution.

2024-10-09

NeurIPS.cc/2024/Workshop/MATH-AI (accepted)

Not All LLM Reasoners Are Created Equal

Arian Hosseini

Alessandro Sordoni

Daniel Toyama

We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems chained together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is, a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not because of test-set leakage, but due to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.

2024-10-02

ArXiv (preprint)
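
The reasoning gap described in the abstract above can be illustrated with a small calculation. This is only a sketch: the abstract does not give the formula, so the code assumes the independent baseline is the product of the two single-question accuracies. The function name `reasoning_gap` and the sample accuracies are hypothetical.

```python
def reasoning_gap(acc_q1: float, acc_q2: float, acc_compositional: float) -> float:
    """Gap between measured compositional accuracy and an independent baseline.

    Assumption (not stated in the abstract): if the two questions were solved
    independently, a pair would be solved with probability acc_q1 * acc_q2.
    A negative result means the model does worse on chained pairs than that.
    """
    expected_if_independent = acc_q1 * acc_q2
    return acc_compositional - expected_if_independent


# Hypothetical accuracies, for illustration only.
gap = reasoning_gap(acc_q1=0.9, acc_q2=0.8, acc_compositional=0.6)
print(f"reasoning gap: {gap:+.2f}")
```

A gap near zero would suggest the model handles chained pairs about as well as the independent baseline predicts.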

VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

Amirhossein Kazemnejad

Milad Aghajohari

Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, value networks face challenges in predicting the expected cumulative rewards accurately in complex reasoning tasks, often leading to high-variance updates and suboptimal performance. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they barely outperform a random baseline when comparing alternative steps. To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates, bypassing the need for large value networks. Our method consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets with fewer gradient updates (up to 9x) and less wall-clock time (up to 3.0x). These results emphasize the importance of accurate credit assignment in RL finetuning of LLMs and demonstrate VinePPO's potential as a superior alternative.

2024-10-02

ArXiv (preprint)
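
The Monte Carlo value estimation mentioned in the VinePPO abstract above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the environment (`toy_rollout`), the integer state encoding, and the advantage formula with zero intermediate reward and no discounting are all assumptions made for the example.

```python
import random
from typing import Callable, List


def mc_value(state: int, rollout: Callable[[int, random.Random], float],
             k: int, rng: random.Random) -> float:
    """Estimate the value of `state` as the mean return of k independent rollouts."""
    return sum(rollout(state, rng) for _ in range(k)) / k


def toy_rollout(state: int, rng: random.Random) -> float:
    """Hypothetical environment: `state` counts completed reasoning steps,
    and the chance of a correct final answer grows with it."""
    p_success = min(1.0, 0.2 + 0.2 * state)
    return 1.0 if rng.random() < p_success else 0.0


def step_advantages(states: List[int], k: int, rng: random.Random) -> List[float]:
    """Per-step advantages from Monte Carlo values.

    Assumes zero intermediate reward and no discounting, so the advantage of
    step t is V(s_{t+1}) - V(s_t).
    """
    values = [mc_value(s, toy_rollout, k, rng) for s in states]
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]


rng = random.Random(0)
print(step_advantages([0, 1, 2, 3, 4], k=50, rng=rng))
```

Because each value is an average over fresh rollouts from the same state, no separate value network is needed in this sketch; the cost is extra sampling per intermediate state.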