Nicolas Le Roux

VinePPO: Refining Credit Assignment in RL Training of LLMs

Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without them, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of the expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.
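As a rough illustration of the Monte Carlo idea mentioned in this abstract, the sketch below estimates the value of an intermediate state by averaging the returns of several independently sampled continuations, then scores a step by the change in estimated value. The toy random-walk environment, the function names `rollout`, `mc_value` and `step_advantage`, and the number of samples are all hypothetical choices made for illustration; this is not the paper's implementation.

```python
import random

def rollout(state, rng, horizon=5):
    """Hypothetical toy rollout: from `state`, take `horizon` random +1/-1
    steps; the return is 1.0 if the final total is positive, else 0.0."""
    total = state
    for _ in range(horizon):
        total += rng.choice([-1, 1])
    return 1.0 if total > 0 else 0.0

def mc_value(state, rng, num_samples=64):
    """Monte Carlo value estimate: mean return over independent rollouts."""
    returns = [rollout(state, rng) for _ in range(num_samples)]
    return sum(returns) / len(returns)

def step_advantage(state, next_state, rng, reward=0.0, num_samples=64):
    """Score a step as reward + V(next_state) - V(state), without discounting."""
    v_next = mc_value(next_state, rng, num_samples)
    v_curr = mc_value(state, rng, num_samples)
    return reward + v_next - v_curr

rng = random.Random(0)          # fixed seed so the sketch is reproducible
v_low = mc_value(-3, rng)       # state far from success
v_high = mc_value(3, rng)       # state close to success
adv = step_advantage(0, 2, rng) # credit assigned to the step 0 -> 2
```

Because each rollout is an independent sample of the return, the average is an unbiased estimate of the state's value; the price is extra sampling, which is cheap to set up here because any intermediate state can be used as a fresh starting point.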

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

openreview.net

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad

Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without them, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of the expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Nicolas Le Roux

Marc Gendron-Bellemare

Jonathan Lebensold

Arnaud Bergeron

Joshua Greaves

Alex Fréchette

Carolyne Pelletier

Éric Thibodeau-Laufer

Sándor Toth

Sam Work

We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains when training a model both for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the "wasted inference" that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

openreview.net

Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Nicolas Le Roux

Marc Gendron-Bellemare

Jonathan Lebensold

Arnaud Bergeron

Joshua Greaves

Alex Fréchette

Carolyne Pelletier

Éric Thibodeau-Laufer

Sándor Toth

Sam Work

2025-03-18

ArXiv (preprint)

arxiv.org

Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Nicolas Le Roux

Marc Gendron-Bellemare

Jonathan Lebensold

Arnaud Bergeron

Joshua Greaves

Alex Fréchette

Carolyne Pelletier

Éric Thibodeau-Laufer

Sándor Toth

Sam Work

2025-03-18

ArXiv (preprint)

doi.org

arxiv.org

Fast Convergence of Softmax Policy Mirror Ascent

Reza Asad

Reza Babanezhad Harikandeh

Issam Hadj Laradji

Nicolas Le Roux

Sharan Vaswani

Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. (2021) introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions, and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

openreview.net

Fast Convergence of Softmax Policy Mirror Ascent

Reza Asad

Reza Babanezhad Harikandeh

Issam Hadj Laradji

Nicolas Le Roux

Sharan Vaswani

Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. (2021) introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions, and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

proceedings.mlr.press

openreview.net

fLSA: Learning Semantic Structures in Document Collections Using Foundation Models

Weijia Xu

Nebojsa Jojic

Nicolas Le Roux

Humans can learn to solve new tasks by inducing high-level strategies from example solutions to similar problems and then adapting these strategies to solve unseen problems. Can we use large language models to induce such high-level structure from example documents or solutions? We introduce fLSA, a foundation-model-based Latent Semantic Analysis method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the latent structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fLSA tags are more informative in reconstructing the original texts than existing tagging methods. Moreover, when used for hierarchical sampling, fLSA tags help expand the output space in the right directions that lead to correct solutions more often than direct sampling and hierarchical sampling with existing tagging methods. Code: https://github.com/microsoft/fLSA

2024-12-31

EMNLP (published)

doi.org

arxiv.org

Fast Convergence of Softmax Policy Mirror Ascent for Bandits & Tabular MDPs

Reza Asad

Reza Babanezhad Harikandeh

Issam Hadj Laradji

Nicolas Le Roux

Sharan Vaswani

We analyze the convergence of a novel policy gradient algorithm (referred to as SPMA) for multi-armed bandits and tabular Markov decision processes (MDPs). SPMA is an instantiation of mirror ascent and uses the softmax parameterization with a log-sum-exp mirror map. Given access to the exact policy gradients, we prove that SPMA with a constant step-size requires

2024-10-10

NeurIPS.cc/2024/Workshop/OPT (published)

openreview.net

How Learning Rates Shape Neural Network Focus: Insights from Example Ranking

Ekaterina Lobacheva

Keller Jordan

Aristide Baratin

Nicolas Le Roux

The learning rate is a key hyperparameter that affects both the speed of training and the generalization performance of neural networks. Through a new loss-based example ranking analysis, we show that networks trained with different learning rates focus their capacity on different parts of the data distribution, leading to solutions with different generalization properties. These findings, which hold across architectures and datasets, provide new insights into how learning rates affect model performance and example-level dynamics in neural networks.

2024-10-10

NeurIPS.cc/2024/Workshop/SciForDL (poster)

openreview.net

VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning

Amirhossein Kazemnejad

Large language models (LLMs) are increasingly required to solve complex reasoning tasks, like mathematical problems, that involve multiple reasoning steps before feedback is received. Effectively identifying and prioritizing key steps by accurately assigning credit to these intermediate steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm for finetuning LLMs, addresses the credit assignment problem by employing value networks to predict the expected cumulative rewards of intermediate states. In this work, we identify significant limitations with this value estimation method. To address this, we propose VinePPO, which leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates of the intermediate values. VinePPO consistently outperforms standard PPO, doing so more efficiently and with lower divergence from the reference model. Our findings underscore the critical importance of accurate credit assignment in LLM post-training and present a simple, yet effective solution.

2024-10-09

NeurIPS.cc/2024/Workshop/MATH-AI (accepted)

openreview.net
