Portrait de Nicolas Le Roux

Nicolas Le Roux

Membre industriel principal
Professeur associé, McGill University, École d'informatique
Professeur associé, Université de Montréal, Département d'informatique et de recherche opérationnelle
Chercheur scientifique, Microsoft Research
Sujets de recherche
Apprentissage par renforcement
Apprentissage profond
Modèles génératifs
Optimisation

Biographie

Je suis un chercheur universitaire spécialisé dans l'apprentissage automatique, la vision par ordinateur, les réseaux de neurones, l'apprentissage en profondeur, l'optimisation, l'apprentissage à grande échelle et la modélisation statistique en général.

Étudiants actuels

Maîtrise recherche - UdeM
Co-superviseur⋅e :
Maîtrise recherche - McGill
Co-superviseur⋅e :
Doctorat - McGill
Superviseur⋅e principal⋅e :

Publications

Tapered Off-Policy REINFORCE Stable and efficient reinforcement learning for LLMs
Bellemare Marc-Emmanuel
Jonathan Lebensoldt
Joshua Greaves
Alex Fréchette
Sándor Toth
Sam Work
We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an a… (voir plus)symmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains for training both a model for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the ``wasted inference'' that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
VinePPO: Refining Credit Assignment in RL Training of LLMs
Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receivi… (voir plus)ng any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without it, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimate of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLM.
Language-guided Skill Learning with Temporal Variational Inference
Haotian Fu
Pratyusha Sharma
Elias Stengel-Eskin
George Konidaris
Xingdi Yuan
We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose… (voir plus) an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off between compression and reusability, we introduce a novel auxiliary objective based on the Minimum Description Length principle that helps guide this skill discovery process. Our results demonstrate that agents equipped with our method are able to discover skills that help accelerate learning and outperform baseline skill learning approaches on new long-horizon tasks in BabyAI, a grid world navigation environment, as well as ALFRED, a household simulation environment.
Towards Modular LLMs by Building and Reusing a Library of LoRAs
The growing number of parameter-efficient adaptations of a base large language model (LLM) calls for studying whether we can reuse such trai… (voir plus)ned adapters to improve performance for new tasks. We study how to best build a library of adapters given multi-task data and devise techniques for both zero-shot and supervised task generalization through routing in such library. We benchmark existing approaches to build this library and introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters, indirectly optimizing for transfer across the multi-task dataset. To re-use the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters for new inputs without the need for retraining. We experiment with several LLMs, such as Phi-2 and Mistral, on a wide array of held-out tasks, verifying that MBC-based adapters and Arrow routing lead to superior generalization to new tasks. We make steps towards creating modular, adaptable LLMs that can match or outperform traditional joint training.
Unraveling the Interconnected Axes of Heterogeneity in Machine Learning for Democratic and Inclusive Advancements
The growing utilization of machine learning (ML) in decision-making processes raises questions about its benefits to society. In this study,… (voir plus) we identify and analyze three axes of heterogeneity that significantly influence the trajectory of ML products. These axes are i) values, culture and regulations, ii) data composition, and iii) resource and infrastructure capacity. We demonstrate how these axes are interdependent and mutually influence one another, emphasizing the need to consider and address them jointly. Unfortunately, the current research landscape falls short in this regard, often failing to adopt a holistic approach. We examine the prevalent practices and methodologies that skew these axes in favor of a selected few, resulting in power concentration, homogenized control, and increased dependency. We discuss how this fragmented study of the three axes poses a significant challenge, leading to an impractical solution space that lacks reflection of real-world scenarios. Addressing these issues is crucial to ensure a more comprehensive understanding of the interconnected nature of society and to foster the democratic and inclusive development of ML systems that are more aligned with real-world complexities and its diverse requirements.
A Case Study of Instruction Tuning with Mixture of Parameter-Efficient Experts
Joint Prompt Optimization of Stacked LLMs using Variational Inference
Xingdi Yuan
Xingdi Yuan
Matheus Pereira
Adam Trischler
Ziang Xiao
Friederike Niedtner
Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can b… (voir plus)e seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-Layer language network (DLN-1). Then, we present an extension that applies to 2-layer DLNs (DLN-2), where two prompts must be learned. The key idea is to consider the output of the first layer as a latent variable, which requires inference, and prompts to be learned as the parameters of the generative distribution. We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful.
Multi-Head Adapter Routing for Cross-Task Generalization
Lucas Caccia
Edoardo Ponti
Matheus Pereira
Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before f… (voir plus)ew-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (
A general class of surrogate functions for stable and efficient reinforcement learning
Olivier Bachem
Robert Müller
Shivam Garg
Matthieu Geist
Marlos C. Machado
Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions… (voir plus) have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
Valentin Thomas
Marlos C. Machado
Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performan… (voir plus)ce while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that this is not the case for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
On the interplay between noise and curvature and its effect on optimization and generalization
Valentin Thomas
Fabian Pedregosa
Pierre-Antoine Mangazol
The speed at which one can minimize an expected loss using stochastic methods depends on two properties: the curvature of the loss and the v… (voir plus)ariance of the gradients. While most previous works focus on one or the other of these properties, we explore how their interaction affects optimization speed. Further, as the ultimate goal is good generalization performance, we clarify how both curvature and noise are relevant to properly estimate the generalization gap. Realizing that the limitations of some existing works stems from a confusion between these matrices, we also clarify the distinction between the Fisher matrix, the Hessian, and the covariance matrix of the gradients.