Nicolas Le Roux

Core Industry Member

Adjunct Professor, McGill University, School of Computer Science

Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research

Research Scientist, Microsoft Research

Research Topics

Deep Learning

Generative Models

Optimization

Reinforcement Learning

Biography

I am an academic researcher with expertise in machine learning, computer vision, neural networks, deep learning, optimization, large-scale learning and statistical modelling in general.

Current Students

Arnaud Bergeron

Master's Research - Université de Montréal

Alan Chan

PhD - Université de Montréal

Co-supervisor :

Master's Research - McGill University

Co-supervisor :

PhD - McGill University

Principal supervisor: Joelle Pineau

Website

Google Scholar

Publications

Multi-Head Adapter Routing for Cross-Task Generalization

Lucas Caccia

Edoardo Ponti

Zhan Su

Matheus Pereira

Nicolas Le Roux

Alessandro Sordoni

Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (

openreview.net

Target-based Surrogates for Stochastic Optimization

Jonathan Wilder Lavington

Sharan Vaswani

Reza Babanezhad Harikandeh

Mark Schmidt

Nicolas Le Roux

We consider minimizing functions for which it is expensive to compute the gradient. Such functions are prevalent in reinforcement learning, imitation learning and bilevel optimization. Our target optimization framework uses the (expensive) gradient computation to construct surrogate functions in a target space (e.g. the logits output by a linear model for classification) that can be minimized efficiently. This allows for multiple parameter updates to the model, amortizing the cost of gradient computation. In the full-batch setting, we prove that our surrogate is a global upper bound on the loss, and can be (locally) minimized using a black-box optimization algorithm. We prove that the resulting majorization-minimization algorithm ensures convergence to a stationary point of the loss. Next, we instantiate our framework in the stochastic setting and propose the

2023-04-24

ICML.cc/2023/Conference (poster)

doi.org

openreview.net

Multi-Head Adapter Routing for Cross-Task Generalization

Lucas Caccia

Edoardo Ponti

Zhan Su

Matheus Pereira

Nicolas Le Roux

Alessandro Sordoni

2022-11-07

ArXiv (preprint)

arxiv.org

A general class of surrogate functions for stable and efficient reinforcement learning

Sharan Vaswani

Olivier Bachem

Simone Totaro

Robert Müller

Shivam Garg

Matthieu Geist

Marlos C. Machado

Pablo Samuel Castro

Nicolas Le Roux

2022-01-01

AISTATS (published)

proceedings.mlr.press

arxiv.org

Multi-Head Adapter Routing for Data-Efficient Fine-Tuning

Lucas Caccia

Edoardo Ponti

Lu Liu

Matheus Pereira

Nicolas Le Roux

Alessandro Sordoni

Parameter-efficient fine-tuning (PEFT) methods can adapt large language models to downstream tasks by training a small number of newly added parameters. In multi-task settings, PEFT adapters typically train on each task independently, inhibiting transfer across tasks, or on the concatenation of all tasks, which can lead to negative interference. To address this, Polytropon [Ponti et al., 2022] jointly learns an inventory of PEFT adapters and a routing function to share variable-size sets of adapters across tasks. Subsequently, adapters can be re-combined and fine-tuned on novel tasks even with limited data. In this paper, we investigate to what extent the ability to control which adapters are active for each task leads to sample-efficient generalization. Thus, we propose less expressive variants where we perform weighted averaging of the adapters before few-shot adaptation (Poly-µ) instead of learning a routing function. Moreover, we introduce more expressive variants where finer-grained task–adapter allocation is learned through a multi-head routing function (Poly-S). We test these variants on three separate benchmarks for multi-task learning. We find that Poly-S achieves gains on all three (up to 5.3 points on average) over strong baselines, while incurring a negligible additional cost in parameter count. In particular, we find that instruction tuning, where models are fully fine-tuned on natural language instructions for each task, is inferior to modular methods such as Polytropon and our proposed variants.

2022-01-01

arXiv.org (preprint)

doi.org

On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging

Chris Junchi Li

Yaodong Yu

Nicolas Loizou

Gauthier Gidel

Yitong Ma

Nicolas Le Roux

Michael I. Jordan

We study the stochastic bilinear minimax optimization problem, presenting an analysis of the same-sample Stochastic ExtraGradient (SEG) method with constant step size, and presenting variations of the method that yield favorable convergence. In sharp contrast with the basic SEG method, whose last iterate only contracts to a fixed neighborhood of the Nash equilibrium, SEG augmented with iteration averaging provably converges to the Nash equilibrium under the same standard settings, and such a rate is further improved by incorporating a scheduled restarting procedure. In the interpolation setting where noise vanishes at the Nash equilibrium, we achieve an optimal convergence rate up to tight constants. We present numerical experiments that validate our theoretical findings and demonstrate the effectiveness of the SEG method when equipped with iteration averaging and restarting.

2022-01-01

AISTATS (published)

arxiv.org

Impact of Aliasing on Generalization in Deep Convolutional Networks

Cristina Vasconcelos

Hugo Larochelle

Vincent Dumoulin

Rob Romijnders

Nicolas Le Roux

Ross Goroshin

We investigate the impact of aliasing on generalization in Deep Convolutional Networks and show that data augmentation schemes alone are unable to prevent it due to structural limitations in widely used architectures. Drawing insights from frequency analysis theory, we take a closer look at ResNet and EfficientNet architectures and review the trade-off between aliasing and information loss in each of their major components. We show how to mitigate aliasing by inserting non-trainable low-pass filters at key locations, particularly where networks lack the capacity to learn them. These simple architectural changes lead to substantial improvements in generalization on i.i.d. and even more on out-of-distribution conditions, such as image classification under natural corruptions on ImageNet-C [11] and few-shot learning on Meta-Dataset [26]. State-of-the-art results are achieved on both datasets without introducing additional trainable parameters and using the default hyper-parameters of open-source codebases.

2021-10-10

2021 IEEE/CVF International Conference on Computer Vision (ICCV) (published)

doi.org

arxiv.org

Beyond variance reduction: Understanding the true impact of baselines on policy optimization

Wesley Chung

Valentin Thomas

Marlos C. Machado

Nicolas Le Roux

2021-07-01

Proceedings of the 38th International Conference on Machine Learning (published)

proceedings.mlr.press

arxiv.org

Bridging the Gap Between Adversarial Robustness and Optimization Bias

Fartash Faghri

Cristina Vasconcelos

David J Fleet

Fabian Pedregosa

Nicolas Le Roux

2021-02-17

ArXiv (preprint)

arxiv.org

Batch Reinforcement Learning Through Continuation Method

Yijie Guo

Shengyu Feng

Nicolas Le Roux

Ed Chi

Honglak Lee

Minmin Chen

Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. Policy optimization under this setting is extremely challenging as: 1) the geometry of the objective function is hard to optimize efficiently; 2) the shift of data distributions causes high noise in the value estimation. In this work, we propose a simple yet effective policy iteration approach to batch RL using global optimization techniques known as continuation. By constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories, and continuously relaxing the constraint, our method 1) helps the agent escape local optima; 2) reduces the error in policy evaluation in the optimization procedure. We present results on a variety of control tasks, game environments, and a recommendation task to empirically demonstrate the efficacy of our proposed method.

2021-01-01

ICLR (published)

openreview.net

An Effective Anti-Aliasing Approach for Residual Networks

Cristina Vasconcelos

Hugo Larochelle

Vincent Dumoulin

Nicolas Le Roux

Ross Goroshin

Image pre-processing in the frequency domain has traditionally played a vital role in computer vision and was even part of the standard pipeline in the early days of deep learning. However, with the advent of large datasets, many practitioners concluded that this was unnecessary due to the belief that these priors can be learned from the data itself. Frequency aliasing is a phenomenon that may occur when sub-sampling any signal, such as an image or feature map, causing distortion in the sub-sampled output. We show that we can mitigate this effect by placing non-trainable blur filters and using smooth activation functions at key locations, particularly where networks lack the capacity to learn them. These simple architectural changes lead to substantial improvements in out-of-distribution generalization on both image classification under natural corruptions on ImageNet-C [10] and few-shot learning on Meta-Dataset [17], without introducing additional trainable parameters and using the default hyper-parameters of open-source codebases.

2020-11-20

ArXiv (preprint)

arxiv.org

To Each Optimizer a Norm, To Each Norm its Generalization

Sharan Vaswani

Reza Babanezhad Harikandeh

Jose Gallego

Aaron Mishkin

Simon Lacoste-Julien

Nicolas Le Roux

We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes. Since it is difficult to determine whether an optimizer converges to solutions that minimize a known norm, we flip the problem and investigate which norm is minimized by an interpolating solution. Using this reasoning, we prove that for over-parameterized linear regression, projections onto linear spans can be used to move between different interpolating solutions. For under-parameterized linear classification, we prove that for any linear classifier separating the data, there exists a family of quadratic norms ||.||_P such that the classifier's direction is the same as that of the maximum P-margin solution. For linear classification, we argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization. Furthermore, for over-parameterized linear classification, projections onto the data-span enable us to use techniques from the under-parameterized setting. On the empirical side, we propose techniques to bias optimizers towards better generalizing solutions, improving their test performance. We validate our theoretical results via synthetic experiments, and use the neural tangent kernel to handle non-linear models.

2020-06-11

ArXiv (preprint)

arxiv.org
