Portrait of Nicolas Le Roux

Nicolas Le Roux

Core Industry Member
Canada CIFAR AI Chair
Adjunct Professor, McGill University, School of Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Microsoft Research

Biography

I am an academic researcher specializing in machine learning, computer vision, neural networks, deep learning, optimization, large-scale learning, and statistical modeling in general.

Current Students

PhD - Université de Montréal
Co-supervisor:
PhD - Université de Montréal
Principal supervisor:

Publications

On the interplay between noise and curvature and its effect on optimization and generalization
Valentin Thomas
Fabian Pedregosa
Bart van Merriënboer
Pierre-Antoine Manzagol
The speed at which one can minimize an expected loss using stochastic methods depends on two properties: the curvature of the loss and the variance of the gradients. While most previous works focus on one or the other of these properties, we explore how their interaction affects optimization speed. Further, as the ultimate goal is good generalization performance, we clarify how both curvature and noise are relevant to properly estimate the generalization gap. Realizing that the limitations of some existing works stem from a confusion between these matrices, we also clarify the distinction between the Fisher matrix, the Hessian, and the covariance matrix of the gradients.
Information matrices and generalization
Valentin Thomas
Fabian Pedregosa
Bart van Merriënboer
Pierre-Antoine Manzagol
This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address the question of whether C is a reasonable approximation to F, as is commonly assumed.
The Value Function Polytope in Reinforcement Learning
Robert Dadashi
Adrien Ali Taiga
Dale Schuurmans
We establish geometric and topological properties of the space of value functions in finite state-action Markov decision processes. Our main contribution is the characterization of the nature of its shape: a general polytope (Aigner et al., 2010). To demonstrate this result, we exhibit several properties of the structural relationship between policies and value functions including the line theorem, which shows that the value functions of policies constrained on all but one state describe a line segment. Finally, we use this novel perspective to introduce visualizations to enhance the understanding of the dynamics of reinforcement learning algorithms.
Understanding the impact of entropy on policy optimization
Zafarali Ahmed
Mohammad Norouzi
Dale Schuurmans
Entropy regularization is commonly used to improve policy optimization in reinforcement learning. It is believed to help with exploration by encouraging the selection of more stochastic policies. In this work, we analyze this claim using new visualizations of the optimization landscape based on randomly perturbing the loss function. We first show that even with access to the exact gradient, policy optimization is difficult due to the geometry of the objective function. Then, we qualitatively show that in some environments, a policy with higher entropy can make the optimization landscape smoother, thereby connecting local optima and enabling the use of larger learning rates. This paper presents new tools for understanding the optimization landscape, shows that policy entropy serves as a regularizer, and highlights the challenge of designing general-purpose policy optimization algorithms.
Distributional reinforcement learning with linear function approximation
Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is the analysis by Rowland et al. (2018) of the C51 algorithm in terms of the Cramer distance, but their results only apply to the tabular setting and ignore C51's use of a softmax to produce normalized distributions. In this paper we adapt the Cramer distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramer-based and can be combined with linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the model's prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramer-based distributional methods may perform worse than directly approximating the value function.
Anytime Tail Averaging
Tail averaging consists in averaging the last examples in a stream. Common techniques either have a memory requirement which grows with the number of samples to average, are not available at every timestep, or do not accommodate growing windows. We propose two techniques with a low constant memory cost that perform tail averaging with access to the average at every time step. We also show how one can improve the accuracy of that average at the cost of increased memory consumption.
A Geometric Perspective on Optimal Representations for Reinforcement Learning
Will Dabney
Robert Dadashi
Adrien Ali Taiga
Dale Schuurmans
Tor Lattimore
Clare Lyle
We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. We leverage this perspective to provide formal evidence regarding the usefulness of value functions as auxiliary tasks. Our formulation considers adapting the representation to minimize the (linear) approximation of the value function of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain.