Nicolas Le Roux

Marc-Alexandre Côté

Xingdi Yuan

2024-03-11

ICLR.cc/2024/Workshop/LLMAgents (poster)

Language-guided Skill Learning with Temporal Variational Inference

Haotian Fu

Pratyusha Sharma

Elias Stengel-Eskin

George Konidaris

Marc-Alexandre Côté

Xingdi Yuan

We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose… (see more) an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off between compression and reusability, we introduce a novel auxiliary objective based on the Minimum Description Length principle that helps guide this skill discovery process. We test our system on BabyAI, a grid world navigation environment, as well as ALFRED, a household simulation environment.Our results demonstrate that agents equipped with our method can discover skills that help accelerate learning and outperform baseline skill learning approaches on new long-horizon tasks.

2024-03-11

ICLR.cc/2024/Workshop/LLMAgents (poster)

Unraveling the Interconnected Axes of Heterogeneity in Machine Learning for Democratic and Inclusive Advancements

Maryam Molamohammadi

Afaf Taïk

Golnoosh Farnadi

2023-10-30

Equity and Access in Algorithms, Mechanisms, and Optimization (published)

arxiv.org

Surrogate Minimization: An Optimization Algorithm for Training Large Neural Networks with Model Parallelism

Reza Asad

Reza Babanezhad Harikandeh

Issam Hadj Laradji

Sharan Vaswani

2023-10-26

NeurIPS.cc/2023/Workshop/OPT (poster)

Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees

Sharan Vaswani

Amirreza Kazemi

Reza Babanezhad Harikandeh

Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of using any policy gradient metho… (see more)d as the actor and value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic, AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO, PPO) and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.

2023-09-21

NeurIPS.cc/2023/Conference (poster)

Joint Prompt Optimization of Stacked LLMs using Variational Inference

Alessandro Sordoni

Xingdi Yuan

Marc-Alexandre Côté

Matheus Pereira

Adam Trischler

Ziang Xiao

Arian Hosseini

Friederike Niedtner

Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can b… (see more)e seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-Layer language network (DLN-1). Then, we present an extension that applies to 2-layer DLNs (DLN-2), where two prompts must be learned. The key idea is to consider the output of the first layer as a latent variable, which requires inference, and prompts to be learned as the parameters of the generative distribution. We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful.

2023-09-21

NeurIPS.cc/2023/Conference (poster)

Multi-Head Adapter Routing for Cross-Task Generalization

Lucas Caccia

Edoardo Ponti

Zhan Su

Matheus Pereira

Alessandro Sordoni

Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before f… (see more)ew-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (

2023-09-21

NeurIPS.cc/2023/Conference (poster)

Target-based Surrogates for Stochastic Optimization

Jonathan Wilder Lavington

Sharan Vaswani

Reza Babanezhad Harikandeh

Mark Schmidt

We consider minimizing functions for which it is expensive to compute the gradient. Such functions are prevalent in reinforcement learning, … (see more)imitation learning and bilevel optimization. Our target optimization framework uses the (expensive) gradient computation to construct surrogate functions in a \emph{target space} (e.g. the logits output by a linear model for classification) that can be minimized efficiently. This allows for multiple parameter updates to the model, amortizing the cost of gradient computation. In the full-batch setting, we prove that our surrogate is a global upper-bound on the loss, and can be (locally) minimized using a black-box optimization algorithm. We prove that the resulting majorization-minimization algorithm ensures convergence to a stationary point of the loss. Next, we instantiate our framework in the stochastic setting and propose the

2023-04-24

ICML.cc/2023/Conference (poster)

Information matrices and generalization

Valentin Thomas

Fabian Pedregosa

Bart Van Merrienboer

Pierre-Antoine Manzagol

Yoshua Bengio

This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically… (see more) demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address the question of whether C is a reasonable approximation to F, as is commonly assumed.

2019-06-18

arXiv.org (preprint)

dblp.uni-trier.de

Information matrices and generalization

Valentin Thomas

Fabian Pedregosa

Bart Van Merrienboer

Pierre-Antoine Manzagol

Yoshua Bengio

This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically… (see more) demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address the question of whether C is a reasonable approximation to F, as is commonly assumed.

2019-06-18

arXiv.org (preprint)

dblp.uni-trier.de

Distributional reinforcement learning with linear function approximation

Marc Gendron-Bellemare

Pablo Samuel Castro

Subhodeep Moitra

Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited.… (see more) One exception is Rowland et al. (2018)'s analysis of the C51 algorithm in terms of the Cramer distance, but their results only apply to the tabular setting and ignore C51's use of a softmax to produce normalized distributions. In this paper we adapt the Cramer distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramer-based and can be combined to linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the model's prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramer-based distributional methods may perform worse than directly approximating the value function.

2019-02-08

ArXiv (preprint)

arxiv.org

Distributional reinforcement learning with linear function approximation

Marc Gendron-Bellemare