Publications

Joint Prompt Optimization of Stacked LLMs using Variational Inference

Alessandro Sordoni

Xingdi Yuan

Marc-Alexandre Côté

Matheus Pereira

Adam Trischler

Ziang Xiao

Arian Hosseini

Friederike Niedtner

Nicolas Le Roux

Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can be seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-layer language network (DLN-1). Then, we present an extension that applies to 2-layer DLNs (DLN-2), where two prompts must be learned. The key idea is to consider the output of the first layer as a latent variable, which requires inference, and prompts to be learned as the parameters of the generative distribution. We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful.

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Language Model Alignment with Elastic Reset

Michael Noukhovitch

Samuel Lavoie

Florian Strub

Aaron Courville

Fine-tuning language models with reinforcement learning (RL), e.g. from human feedback (HF), is a prominent method for alignment. But optimizing against a reward model can improve on reward while degrading performance in other areas, a phenomenon known as reward hacking, alignment tax, or language drift. First, we argue that commonly used test metrics are insufficient and instead measure how different algorithms trade off between reward and drift. The standard method modifies the reward with a Kullback-Leibler (KL) penalty between the online and initial model. We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective. We periodically reset the online model to an exponential moving average (EMA) of itself, then reset the EMA model to the initial model. Through the use of an EMA, our model recovers quickly after resets and achieves higher reward with less drift in the same number of steps. We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small-scale pivot-translation benchmark, outperforms all baselines in a medium-scale RLHF-like IMDB mock sentiment task and leads to a more performant and more aligned technical QA chatbot with LLaMA-7B. Code available at github.com/mnoukhov/elastic-reset.
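The reset schedule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: parameters are plain lists of floats, and the names `train_step`, `reset_every` and `decay` are hypothetical stand-ins for an RL update, the reset period and the EMA decay rate.

```python
from typing import Callable, List

Params = List[float]


def ema_update(ema: Params, online: Params, decay: float) -> Params:
    """Move the EMA weights a step toward the current online weights."""
    return [decay * e + (1.0 - decay) * o for e, o in zip(ema, online)]


def elastic_reset_training(
    initial: Params,
    train_step: Callable[[Params], Params],  # hypothetical: one RL update
    num_steps: int,
    reset_every: int,                        # hypothetical reset period
    decay: float = 0.9,                      # hypothetical EMA decay
) -> Params:
    """Toy loop: train, track an EMA, and periodically reset both models."""
    online = list(initial)
    ema = list(initial)
    for step in range(1, num_steps + 1):
        online = train_step(online)
        ema = ema_update(ema, online, decay)
        if step % reset_every == 0:
            online = list(ema)   # reset the online model to its EMA
            ema = list(initial)  # then reset the EMA to the initial model
    return online
```

With a toy `train_step` that adds 1.0 to every weight, the online weights are pulled back toward the initial model at each reset instead of drifting without bound.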

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions

Stefano Massaroli

Michael Poli

Daniel Y Fu

Hermann Kumbong

Rom Nishijima Parnichkun

Aman Timalsina

David W. Romero

Quinn McIntyre

Beidi Chen

Atri Rudra

Ce Zhang

Christopher Ré

Stefano Ermon

Yoshua Bengio

Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in many domains, but incur a significant cost during auto-regressive inference workloads -- naively requiring a full pass (or caching of activations) over the input sequence for each generated token -- similarly to attention-based models. In this paper, we seek to enable

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Learning better with Dale's Law: A Spectral Perspective

Most recurrent neural networks (RNNs) do not include a fundamental constraint of real neural circuits: Dale’s Law, which implies that neurons must be excitatory (E) or inhibitory (I). Dale’s Law is generally absent from RNNs because simply partitioning a standard network’s units into E and I populations impairs learning. However, here we extend a recent feedforward bio-inspired EI network architecture, named Dale’s ANNs, to recurrent networks, and demonstrate that good performance is possible while respecting Dale’s Law. This raises the question: What makes some forms of EI network learn poorly and others learn well? And why does the simple approach of incorporating Dale’s Law impair learning? Historically the answer was thought to be the sign constraints on EI network parameters, and this was a motivation behind Dale’s ANNs. However, here we show the spectral properties of the recurrent weight matrix at initialisation are more impactful on network performance than sign constraints. We find that simple EI partitioning results in a singular value distribution that is multimodal and dispersed, whereas standard RNNs have a unimodal, more clustered singular value distribution, as do recurrent Dale’s ANNs. We also show that the spectral properties and performance of partitioned EI networks are worse for small networks with fewer I units, and we present normalised SVD entropy as a measure of spectrum pathology that correlates with performance. Overall, this work sheds light on a long-standing mystery in neuroscience-inspired AI and computational neuroscience, paving the way for greater alignment between neural networks and biology.
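The abstract does not give a formula for normalised SVD entropy. The sketch below assumes one common definition: normalise the singular values so they sum to one, take their Shannon entropy, and divide by log N so the result lies in [0, 1]. The function name and the choice to pass precomputed singular values are illustrative assumptions, not the paper's code.

```python
import math
from typing import Sequence


def normalised_svd_entropy(singular_values: Sequence[float]) -> float:
    """Shannon entropy of the normalised singular values, divided by log N.

    Assumed definition: values near 1 indicate a clustered spectrum,
    values near 0 indicate a spectrum dominated by a few singular values.
    """
    n = len(singular_values)
    total = sum(singular_values)
    if n < 2 or total <= 0:
        raise ValueError("need at least two singular values with a positive sum")
    probs = [s / total for s in singular_values]
    # Zero singular values contribute nothing (limit of p * log p as p -> 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)
```

Under this assumed definition, a flat spectrum such as `[1.0, 1.0, 1.0, 1.0]` scores close to 1, while a dispersed spectrum such as `[10.0, 1.0, 0.1, 0.01]` scores noticeably lower.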

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Learning Reliable Logical Rules with SATNet

Zhaoyu Li

Jinpei Guo

Yuhe Jiang

Xujie Si

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Let the Flows Tell: Solving Graph Combinatorial Optimization Problems with GFlowNets

Dinghuai Zhang

Hanjun Dai

Nikolay Malkin

Aaron Courville

Yoshua Bengio

Ling Pan

2023-09-20

Neural Information Processing Systems (spotlight)

doi.org

openreview.net

Lie Point Symmetry and Physics Informed Networks

Tara Akhound-Sadegh

Laurence Perreault-Levasseur

Johannes Brandstetter

Max Welling

Siamak Ravanbakhsh

Symmetries have been leveraged to improve the generalization of neural networks through different mechanisms from data augmentation to equivariant architectures. However, despite their potential, their integration into neural solvers for partial differential equations (PDEs) remains largely unexplored. We explore the integration of PDE symmetries, known as Lie point symmetries, in a major family of neural solvers known as physics-informed neural networks (PINNs). We propose a loss function that informs the network about Lie point symmetries in the same way that PINN models try to enforce the underlying PDE through a loss function. Intuitively, our symmetry loss ensures that the infinitesimal generators of the Lie group conserve the PDE solutions. Effectively, this means that once the network learns a solution, it also learns the neighbouring solutions generated by Lie point symmetries. Empirical evaluations indicate that the inductive bias introduced by the Lie point symmetries of the PDEs greatly boosts the sample efficiency of PINNs.

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Maximum State Entropy Exploration using Predecessor and Successor Representations

Animals have a developed ability to explore that aids them in important tasks such as locating food, exploring for shelter, and finding misplaced items. These exploration skills necessarily track where they have been so that they can plan for finding items with relative efficiency. Contemporary exploration algorithms often learn a less efficient exploration strategy because they either condition only on the current state or simply rely on making random open-loop exploratory moves. In this work, we propose

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Multi-Head Adapter Routing for Cross-Task Generalization

Lucas Caccia

Edoardo Ponti

Zhan Su

Matheus Pereira

Nicolas Le Roux

Alessandro Sordoni

Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Neural Graph Generation from Graph Statistics

Kiarash Zahirnia

Yaochen Hu

Mark J. Coates

Oliver Schulte

2023-09-20

NeurIPS.cc/2023/Conference (poster)

openreview.net

Optimal Extragradient-Based Algorithms for Stochastic Variational Inequalities with Separable Structure

Angela Yuan

Chris Junchi Li

Gauthier Gidel

Michael Jordan

Quanquan Gu

Simon Shaolei Du

We consider the problem of solving stochastic monotone variational inequalities with a separable structure using a stochastic first-order oracle. Building on standard extragradient for variational inequalities, we propose a novel algorithm, stochastic accelerated gradient-extragradient (AG-EG), for strongly monotone variational inequalities (VIs). Our approach combines the strengths of extragradient and Nesterov acceleration. By showing that its iterates remain in a bounded domain and applying scheduled restarting, we prove that AG-EG has an optimal convergence rate for strongly monotone VIs. Furthermore, when specializing to the particular case of bilinearly coupled strongly-convex-strongly-concave saddle-point problems, including bilinear games, our algorithm achieves fine-grained convergence rates that match the respective lower bounds, with the stochasticity being characterized by an additive statistical error term that is optimal up to a constant prefactor.

2023-09-20

NeurIPS.cc/2023/Conference (poster)

openreview.net

Parallel-mentoring for Offline Model-based Optimization

Can (Sam) Chen

Christopher Beckham

Zixuan Liu

Xue Liu

Christopher Pal

We study offline model-based optimization to maximize a black-box objective function with a static dataset of designs and scores. These designs encompass a variety of domains, including materials, robots and DNA sequences. A common approach trains a proxy on the static dataset to approximate the black-box objective function and performs gradient ascent to obtain new designs. However, this often results in poor designs due to the proxy inaccuracies for out-of-distribution designs. Recent studies indicate that: (a) gradient ascent with a mean ensemble of proxies generally outperforms simple gradient ascent, and (b) a trained proxy provides weak ranking supervision signals for design selection. Motivated by (a) and (b), we propose parallel-mentoring as an effective and novel method that facilitates mentoring among parallel proxies, creating a more robust ensemble to mitigate the out-of-distribution issue. We focus on the three-proxy case and our method consists of two modules. The first module, voting-based pairwise supervision, operates on three parallel proxies and captures their ranking supervision signals as pairwise comparison labels. These labels are combined through majority voting to generate consensus labels, which incorporate ranking supervision signals from all proxies and enable mutual mentoring. However, label noise arises due to possible incorrect consensus. To alleviate this, we introduce an adaptive soft-labeling module with soft-labels initialized as consensus labels. Based on bi-level optimization, this module fine-tunes proxies in the inner level and learns more accurate labels in the outer level to adaptively mentor proxies, resulting in a more robust ensemble. Experiments validate the effectiveness of our method. Our code is available here.
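The majority-voting step of the first module can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: each proxy is represented only by the scores it assigns to a shared list of designs, and the function names and the 0/1 label encoding are assumptions.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple

Pair = Tuple[int, int]


def pairwise_labels(scores: Sequence[float]) -> Dict[Pair, int]:
    """Label each design pair (i, j) with 1 if design i outscores design j."""
    return {
        (i, j): int(scores[i] > scores[j])
        for i, j in combinations(range(len(scores)), 2)
    }


def consensus_labels(proxy_scores: Sequence[Sequence[float]]) -> Dict[Pair, int]:
    """Combine the pairwise labels of several proxies by majority vote."""
    per_proxy = [pairwise_labels(scores) for scores in proxy_scores]
    consensus = {}
    for pair in per_proxy[0]:
        votes = sum(labels[pair] for labels in per_proxy)
        # A label of 1 wins only with a strict majority of proxies.
        consensus[pair] = int(2 * votes > len(per_proxy))
    return consensus
```

For example, with three proxies scoring three designs as `[3, 2, 1]`, `[3, 1, 2]` and `[1, 2, 3]`, two of the three proxies rank design 0 above design 1, so the consensus label for the pair `(0, 1)` is 1.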

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net
