Publications

L AUGHING H YENA D ISTILLERY Extracting Compact Recurrences From Convolutions
∗. StefanoMassaroli
∗. MichaelPoli
∗. DanielY.Fu
Hermann Kumbong
Rom N. Parnichkun
Aman Timalsina
David W. Romero
Quinn McIntyre
Beidi Chen
Atri Rudra
Ce Zhang
Christopher Re
Stefano Ermon
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers… (voir plus). In particular, long convolution sequence models have achieved state-of-the-art performance in many domains, but incur a significant cost during auto-regressive inference workloads – naively requiring a full pass (or caching of activations) over the input sequence for each generated token – similarly to attention-based models. In this paper, we seek to enable O (1) compute and memory cost per token in any pre-trained long convolution architecture to reduce memory footprint and increase throughput during generation. Concretely, our methods consist in extracting low-dimensional linear state-space models from each convolution layer, building upon rational interpolation and model-order reduction techniques. We further introduce architectural improvements to convolution-based layers such as Hyena : by weight-tying the filters across channels into heads , we achieve higher pre-training quality and reduce the number of filters to be distilled. The resulting model achieves 10 × higher throughput than Transformers and 1 . 5 × higher than Hyena at 1 . 3 B parameters, without any loss in quality after distillation.
Cognitive Models as Simulators: Using Cognitive Models to Tap into Implicit Human Feedback
Ardavan S Nobandegani
Thomas R. Shultz
In this work, we substantiate the idea of cognitive models as simulators , which is to have AI systems interact with, and collect feedback f… (voir plus)rom, cognitive models instead of humans, thereby making the training process safer, cheaper, and faster. We leverage this idea in the context of learning a fair behavior toward a counterpart exhibiting various emotional states — as implicit human feedback. As a case study, we adopt the Ultima-tum game (UG), a canonical task in behavioral and brain sciences for studying fairness. We show that our reinforcement learning (RL) agents learn to exhibit differential, rationally-justified behaviors under various emotional states of their UG counterpart. We discuss the implications of our work for AI and cognitive science research, and its potential for interactive learning with implicit human feedback.
Deep PDE Solvers for Subgrid Modelling and Out-of-Distribution Generalization
Patrick Chatain
Adam M. Oberman
Climate and weather modelling (CWM) is an important area where ML models are used for subgrid modelling: making predictions of processes occ… (voir plus)urring at scales too small to be resolved by standard solution methods(Brasseur & Jacob, 2017). These models are expected to make accurate predictions, even on out-of-distribution (OOD) data, and are additionally expected to respect important physical constraints of the ground truth model (Kashinath et al., 2021). While many specialized ML PDE solvers have been developed, the particular requirements of CWM models have not been addressed so far. The goal of this work is to address them. We propose and develop a novel architecture, which matches or exceeds the performance of standard ML models, and which demonstrably succeeds in OOD generalization. The architecture is based on expert knowledge of the structure of PDE solution operators, which permits the model to also obey important physical constraints
Learning Optimizers for Local SGD
Charles-Étienne Joseph
Benjamin Thérien
Abhinav Moudgil
Boris Knyazev
Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches co… (voir plus)mpute multiple gradient steps locally, that is on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art optimizers for deep learning. In this work, we incorporate local optimizers that compute multiple updates into a learned optimization framework, allowing to meta-learn potentially more efficient local SGD algorithms. Our results demonstrate that local learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. We show that the learned optimizers can generalize to new datasets and architectures, demonstrating the potential of learned optimizers for improving communication-efficient distributed learning.
Physics-Informed Transformer Networks
F. Dos
Santos
Tara Akhound-Sadegh
Physics-informed neural networks (PINNs) have been recognized as a viable alternative to conventional numerical solvers for Partial Differen… (voir plus)tial Equations (PDEs). The main appeal of PINNs is that since they directly enforce the PDE equation, one does not require access to costly ground truth solutions for training the model. However, a key challenge is their limited generalization across varied initial conditions. Addressing this, our study presents a novel Physics-Informed Transformer (PIT) model for learning the solution operator for PDEs. Using the attention mechanism, PIT learns to leverage the relationships between its initial condition and query points, resulting in a significant improvement in generalization. Moreover, in contrast to existing physics-informed networks, our model is invariant to the discretization of the input domain, providing great flexibility in problem specification and training. We validated our proposed method on the 1D Burgers’ and the 2D Heat equations, demonstrating notable improvement over standard PINN models for operator learning with negligible computational overhead.
On the Varied Faces of Overparameterization in Supervised and Self-Supervised Learning
Matteo Gamba
Arna Ghosh
Kumar Krishna
Agrawal
Blake A. Richards
Hossein Azizpour
Mårten Björkman
The quality of the representations learned by neural networks depends on several factors, including the loss function, learning algorithm, a… (voir plus)nd model architecture. In this work, we use information geometric measures to assess the representation quality in a principled manner. We demonstrate that the sensitivity of learned representations to input perturbations, measured by the spectral norm of the feature Jacobian, provides valuable information about downstream generalization. On the other hand, measuring the coefficient of spectral decay observed in the eigen-spectrum of feature covariance provides insights into the global representation geometry. First, we empirically establish an equivalence between these notions of representation quality and show that they are inversely correlated. Second, our analysis reveals the varying roles that overparameterization plays in improving generalization. Unlike supervised learning, we observe that increasing model width leads to higher discriminability and less smoothness in the self-supervised regime. Furthermore, we report that there is no observable double descent phenomenon in SSL with non-contrastive objectives for commonly used parameterization regimes, which opens up new opportunities for tight asymptotic analysis. Taken together, our results provide a loss-aware characterization of the different role of overparam-eterization in supervised and self-supervised learning.