Think Before You Act: Decision Transformers with Working Memory
Jikun Kang
Romain Laroche
Xingdi Yuan
Adam Trischler
Jie Fu
Decision Transformer-based decision-making agents have shown the ability to generalize across multiple tasks. However, their performance rel… (voir plus)ies on massive data and computation. We argue that this inefficiency stems from the forgetting phenomenon, in which a model memorizes its behaviors in parameters throughout training. As a result, training on a new task may deteriorate the model’s performance on previous tasks. In contrast to LLMs’ implicit memory mechanism, the human brain utilizes distributed memory storage, which helps manage and organize multiple skills efficiently, mitigating the forgetting phenomenon. Inspired by this, we propose a working memory module to store, blend, and retrieve information for different downstream tasks. Evaluation results show that the proposed method improves training efficiency and generalization in Atari games and Meta-World object manipulation tasks. Moreover, we demonstrate that memory fine-tuning further enhances the adaptability of the proposed architecture.
Towards Modular LLMs by Building and Reusing a Library of LoRAs
Oleksiy Ostapenko
Zhan Su
Edoardo Ponti
Matheus Pereira
Lucas Caccia
Do Transformer World Models Give Better Policy Gradients?
Michel Ma
Tianwei Ni
Clement Gehring
Pierluca D'Oro
Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
Antonio Orvieto
Soham De
Caglar Gulcehre
Samuel L. Smith
Deep neural networks based on linear RNNs interleaved with position-wise MLPs are gaining traction as competitive approaches for sequence mo… (voir plus)deling. Examples of such architectures include state-space models (SSMs) like S4, LRU, and Mamba: recently proposed models that achieve promising performance on text, genetics, and other data that require long-range reasoning. Despite experimental evidence highlighting these architectures’ effectiveness and computational efficiency, their expressive power remains relatively unexplored, especially in connection to specific choices crucial in practice - e.g., carefully designed initialization distribution and potential use of complex numbers. In this paper, we show that combining MLPs with both real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps. At the heart of our proof, we rely on a separation of concerns: the linear RNN provides a lossless encoding of the input sequence, and the MLP performs non-linear processing on this encoding. While we show that real diagonal linear recurrences are enough to achieve universality in this architecture, we prove that employing complex eigenvalues near unit disk - i.e., empirically the most successful strategy in S4 - greatly helps the RNN in storing information. We connect this finding with the vanishing gradient issue and provide experiments supporting our claims.
Unsupervised Concept Discovery Mitigates Spurious Correlations
Md Rifat Arefin
Yan Zhang
Aristide Baratin
Francesco Locatello
Dianbo Liu
Kenji Kawaguchi
In value-based deep reinforcement learning, a pruned network is a good network
Johan Samir Obando Ceron
Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage pri… (voir plus)or insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables {value-based} agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks, using only a small fraction of the full network parameters. Our code is publicly available, see Appendix A for details.
When is Transfer Learning Possible?
My Phan
Kianté Brantley
Stephanie Milani
Soroush Mehri
Gokul Swamy
WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?
Massimo Caccia
Issam Hadj Laradji
Manuel Del Verme
Tom Marty
Léo Boisvert
Megh Thakkar
David Vazquez
Alexandre Lacoste
No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths
Charles Guille-Escuret
Hiroki Naganuma
Kilian FATRAS
Understanding the optimization dynamics of neural networks is necessary for closing the gap between theory and practice. Stochastic first-or… (voir plus)der optimization algorithms are known to efficiently locate favorable minima in deep neural networks. This efficiency, however, contrasts with the non-convex and seemingly complex structure of neural loss landscapes. In this study, we delve into the fundamental geometric properties of sampled gradients along optimization paths. We focus on two key quantities, which appear in the restricted secant inequality and error bound. Both hold high significance for first-order optimization. Our analysis reveals that these quantities exhibit predictable, consistent behavior throughout training, despite the stochasticity induced by sampling minibatches. Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training. These observed properties are sufficiently expressive to theoretically guarantee linear convergence and prescribe learning rate schedules mirroring empirical practices. We conduct our experiments on image classification, semantic segmentation and language modeling across different batch sizes, network architectures, datasets, optimizers, and initialization seeds. We discuss the impact of each factor. Our work provides novel insights into the properties of neural network loss functions, and opens the door to theoretical frameworks more relevant to prevalent practice.
Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis
Stefan Horoi
Albert Manuel Orozco Camacho
Combining the predictions of multiple trained models through ensembling is generally a good way to improve accuracy by leveraging the differ… (voir plus)ent learned features of the models, however it comes with high computational and storage costs. Model fusion, the act of merging multiple models into one by combining their parameters reduces these costs but doesn't work as well in practice. Indeed, neural network loss landscapes are high-dimensional and non-convex and the minima found through learning are typically separated by high loss barriers. Numerous recent works have been focused on finding permutations matching one network features to the features of a second one, lowering the loss barrier on the linear path between them in parameter space. However, permutations are restrictive since they assume a one-to-one mapping between the different models' neurons exists. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the model features. We show that our alignment method leads to better performances than past methods when averaging models trained on the same, or differing data splits. We also extend this analysis into the harder setting where more than 2 models are merged, and we find that CCA Merge works significantly better than past methods. Our code is publicly available at https://github.com/shoroi/align-n-merge
Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis
Stefan Horoi
Albert Manuel Orozco Camacho
Combining the predictions of multiple trained models through ensembling is generally a good way to improve accuracy by leveraging the differ… (voir plus)ent learned features of the models, however it comes with high computational and storage costs. Model fusion, the act of merging multiple models into one by combining their parameters reduces these costs but doesn't work as well in practice. Indeed, neural network loss landscapes are high-dimensional and non-convex and the minima found through learning are typically separated by high loss barriers. Numerous recent works have been focused on finding permutations matching one network features to the features of a second one, lowering the loss barrier on the linear path between them in parameter space. However, permutations are restrictive since they assume a one-to-one mapping between the different models' neurons exists. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the model features. We show that our alignment method leads to better performances than past methods when averaging models trained on the same, or differing data splits. We also extend this analysis into the harder setting where more than 2 models are merged, and we find that CCA Merge works significantly better than past methods. Our code is publicly available at https://github.com/shoroi/align-n-merge
Prospective Messaging: Learning in Networks with Communication Delays
Ryan Fayyazi
Christian Dietrich Weilbach
Frank Wood
Inter-neuron communication delays are ubiquitous in physically realized neural networks such as biological neural circuits and neuromorphic … (voir plus)hardware. These delays have significant and often disruptive consequences on network dynamics during training and inference. It is therefore essential that communication delays be accounted for, both in computational models of biological neural networks and in large-scale neuromorphic systems. Nonetheless, communication delays have yet to be comprehensively addressed in either domain. In this paper, we first show that delays prevent state-of-the-art continuous-time neural networks called Latent Equilibrium (LE) networks from learning even simple tasks despite significant overparameterization. We then propose to compensate for communication delays by predicting future signals based on currently available ones. This conceptually straightforward approach, which we call prospective messaging (PM), uses only neuron-local information, and is flexible in terms of memory and computation requirements. We demonstrate that incorporating PM into delayed LE networks prevents reaction lags, and facilitates successful learning on Fourier synthesis and autoregressive video prediction tasks.