Publications

Think Before You Act: Decision Transformers with Working Memory

Jikun Kang

Romain Laroche

Xingdi Yuan

Adam Trischler

Jie Fu

Decision Transformer-based decision-making agents have shown the ability to generalize across multiple tasks. However, their performance rel… (see more)ies on massive data and computation. We argue that this inefficiency stems from the forgetting phenomenon, in which a model memorizes its behaviors in parameters throughout training. As a result, training on a new task may deteriorate the model’s performance on previous tasks. In contrast to LLMs’ implicit memory mechanism, the human brain utilizes distributed memory storage, which helps manage and organize multiple skills efficiently, mitigating the forgetting phenomenon. Inspired by this, we propose a working memory module to store, blend, and retrieve information for different downstream tasks. Evaluation results show that the proposed method improves training efficiency and generalization in Atari games and Meta-World object manipulation tasks. Moreover, we demonstrate that memory fine-tuning further enhances the adaptability of the proposed architecture.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

Towards Modular LLMs by Building and Reusing a Library of LoRAs

Edoardo Ponti

Matheus Pereira

Lucas Caccia

Alessandro Sordoni

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

doi.org

openreview.net

Do Transformer World Models Give Better Policy Gradients?

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

doi.org

openreview.net

Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues

Antonio Orvieto

Soham De

Caglar Gulcehre

Razvan Pascanu

Samuel L. Smith

Deep neural networks based on linear RNNs interleaved with position-wise MLPs are gaining traction as competitive approaches for sequence mo… (see more)deling. Examples of such architectures include state-space models (SSMs) like S4, LRU, and Mamba: recently proposed models that achieve promising performance on text, genetics, and other data that require long-range reasoning. Despite experimental evidence highlighting these architectures’ effectiveness and computational efficiency, their expressive power remains relatively unexplored, especially in connection to specific choices crucial in practice - e.g., carefully designed initialization distribution and potential use of complex numbers. In this paper, we show that combining MLPs with both real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps. At the heart of our proof, we rely on a separation of concerns: the linear RNN provides a lossless encoding of the input sequence, and the MLP performs non-linear processing on this encoding. While we show that real diagonal linear recurrences are enough to achieve universality in this architecture, we prove that employing complex eigenvalues near unit disk - i.e., empirically the most successful strategy in S4 - greatly helps the RNN in storing information. We connect this finding with the vanishing gradient issue and provide experiments supporting our claims.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

arxiv.org

Unsupervised Concept Discovery Mitigates Spurious Correlations

Md Rifat Arefin

Yan Zhang

Aristide Baratin

Francesco Locatello

Irina Rish

Dianbo Liu

Kenji Kawaguchi

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

doi.org

openreview.net

In value-based deep reinforcement learning, a pruned network is a good network

Johan Samir Obando Ceron

Aaron Courville

Pablo Samuel Castro

Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage pri… (see more)or insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables {value-based} agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks, using only a small fraction of the full network parameters. Our code is publicly available, see Appendix A for details.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

openreview.net

When is Transfer Learning Possible?

My Phan

Kianté Brantley

Stephanie Milani

Soroush Mehri

Gokul Swamy

Geoff Gordon

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin

Maxime Gasse

Massimo Caccia

Issam Hadj Laradji

Manuel Del Verme

David Vazquez

Alexandre Lacoste

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

doi.org

openreview.net

No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths

Charles Guille-escuret

Hiroki Naganuma

Kilian FATRAS

Ioannis Mitliagkas

Understanding the optimization dynamics of neural networks is necessary for closing the gap between theory and practice. Stochastic first-or… (see more)der optimization algorithms are known to efficiently locate favorable minima in deep neural networks. This efficiency, however, contrasts with the non-convex and seemingly complex structure of neural loss landscapes. In this study, we delve into the fundamental geometric properties of sampled gradients along optimization paths. We focus on two key quantities, which appear in the restricted secant inequality and error bound. Both hold high significance for first-order optimization. Our analysis reveals that these quantities exhibit predictable, consistent behavior throughout training, despite the stochasticity induced by sampling minibatches. Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training. These observed properties are sufficiently expressive to theoretically guarantee linear convergence and prescribe learning rate schedules mirroring empirical practices. We conduct our experiments on image classification, semantic segmentation and language modeling across different batch sizes, network architectures, datasets, optimizers, and initialization seeds. We discuss the impact of each factor. Our work provides novel insights into the properties of neural network loss functions, and opens the door to theoretical frameworks more relevant to prevalent practice.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

doi.org

openreview.net

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi

Albert Manuel Orozco Camacho

Eugene Belilovsky

Guy Wolf

Combining the predictions of multiple trained models through ensembling is generally a good way to improve accuracy by leveraging the differ… (see more)ent learned features of the models, however it comes with high computational and storage costs. Model fusion, the act of merging multiple models into one by combining their parameters reduces these costs but doesn't work as well in practice. Indeed, neural network loss landscapes are high-dimensional and non-convex and the minima found through learning are typically separated by high loss barriers. Numerous recent works have been focused on finding permutations matching one network features to the features of a second one, lowering the loss barrier on the linear path between them in parameter space. However, permutations are restrictive since they assume a one-to-one mapping between the different models' neurons exists. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the model features. We show that our alignment method leads to better performances than past methods when averaging models trained on the same, or differing data splits. We also extend this analysis into the harder setting where more than 2 models are merged, and we find that CCA Merge works significantly better than past methods. Our code is publicly available at https://github.com/shoroi/align-n-merge

2024-07-07

ArXiv (preprint)

doi.org

arxiv.org

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi

Albert Manuel Orozco Camacho

Eugene Belilovsky

Guy Wolf

Combining the predictions of multiple trained models through ensembling is generally a good way to improve accuracy by leveraging the differ… (see more)ent learned features of the models, however it comes with high computational and storage costs. Model fusion, the act of merging multiple models into one by combining their parameters reduces these costs but doesn't work as well in practice. Indeed, neural network loss landscapes are high-dimensional and non-convex and the minima found through learning are typically separated by high loss barriers. Numerous recent works have been focused on finding permutations matching one network features to the features of a second one, lowering the loss barrier on the linear path between them in parameter space. However, permutations are restrictive since they assume a one-to-one mapping between the different models' neurons exists. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the model features. We show that our alignment method leads to better performances than past methods when averaging models trained on the same, or differing data splits. We also extend this analysis into the harder setting where more than 2 models are merged, and we find that CCA Merge works significantly better than past methods. Our code is publicly available at https://github.com/shoroi/align-n-merge

2024-07-07

ArXiv (preprint)

doi.org

arxiv.org

Prospective Messaging: Learning in Networks with Communication Delays

Ryan Fayyazi

Christian Dietrich Weilbach

Frank N. Wood

Inter-neuron communication delays are ubiquitous in physically realized neural networks such as biological neural circuits and neuromorphic … (see more)hardware. These delays have significant and often disruptive consequences on network dynamics during training and inference. It is therefore essential that communication delays be accounted for, both in computational models of biological neural networks and in large-scale neuromorphic systems. Nonetheless, communication delays have yet to be comprehensively addressed in either domain. In this paper, we first show that delays prevent state-of-the-art continuous-time neural networks called Latent Equilibrium (LE) networks from learning even simple tasks despite significant overparameterization. We then propose to compensate for communication delays by predicting future signals based on currently available ones. This conceptually straightforward approach, which we call prospective messaging (PM), uses only neuron-local information, and is flexible in terms of memory and computation requirements. We demonstrate that incorporating PM into delayed LE networks prevents reaction lags, and facilitates successful learning on Fourier synthesis and autoregressive video prediction tasks.

2024-07-07

ArXiv (preprint)

doi.org

arxiv.org

AI Advantage

Leveraging AI for a Sustainable Future

Mila AI Policy Fellowship

AI Advantage

Leveraging AI for a Sustainable Future

Publications

AI Advantage

Leveraging AI for a Sustainable Future

Mila AI Policy Fellowship

AI Advantage

Leveraging AI for a Sustainable Future

Popular keywords:

Publications