Publications

Model evaluation for extreme risks
Toby Shevlane
Sebastian Farquhar
Ben Garfinkel
Mary Phuong
Jess Whittlestone
Jade Leung
Daniel Kokotajlo
Nahema A. Marchal
Markus Anderljung
Noam Kolt
Lewis Ho
Divya Siddarth
Shahar Avin
W. Hawkins
Been Kim
Iason Gabriel
Vijay Bolina
Jack Clark
Paul F. Christiano … (see 1 more)
Allan Dafoe
De novo motor learning creates structure in neural activity space that shapes adaptation
Joanna C. Chang
Lee Miller
Juan A. Gallego
Claudia Clopath
Realistically distributing object placements in synthetic training data improves the performance of vision-based object detection models
Setareh Dabiri
Vasileios Lioutas
Berend Zwartsenberg
Yunpeng Liu
Matthew Niedoba
Xiaoxuan Liang
Dylan Green
Justice Sefas
Jonathan Wilder Lavington
Adam Ścibior
When training object detection models on synthetic data, it is important to make the distribution of synthetic data as close as possible to … (see more)the distribution of real data. We investigate specifically the impact of object placement distribution, keeping all other aspects of synthetic data fixed. Our experiment, training a 3D vehicle detection model in CARLA and testing on KITTI, demonstrates a substantial improvement resulting from improving the object placement distribution.
Think Before You Act: Decision Transformers with Internal Working Memory
Jikun Kang
Romain Laroche
Xingdi Yuan
Adam P. Trischler
Xuefei Liu
Jie Fu
Large language model (LLM)-based decision-making agents have shown the ability to generalize across multiple tasks. However, their performan… (see more)ce relies on massive data and compute. We argue that this inefficiency stems from the forgetting phenomenon, in which a model memorizes its behaviors in parameters throughout training. As a result, training on a new task may deteriorate the model's performance on previous tasks. In contrast to LLMs' implicit memory mechanism, the human brain utilizes distributed memory storage, which helps manage and organize multiple skills efficiently, mitigating the forgetting phenomenon. Thus inspired, we propose an internal working memory module to store, blend, and retrieve information for different downstream tasks. Evaluation results show that the proposed method improves training efficiency and generalization in both Atari games and meta-world object manipulation tasks. Moreover, we demonstrate that memory fine-tuning further enhances the adaptability of the proposed architecture.
Think Before You Act: Decision Transformers with Internal Working Memory
Jikun Kang
Romain Laroche
Xingdi Yuan
Adam Trischler
Jie Fu
Large language model (LLM)-based decision-making agents have shown the ability to generalize across multiple tasks. However, their performan… (see more)ce relies on massive data and compute. We argue that this inefficiency stems from the forgetting phenomenon, in which a model memorizes its behaviors in parameters throughout training. As a result, training on a new task may deteriorate the model's performance on previous tasks. In contrast to LLMs' implicit memory mechanism, the human brain utilizes distributed memory storage, which helps manage and organize multiple skills efficiently, mitigating the forgetting phenomenon. Thus inspired, we propose an internal working memory module to store, blend, and retrieve information for different downstream tasks. Evaluation results show that the proposed method improves training efficiency and generalization in both Atari games and meta-world object manipulation tasks. Moreover, we demonstrate that memory fine-tuning further enhances the adaptability of the proposed architecture.
Fourier Neural Operators for Arbitrary Resolution Climate Data Downscaling
Qidong Yang
Alex Hernandez-Garcia
Paula Harder
Venkatesh Ramesh
Prasanna Sattegeri
D. Szwarcman
C. Watson
Climate simulations are essential in guiding our understanding of climate change and responding to its effects. However, it is computational… (see more)ly expensive to resolve complex climate processes at high spatial resolution. As one way to speed up climate simulations, neural networks have been used to downscale climate variables from fast-running low-resolution simulations, but high-resolution training data are often unobtainable or scarce, greatly limiting accuracy. In this work, we propose a downscaling method based on the Fourier neural operator. It trains with data of a small upsampling factor and then can zero-shot downscale its input to arbitrary unseen high resolution. Evaluated both on ERA5 climate model data and on the Navier-Stokes equation solution data, our downscaling model significantly outperforms state-of-the-art convolutional and generative adversarial downscaling models, both in standard single-resolution downscaling and in zero-shot generalization to higher upsampling factors. Furthermore, we show that our method also outperforms state-of-the-art data-driven partial differential equation solvers on Navier-Stokes equations. Overall, our work bridges the gap between simulation of a physical process and interpolation of low-resolution output, showing that it is possible to combine both approaches and significantly improve upon each other.
Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning
Florian Bordes
Randall Balestriero
Quentin Garrido
Adrien Bardes
One unexpected technique that emerged in recent years consists in training a Deep Network (DN) with a Self-Supervised Learning (SSL) method,… (see more) and using this network on downstream tasks but with its last few projector layers entirely removed. This trick of throwing away the projector is actually critical for SSL methods to display competitive performances on ImageNet for which more than 30 percentage points can be gained that way. This is a little vexing, as one would hope that the network layer at which invariance is explicitly enforced by the SSL criterion during training (the last projector layer) should be the one to use for best generalization performance downstream. But it seems not to be, and this study sheds some light on why. This trick, which we name Guillotine Regularization (GR), is in fact a generically applicable method that has been used to improve generalization performance in transfer learning scenarios. In this work, we identify the underlying reasons behind its success and show that the optimal layer to use might change significantly depending on the training setup, the data or the downstream task. Lastly, we give some insights on how to reduce the need for a projector in SSL by aligning the pretext SSL task and the downstream task.
Identifying Critical Neurons in ANN Architectures using Mixed Integer Programming
Mostafa ElAraby
Should We Attend More or Less? Modulating Attention for Fairness
A. Zayed
Goncalo Mordido
Samira Shabanian
Data Imputation with an Autoencoder and MAGIC
Devin Eddington
Andres Felipe Duque Correa
Kevin R. Moon
Missing data is a common problem in many applications. Imputing missing values is a challenging task, as the imputations need to be accurate… (see more) and robust to avoid introducing bias in downstream analysis. In this paper, we propose an ensemble method that combines the strengths of a manifold learning-based imputation method called MAGIC and an autoencoder deep learning model. We call our method Deep MAGIC. Deep MAGIC is trained on a linear combination of the mean squared error of the original data and the mean squared error of the MAGIC-imputed data. Experimental results on three benchmark datasets show that Deep MAGIC outperforms several state-of-the-art imputation methods, demonstrating its effectiveness and robustness in handling large amounts of missing data.
Graph Fourier MMD for Signals on Graphs
Samuel Leone
Aarthi Venkat
Guillaume Huguet
Alexander Tong
Smita Krishnaswamy
While numerous methods have been proposed for computing distances between probability distributions in Euclidean space, relatively little at… (see more)tention has been given to computing such distances for distributions on graphs. However, there has been a marked increase in data that either lies on graph (such as protein interaction networks) or can be modeled as a graph (single cell data), particularly in the biomedical sciences. Thus, it becomes important to find ways to compare signals defined on such graphs. Here, we propose Graph Fourier MMD (GFMMD), a novel distance between distributions and signals on graphs. GFMMD is defined via an optimal witness function that is both smooth on the graph and maximizes the difference in expectation between the pair of distributions on the graph. We find an analytical solution to this optimization problem as well as an embedding of distributions that results from this method. We also prove several properties of this method including scale invariance and applicability to disconnected graphs. We showcase it on graph benchmark datasets as well on single cell RNA-sequencing data analysis. In the latter, we use the GFMMD-based gene embeddings to find meaningful gene clusters. We also propose a novel type of score for gene selection called gene localization score which helps select genes for cellular state space characterization.
Hybrid GRAND Sphere Decoding: Accelerated GRAND for Low-Rate Codes
Huayi Zhou
Guessing random additive noise decoding (GRAND) and sphere decoding (SD) are two algorithms that can achieve maximum likelihood decoding. In… (see more) this paper, a hybrid GRAND-SD (HGRAND) scheme is proposed to extend GRAND to low-rate codes. An accelerated GRAND decoder, assisted by a sphere decoder running in parallel and giving hints to it to allow skipping of certain candidates allows HGRAND to achieve a latency below the minimum latency of the individual component decoders while guaranteeing error-correction performance.