Publications

Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
Randall Balestriero
Michael G. Rabbat
Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more--it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used--in any case one can compute the learned probabilities of sample
Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Berton Earnshaw
Jason Hartford
Mitigating Disparate Impact of Differential Privacy in Federated Learning through Robust Clustering
Federated Learning (FL) is a decentralized machine learning (ML) approach that keeps data localized and often incorporates Differential Privacy (DP) to enhance privacy guarantees. Similar to previous work on DP in ML, we observed that differentially private federated learning (DPFL) introduces performance disparities, particularly affecting minority groups. Recent work has attempted to address performance fairness in vanilla FL through clustering, but this method remains sensitive and prone to errors, which are further exacerbated by the DP noise in DPFL. To fill this gap, in this paper, we propose a novel clustered DPFL algorithm designed to effectively identify clients' clusters in highly heterogeneous settings while maintaining high accuracy with DP guarantees. To this end, we propose to cluster clients based on both their model updates and training loss values. Our proposed approach also addresses the server's uncertainties in clustering clients' model updates by employing larger batch sizes along with a Gaussian Mixture Model (GMM) to alleviate the impact of noise and potential clustering errors, especially in privacy-sensitive scenarios. We provide a theoretical analysis of the effectiveness of our proposed approach. We also extensively evaluate our approach across diverse data distributions and privacy budgets and show its effectiveness in mitigating the disparate impact of DP in FL settings with a small computational cost.
RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies
Pranav Atreya
Karl Pertsch
Tony Lee
Moo Jin Kim
Arhan Jain
Cyrus Neary
Edward S. Hu
Kanav Arora
Luca Macesanu
Matthew Leonard
Meedeum Cho
Shivin Dass
Tony Wang
Xingfang Yuan
Abhishek Gupta
Dinesh Jayaraman
Kostas Daniilidis
Roberto Martín-Martín
Youngwoon Lee
Percy Liang
Chelsea Finn
Sergey Levine
In silico neutron relative biological effectiveness estimations for Pre-DNA repair and post-DNA repair endpoints
Nicolas Desjardins
J. Kildea
A comprehensive understanding of the energy-dependent stochastic risks associated with neutron exposure is crucial to develop robust radioprotection systems. However, the scarcity of experimental data presents significant challenges in this domain. Track-structure Monte Carlo (TSMC) simulations with DNA models have demonstrated their potential to further our fundamental understanding of neutron-induced stochastic risks. To date, most TSMC studies on the relative biological effectiveness (RBE) of neutrons have focused on various types of DNA damage clusters defined using base pair distances. In this study, we extend these methodologies by incorporating the simulation of non-homologous end joining DNA repair in order to evaluate the RBE of neutrons for misrepairs. To achieve this, we adapted our previously published Monte Carlo DNA damage simulation pipeline, which combines condensed-history and TSMC methods, to support the standard DNA damage data format. This adaptation enabled seamless integration of neutron-induced DNA damage results with the DNA mechanistic repair simulator toolkit. Additionally, we developed a clustering algorithm that reproduces pre-repair endpoints studied in prior works, as well as novel damage clusters based on Euclidean distances. The neutron RBE for misrepairs obtained in this study exhibits a shape qualitatively similar to that of the RBE obtained for previously reported pre-repair endpoints. However, it peaks higher, reaching a maximum RBE value of 23(1) at a neutron energy of 0.5 MeV. Furthermore, we found that misrepair outcomes were better reproduced using the pre-repair endpoint defined with the Euclidean distance between double-strand breaks rather than with previously published pre-repair endpoints based on base-pair distances. The optimal maximal Euclidean distances were 18 nm for 0.5 MeV neutrons and 60 nm for 250 keV photons. Although this may indicate that Euclidean-distance-based clustering more accurately reflects the DNA damage configurations that lead to misrepairs, the fact that neutrons and photons require different distances raises doubts about whether a single, universal pre-repair endpoint can be used as a stand-in for larger-scale aberrations across all radiation qualities.
Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics
Mark Rowland
Yunhao Tang
Murat A Erdogdu
Compositional Risk Minimization
Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
CurvGAD: Leveraging Curvature for Enhanced Graph Anomaly Detection
Karish Grover
Geoff Gordon
Christos Faloutsos
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
Elvis Dopgima Dohmatob
Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all the principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity, where this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.
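To make the normalization idea in this abstract concrete, here is a minimal sketch of one way a matrix-shaped gradient could be rescaled so that every singular direction moves at the same speed. It is an illustration under stated assumptions, not necessarily the authors' exact EGD update: the function name `egalitarian_direction` and the `rel_tol` cutoff are hypothetical, and the approach shown (replace all non-negligible singular values by 1) is only one reading of "normalizes the gradients".

```python
import numpy as np

def egalitarian_direction(grad, rel_tol=1e-10):
    """Return an update direction whose retained singular values all equal 1.

    Hypothetical helper: takes the SVD grad = U diag(s) V^T and returns U V^T,
    dropping directions whose singular value is negligible relative to the
    largest one (rel_tol is an illustrative choice, not from the paper).
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    if s.size == 0 or s[0] == 0.0:
        # Zero gradient: there is no direction to equalize.
        return np.zeros_like(grad)
    keep = s > rel_tol * s[0]
    return u[:, keep] @ vt[keep, :]

# A gradient with very uneven singular values (100 versus 0.01).
grad = np.array([[100.0, 0.0],
                 [0.0, 0.01]])
step = egalitarian_direction(grad)
print(np.linalg.svd(step, compute_uv=False))
```

With plain gradient descent the second direction above would move 10,000 times more slowly than the first; after the rescaling both directions receive a step of the same size.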
Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts
Viktor Ohanesian
Roberto Bondesan
Alán Aspuru-Guzik
Arnaud Doucet
Rob Brekelmans
Kirill Neklyudov
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional 'corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.
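The "simple heuristic" of mixing conditional and unconditional scores mentioned in this abstract can be sketched as follows. This shows only the standard classifier-free guidance combination on a toy 1-D Gaussian, not the Feynman-Kac Corrector scheme itself. The function names and the guidance-weight convention (weight `w` on the conditional score, `1 - w` on the unconditional one) are assumptions made for the example; other write-ups parameterize the weight differently.

```python
def gaussian_score(x, mean, sigma):
    """Score (gradient of the log-density in x) of a 1-D Gaussian N(mean, sigma^2)."""
    return -(x - mean) / sigma**2

def guided_score(score_cond, score_uncond, w):
    """Linear mix of the two scores; w = 1 recovers the purely conditional score."""
    return score_uncond + w * (score_cond - score_uncond)

# Toy setup (illustrative values): conditional mean 2, unconditional mean 0,
# shared standard deviation 1, guidance weight 3.
mu_cond, mu_uncond, sigma, w = 2.0, 0.0, 1.0, 3.0

# For equal-variance Gaussians the mixed score is again a Gaussian score whose
# mean is w * mu_cond + (1 - w) * mu_uncond, so it vanishes at that point.
x_star = w * mu_cond + (1.0 - w) * mu_uncond
s = guided_score(gaussian_score(x_star, mu_cond, sigma),
                 gaussian_score(x_star, mu_uncond, sigma), w)
print(x_star, s)
```

In this toy case the mixed score is the exact score of the geometric average of the two densities at a single noise level. The abstract's point is that simulating with such mixed scores across noise levels does not, in general, follow the intermediate distributions, which is the gap the weighted correctors are meant to address.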
From Language Models over Tokens to Language Models over Characters
Tim Vieira
Mario Giulianelli
Juan Luis Gastaldi
Brian DuSell
John Terilla
Timothy J. O'Donnell
Ryan Cotterell
Modern language models are internally -- and mathematically -- distributions over …
Generalization Bounds via Meta-Learned Model Representations: PAC-Bayes and Sample Compression Hypernetworks
Nathaniel D'Amours
Pascal Germain
Both PAC-Bayesian and Sample Compress learning frameworks have been shown to be instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction in order to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.