Publications

Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density

Randall Balestriero

Michael G. Rabbat

Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine… (see more) two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more--it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used--in any case one can compute the learned probabilities of sample

2025-10-06

ArXiv (preprint)

doi.org

arxiv.org

Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise

Luca Scimeca

Thomas Jiralerspong

Berton Earnshaw

Jason Hartford

Yoshua Bengio

2025-10-06

ArXiv (preprint)

doi.org

arxiv.org

Mitigating Disparate Impact of Differential Privacy in Federated Learning through Robust Clustering

Saber Malekmohammadi

Afaf Taïk

Golnoosh Farnadi

Federated Learning (FL) is a decentralized machine learning (ML) approach that keeps data localized and often incorporates Differential Priv… (see more)acy (DP) to enhance privacy guarantees. Similar to previous work on DP in ML, we observed that differentially private federated learning (DPFL) introduces performance disparities, particularly affecting minority groups. Recent work has attempted to address performance fairness in vanilla FL through clustering, but this method remains sensitive and prone to errors, which are further exacerbated by the DP noise in DPFL. To fill this gap, in this paper, we propose a novel clustered DPFL algorithm designed to effectively identify clients' clusters in highly heterogeneous settings while maintaining high accuracy with DP guarantees. To this end, we propose to cluster clients based on both their model updates and training loss values. Our proposed approach also addresses the server's uncertainties in clustering clients' model updates by employing larger batch sizes along with Gaussian Mixture Model (GMM) to alleviate the impact of noise and potential clustering errors, especially in privacy-sensitive scenarios. We provide theoretical analysis of the effectiveness of our proposed approach. We also extensively evaluate our approach across diverse data distributions and privacy budgets and show its effectiveness in mitigating the disparate impact of DP in FL settings with a small computational cost.

2025-10-06

TMLR (accepted)

doi.org

openreview.net

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

Pranav Atreya

Karl Pertsch

Tony Lee

Moo Jin Kim

Arhan Jain

Artur Kuramshin

Cyrus Neary

Edward S. Hu

Kanav Arora

Kirsty Ellis

Luca Macesanu

Matthew Leonard

Meedeum Cho

Özgür Aslan

Shivin Dass

Tony Wang

Xingfang Yuan

Abhishek Gupta

Dinesh Jayaraman

Glen Berseth … (see 6 more)

Kostas Daniilidis

Roberto Martín-Martín

Youngwoon Lee

Percy Liang

Chelsea Finn

Sergey Levine

2025-10-06

Proceedings of The 8th Conference on Robot Learning (published)

proceedings.mlr.press

In silico neutron relative biological effectiveness estimations for Pre-DNA repair and post-DNA repair endpoints

Nicolas Desjardins

J. Kildea

A comprehensive understanding of the energy-dependent stochastic risks associated with neutron exposure is crucial to develop robust radiopr… (see more)otection systems. However, the scarcity of experimental data presents significant challenges in this domain. Track-structure Monte Carlo (TSMC) simulations with DNA models have demonstrated their potential to further our fundamental understanding of neutron-induced stochastic risks. To date, most TSMC studies on the relative biological effectiveness (RBE) of neutrons have focused on various types of DNA damage clusters defined using base pair distances. In this study, we extend these methodologies by incorporating the simulation of non-homologous end joining DNA repair in order to evaluate the RBE of neutrons for misrepairs. To achieve this, we adapted our previously published Monte Carlo DNA damage simulation pipeline, which combines condensed-history and TSMC methods, to support the standard DNA damage data format. This adaptation enabled seamless integration of neutron-induced DNA damage results with the DNA mechanistic repair simulator toolkit. Additionally, we developed a clustering algorithm that reproduces pre-repair endpoints studied in prior works, as well as novel damage clusters based on Euclidean distances. The neutron RBE for misrepairs obtained in this study exhibits a qualitatively similar shape as the RBE obtained for previously reported pre-repair endpoints. However, it peaks higher, reaching a maximum RBE value of 23(1) at a neutron energy of 0.5 MeV. Furthermore, we found that misrepair outcomes were better reproduced using the pre-repair endpoint defined with the Euclidean distance between double-strand breaks rather than with previously published pre-repair endpoints based on base-pair distances. The optimal maximal Euclidean distances were 18 nm for 0.5 MeV neutrons and 60 nm for 250 keV photons. Although this may indicate that Euclidean-distance-based clustering more accurately reflects the DNA damage configurations that lead to misrepairs, the fact that neutrons and photons require different distances raises doubts on whether a single, universal pre-repair endpoint can used as a stand-in for larger-scale aberrations across all radiation qualities.

2025-10-06

bioRxiv (preprint)

doi.org

Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics

Tyler Kastner

Mark Rowland

Yunhao Tang

Murat A Erdogdu

Amir-massoud Farahmand

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Compositional Risk Minimization

Charles Arnal

Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In… (see more) this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirms the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

CurvGAD: Leveraging Curvature for Enhanced Graph Anomaly Detection

Karish Grover

Geoff Gordon

Christos Faloutsos

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

Ali Saheb Pasand

Elvis Dopgima Dohmatob

Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization perf… (see more)ormance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is to make the learning process"grok"faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent, along different principal (i.e singular directions) of the gradients. We then propose a simple modification that normalizes the gradients so that dynamics along all the principal directions evolves at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity problem which this stagnation has been widely observed and intensively studied, that our proposed method eliminates the plateaus.

2025-10-05

ArXiv (preprint)

doi.org

arxiv.org

Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts

Marta Skreta

Tara Akhound-Sadegh

Viktor Ohanesian

Roberto Bondesan

Alán Aspuru-Guzik

Arnaud Doucet

Rob Brekelmans

Alexander Tong

Kirill Neklyudov

While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling infere… (see more)nce-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional `corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

From Language Models over Tokens to Language Models over Characters

Tim Vieira

Benjamin LeBrun

Mario Giulianelli

Juan Luis Gastaldi

Brian DuSell

John Terilla

Timothy J. O'Donnell

Ryan Cotterell

Modern language models are internally -- and mathematically -- distributions over …

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Generalization Bounds via Meta-Learned Model Representations: PAC-Bayes and Sample Compression Hypernetworks

Benjamin Leblanc

Mathieu Bazinet

Nathaniel D'Amours

Alexandre Drouin

Pascal Germain

Both PAC-Bayesian and Sample Compress learning frameworks have been shown instrumental for deriving tight (non-vacuous) generalization bound… (see more)s for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction in order to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Publications

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Popular keywords:

Publications