Publications
Mitigating Disparate Impact of Differential Privacy in Federated Learning through Robust Clustering
Federated Learning (FL) is a decentralized machine learning (ML) approach that keeps data localized and often incorporates Differential Privacy (DP) to enhance privacy guarantees. Similar to previous work on DP in ML, we observed that differentially private federated learning (DPFL) introduces performance disparities, particularly affecting minority groups. Recent work has attempted to address performance fairness in vanilla FL through clustering, but this method remains sensitive and prone to errors, which are further exacerbated by the DP noise in DPFL. To fill this gap, we propose a novel clustered DPFL algorithm designed to effectively identify clients' clusters in highly heterogeneous settings while maintaining high accuracy under DP guarantees. To this end, we cluster clients based on both their model updates and their training loss values. Our approach also addresses the server's uncertainty in clustering clients' model updates by employing larger batch sizes along with a Gaussian Mixture Model (GMM) to alleviate the impact of noise and potential clustering errors, especially in privacy-sensitive scenarios. We provide a theoretical analysis of the effectiveness of our proposed approach. We also extensively evaluate it across diverse data distributions and privacy budgets and show its effectiveness in mitigating the disparate impact of DP in FL settings at a small computational cost.
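The core clustering idea, assigning clients to groups from their noisy model updates via a Gaussian mixture, can be sketched with a toy setup. Everything below (the two synthetic client populations, the noise scale, the minimal EM routine) is illustrative and not the paper's actual algorithm or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not the paper's data): two client populations whose
# local model updates differ, each observed through Gaussian DP noise.
centers = np.array([[2.0, 2.0], [-2.0, -2.0]])
true_labels = rng.integers(0, 2, size=200)
updates = centers[true_labels] + rng.normal(scale=0.8, size=(200, 2))

def em_gmm(x, iters=50):
    """Minimal EM for a 2-component spherical Gaussian mixture."""
    # Farthest-point initialization keeps the two starting means separated.
    mu0 = x[0]
    mu1 = x[((x - mu0) ** 2).sum(1).argmax()]
    mu = np.stack([mu0, mu1])
    for _ in range(iters):
        # E-step: soft responsibilities from squared distances to each mean.
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-0.5 * (d2 - d2.min(1, keepdims=True)))
        resp /= resp.sum(1, keepdims=True)
        # M-step: responsibility-weighted means.
        mu = (resp[:, :, None] * x[:, None, :]).sum(0) / resp.sum(0)[:, None]
    return mu, resp.argmax(1)

mu, assign = em_gmm(updates)  # recovered cluster means and assignments
```

The soft E-step responsibilities are what make a GMM more robust here than hard assignment: a noisy update near the boundary contributes partially to both cluster means rather than being forced into one.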
In silico neutron relative biological effectiveness estimations for pre-DNA repair and post-DNA repair endpoints
Nicolas Desjardins
J. Kildea
A comprehensive understanding of the energy-dependent stochastic risks associated with neutron exposure is crucial to develop robust radioprotection systems. However, the scarcity of experimental data presents significant challenges in this domain. Track-structure Monte Carlo (TSMC) simulations with DNA models have demonstrated their potential to further our fundamental understanding of neutron-induced stochastic risks. To date, most TSMC studies on the relative biological effectiveness (RBE) of neutrons have focused on various types of DNA damage clusters defined using base pair distances. In this study, we extend these methodologies by incorporating the simulation of non-homologous end joining DNA repair in order to evaluate the RBE of neutrons for misrepairs. To achieve this, we adapted our previously published Monte Carlo DNA damage simulation pipeline, which combines condensed-history and TSMC methods, to support the standard DNA damage data format. This adaptation enabled seamless integration of neutron-induced DNA damage results with the DNA mechanistic repair simulator toolkit. Additionally, we developed a clustering algorithm that reproduces pre-repair endpoints studied in prior works, as well as novel damage clusters based on Euclidean distances. The neutron RBE for misrepairs obtained in this study exhibits a qualitatively similar shape to the RBE obtained for previously reported pre-repair endpoints. However, it peaks higher, reaching a maximum RBE value of 23(1) at a neutron energy of 0.5 MeV. Furthermore, we found that misrepair outcomes were better reproduced using the pre-repair endpoint defined with the Euclidean distance between double-strand breaks rather than with previously published pre-repair endpoints based on base-pair distances. The optimal maximal Euclidean distances were 18 nm for 0.5 MeV neutrons and 60 nm for 250 keV photons.
Although this may indicate that Euclidean-distance-based clustering more accurately reflects the DNA damage configurations that lead to misrepairs, the fact that neutrons and photons require different distances raises doubts about whether a single, universal pre-repair endpoint can be used as a stand-in for larger-scale aberrations across all radiation qualities.
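The Euclidean-distance endpoint amounts to single-linkage grouping: two double-strand breaks belong to the same damage cluster whenever their separation is within the maximal distance (18 nm for 0.5 MeV neutrons in the study). A minimal sketch with made-up DSB coordinates and a union-find pass:

```python
import numpy as np

# Illustrative DSB coordinates in nanometres (made up, not simulation output).
dsb_xyz = np.array([
    [0.0, 0.0, 0.0],
    [10.0, 5.0, 0.0],   # ~11.2 nm from the first break: same cluster
    [100.0, 0.0, 0.0],  # far from both: its own cluster
])
d_max = 18.0  # maximal Euclidean distance for 0.5 MeV neutrons (from the study)

def cluster_dsbs(points, d_max):
    """Single-linkage clustering: breaks within d_max share a cluster."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= d_max:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]

labels = cluster_dsbs(dsb_xyz, d_max)
```

Note the transitivity this implies: two breaks more than d_max apart still merge if a chain of intermediate breaks connects them, which is characteristic of single-linkage endpoints.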
Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale benchmark dataset for this problem, leveraging medium-resolution multi-spectral satellite imagery from Planet Labs. Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient and accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.
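The calibration target implied by "balancing conformational states based on their energy differences" is the Boltzmann weighting exp(-E_i/kT). A minimal numerical sketch, with made-up state energies already expressed in units of kT (this is the physical target, not the paper's EBA training objective):

```python
import numpy as np

# Three hypothetical conformational states with made-up energies (kT units).
E = np.array([0.0, 1.0, 2.5])

# Boltzmann weights: lower-energy states should receive more probability mass.
weights = np.exp(-E)
target = weights / weights.sum()  # the distribution a calibrated model targets
```

Only energy *differences* matter here: shifting all entries of `E` by a constant leaves `target` unchanged, which is why alignment from pairwise energy differences is sufficient.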
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
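The additive-energy structure is what makes extrapolation possible: the joint energy of an attribute combination is the sum of per-attribute terms, so a combination never seen jointly at training still has a well-defined score. A toy numpy sketch with two attributes and made-up linear energy heads (illustrative of the additivity only, not of CRM's adjustment step):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)             # a feature vector
W_shape = rng.normal(size=(2, 4))  # hypothetical energy head: shape attribute
W_color = rng.normal(size=(2, 4))  # hypothetical energy head: color attribute

E_shape = W_shape @ x              # one energy per shape value
E_color = W_color @ x              # one energy per color value

# Additivity: joint energy of combination (s, c) is E_shape[s] + E_color[c],
# including combinations (s, c) absent from the training distribution.
E = E_shape[:, None] + E_color[None, :]
probs = np.exp(-E) / np.exp(-E).sum()  # softmax over all 4 combinations
```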
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional "corrector" steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.
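The SMC ingredient can be illustrated in isolation: particles carry weights from a potential, then are resampled in proportion to those weights so that particle mass shifts toward the reweighted distribution. The Gaussian prior and toy log-potential below are stand-ins, not the FKC weights derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
particles = rng.normal(size=1000)        # draws from a N(0, 1) "prior"

# Toy log-potential favoring x near 1 (illustrative; the product of the
# prior and this potential is a Gaussian centered at 0.5).
log_w = -0.5 * (particles - 1.0) ** 2
w = np.exp(log_w - log_w.max())          # subtract max for numerical stability
w /= w.sum()

# Multinomial resampling: duplicate high-weight particles, drop low-weight ones.
idx = rng.choice(len(particles), size=len(particles), p=w)
resampled = particles[idx]               # mass shifts toward x around 0.5
```

Resampling keeps the particle population unweighted between steps, which is what lets a sequence of such corrections track each intermediate distribution rather than only the final one.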
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
Both PAC-Bayesian and Sample Compress learning frameworks have been shown to be instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction in order to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.
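The encode-then-decode hypernetwork shape, a dataset compressed into a fixed-size embedding, then decoded into the weights of a downstream predictor, can be sketched in a few lines. All maps below (the label-weighted mean encoding, the linear decoder) are made-up stand-ins for the learned encoders and decoders in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))   # a small dataset input (features)
y = np.sign(X[:, 0])           # toy labels determined by the first feature

# Encoder: a permutation-invariant dataset summary (label-weighted feature
# mean). Real encoders here would be PAC-Bayesian or Sample Compress maps.
embed = (y[:, None] * X).mean(axis=0)

# Decoder: a made-up linear map from the embedding to predictor weights.
W_dec = np.eye(5) + 0.1 * rng.normal(size=(5, 5))
w = W_dec @ embed              # parameters of the downstream linear predictor

preds = np.sign(X @ w)         # downstream predictor applied to the data
```

The point of the architecture is that everything the decoder knows about the dataset passes through `embed`; that narrow junction is what the generalization bounds in the paper analyze.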
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
We propose a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradient-free approach preserves GRF's theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method achieves a several-fold speedup over standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications.
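The Jacobian-free idea can be shown on the simplest possible estimating equation. A Newton step for solving E[psi(theta)] = 0 would require the Jacobian of psi; a damped fixed-point iteration needs only psi itself. The one-dimensional example below (psi(y, theta) = y - theta, whose solution is the sample mean) is a deliberately trivial stand-in for the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=500)  # made-up observations

# Solve mean(y - theta) = 0 by damped fixed-point iteration: no Jacobian,
# just repeated evaluation of the estimating function itself.
theta = 0.0
for _ in range(100):
    theta = theta + 0.5 * np.mean(y - theta)

# theta converges to the sample mean (contraction factor 0.5 per step)
```

Each step multiplies the error by 0.5, so the iteration converges geometrically; in higher dimensions the same structure avoids estimating and inverting a Jacobian at every candidate split.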
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)