Resilience and Mental-Health Symptoms in ICU Healthcare Professionals Facing Repeated COVID-19 Waves
Elie Azoulay
Frédéric Pochard
Laurent Argaud
Alain Cariou
Raphael Clere-Jehl
Olivier Guisset
Vincent Labbé
Fabienne Tamion
Fabrice Bruneel
Mercé Jourdain
Danielle Reuter
Kada Klouche
Achille Kouatchet
Virginie Souppart
Alexandre Lautrette
Julien Bohé
Antoine Vieillard-Baron
Jean Dellamonica
Laurent Papazian
Jean Reignier, et al.
François Barbier
Nancy Kentish-Barnes
SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
Bac Nguyen
Stefan Uhlich
Fabien Cardinaux
Lukas Mauch
Marzieh Edraki
Handling distribution shifts from training data, known as out-of-distribution (OOD) generalization, poses a significant challenge in the field of machine learning. While a pre-trained vision-language model like CLIP has demonstrated remarkable zero-shot performance, further adaptation of the model to downstream tasks leads to undesirable degradation on OOD data. In this work, we introduce Sparse Adaptation for Fine-Tuning (SAFT), a method that prevents fine-tuning from forgetting the general knowledge in the pre-trained model. SAFT updates only a small subset of important parameters whose gradient magnitude is large, while keeping the other parameters frozen. SAFT is straightforward to implement and conceptually simple. Extensive experiments show that with only 0.1% of the model parameters, SAFT can significantly improve the performance of CLIP. It consistently outperforms baseline methods across several benchmarks. On the few-shot learning benchmark of ImageNet and its variants, SAFT gives a gain of 5.15% on average over conventional fine-tuning in OOD settings.
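As a rough illustration of the parameter-selection idea described in the abstract, here is a minimal PyTorch-style sketch (not the authors' released code): gradients are accumulated on a few downstream batches, a global magnitude threshold keeps roughly 0.1% of parameters trainable, and all other gradients are zeroed before each optimizer step. The mask construction and the keep_ratio value are illustrative assumptions.

```python
import torch

def build_saft_masks(model, loss_fn, data_loader, keep_ratio=0.001):
    """Mark the top `keep_ratio` fraction of parameters by gradient magnitude.

    Illustrative sketch: gradients are accumulated on a few downstream batches,
    then a global threshold keeps roughly 0.1% of all parameters trainable.
    """
    model.zero_grad()
    for inputs, targets in data_loader:             # a few few-shot batches
        loss_fn(model(inputs), targets).backward()  # accumulate gradients

    all_grads = torch.cat([p.grad.abs().flatten()
                           for p in model.parameters() if p.grad is not None])
    k = max(1, int(keep_ratio * all_grads.numel()))
    threshold = torch.topk(all_grads, k).values.min()

    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            masks[name] = torch.zeros_like(p, dtype=torch.bool)
        else:
            masks[name] = p.grad.abs() >= threshold  # True = allowed to change
    return masks

def apply_saft_step(model, masks, optimizer):
    """Zero out gradients of frozen parameters, then take an optimizer step."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[name].to(p.grad.dtype))
    optimizer.step()
    optimizer.zero_grad()
```

In this sketch only the masked ~0.1% of weights ever move during fine-tuning, which is the property the abstract credits with preserving the pre-trained model's general knowledge.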
Scaling Laws Do Not Scale
Michael Madaio
Recent work has proposed a power-law relationship, referred to as "scaling laws," between the performance of artificial intelligence (AI) models and aspects of those models' design (e.g., dataset size). In other words, as the size of a dataset (or the number of model parameters, etc.) increases, the performance of a model trained on that dataset will correspondingly increase. However, while compelling in the aggregate, this scaling-law relationship overlooks the ways that the metrics used to measure performance may be precarious and contested, or may not correspond with how different groups of people perceive the quality of models' output. In this paper, we argue that as the size of the datasets used to train large AI models grows, the number of distinct communities (including demographic groups) whose data are included in a given dataset is likely to grow, each of which may have different values. As a result, there is an increased risk that communities represented in a dataset may have values or preferences that are not captured by (or, in the worst case, are at odds with) the metrics used to evaluate model performance for scaling laws. We end the paper with implications for AI scaling laws: models may not, in fact, continue to improve as datasets get larger, at least not for all people or communities impacted by those models.
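For context, the power-law relationship the paper critiques is usually written with an aggregate loss falling as dataset size grows; the form below follows common scaling-law formulations in the literature rather than anything specific to this paper, and the constants N_c and alpha are placeholders.

```latex
% A commonly cited power-law form (illustrative; constants vary by study):
% test loss L as a function of dataset size N, with exponent \alpha > 0
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha}
```

The paper's argument is that even if this aggregate curve keeps falling as N grows, a single metric L can hide divergent outcomes for the communities whose data make up the larger dataset.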
SCIsegV2: A Universal Tool for Segmentation of Intramedullary Lesions in Spinal Cord Injury
Enamundram Naga Karthik
Jan Valošek
Lynn Farner
Dario Pfyffer
Simon Schading-Sassenhausen
Anna Lebret
Gergely David
Andrew C. Smith
Kenneth A. Weber
Maryam Seif
RHSCIR Network Imaging Group
Patrick Freund
Scope Ambiguities in Large Language Models
Gaurav Kamath
Sebastian Schuster
Sowmya Vajjala
Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Generation
Guillaume Huguet
James Vuckovic
Kilian Fatras
Eric Laufer
Pablo Lemos
Riashat Islam
Cheng-Hao Liu
Jarrid Rector-Brooks
Tara Akhound-Sadegh
Michael M. Bronstein
Alexander Tong
Sharpness-Aware Minimization Scaled by Outlier Normalization for Robust DNNs on In-Memory Computing Accelerators
Sébastien Henwood
Goncalo Mordido
Yvon Savaria
François Leduc-Primeau
Many deep neural network (DNN) models consume a significant amount of energy at inference time, in large part due to the energy consumed by memory access. In-memory computing addresses this problem by eliminating many memory accesses, but exposes model weights to noise and circuit variations. While several methods have been proposed to train DNNs robust to weight noise, they typically require knowledge of the noise distribution or degrade DNN performance in the noiseless setting. In this work, we first show that applying sharpness-aware training, by optimizing for both the loss value and the loss sharpness, significantly improves robustness to noisy weights at inference time. We then propose a new adaptive sharpness-aware method that conditions the worst-case perturbation of a given weight not only on its magnitude but also on the range of the weight distribution. This is achieved by performing sharpness-aware minimization scaled by outlier normalization (SAMSON). Results on computer-vision benchmarks show that SAMSON increases model robustness to noisy weights without compromising generalization performance in noiseless regimes.
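A minimal sketch of the kind of update described above, assuming a SAM-style two-step procedure in which each weight's perturbation radius is scaled by its magnitude normalized by the layer's maximum absolute weight; the exact outlier normalization used by SAMSON may differ from this illustrative choice.

```python
import torch

def samson_step(model, loss_fn, inputs, targets, optimizer, rho=0.05):
    """One SAM-style update where the per-weight perturbation is scaled by the
    weight's magnitude normalized by its tensor's max absolute value.

    Illustrative sketch only: the scaling scale_w = |w| / max|w| is an assumed
    stand-in for the paper's outlier normalization.
    """
    # 1) gradients at the current weights
    loss_fn(model(inputs), targets).backward()

    # 2) ascend to the (scaled) worst-case nearby weights
    eps = {}
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()
                                        if p.grad is not None]))
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            scale = p.abs() / (p.abs().max() + 1e-12)   # assumed normalization
            e = rho * scale * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps[p] = e
    model.zero_grad()

    # 3) gradient at the perturbed point, then restore weights and step
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```

The design intent, per the abstract, is that weights sitting far out in a layer's distribution receive larger worst-case perturbations during training, making the learned minimum flatter exactly where in-memory noise hurts most.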
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
Hannah Liu
Xiaoyu Shen
Nikita Vassilyev
Jesujoba Oluwadara Alabi
Yanke Mao
Haonan Gao
Annie En-Shiun Lee
Simulation-Free Schrödinger Bridges via Score and Flow Matching
Alexander Tong
Nikolay Malkin
Kilian Fatras
Lazar Atanackovic
Yanlei Zhang
Guillaume Huguet
We present simulation-free score and flow matching ([SF]…
Simultaneous linear connectivity of neural networks modulo permutation
Ekansh Sharma
Devin Kwok
Tom Denton
Daniel M. Roy
softmax is not enough (for sharp out-of-distribution)
Petar Veličković
Christos Perivolaropoulos
Federico Barbero
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
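A toy illustration of the dispersion effect described above (not the paper's proof or its adaptive-temperature rule): with one logit a fixed gap above the rest, the softmax weight on that "max" item shrinks toward zero as the number of items grows, and dividing logits by a temperature below one re-sharpens the distribution at inference time.

```python
import torch

def softmax_max_weight(n_items, logit_gap=2.0):
    """Weight softmax assigns to the single 'max key' when it beats the other
    n_items - 1 logits by a fixed gap; this weight decays toward 0 as n grows,
    which is the dispersion effect described in the abstract."""
    logits = torch.zeros(n_items)
    logits[0] = logit_gap
    return torch.softmax(logits, dim=0)[0].item()

for n in (8, 64, 512, 4096):
    print(n, round(softmax_max_weight(n), 4))  # weight on the max item shrinks

def sharpened(logits, temperature):
    """Lowering the temperature (T < 1) re-sharpens softmax at inference."""
    return torch.softmax(logits / temperature, dim=0)
```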
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Ankit Vani
Bac Nguyen
Samuel Lavoie
Ranjay Krishna
Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using SPARO, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual SPARO concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of SPARO's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
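A minimal sketch of a read-out of the kind described above, assuming each slot is produced by its own single-head attention with a learned query over the backbone's token sequence; the dimensions, number of slots, and output projection are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SlotReadout(nn.Module):
    """Illustrative read-out in the spirit of the abstract: each of n_slots
    'concepts' is produced by a single attention head attending over the
    backbone's token outputs."""
    def __init__(self, d_model=768, n_slots=16, d_slot=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.heads = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            for _ in range(n_slots)
        )
        self.proj = nn.Linear(d_model, d_slot)

    def forward(self, tokens):                      # tokens: (B, L, d_model)
        B = tokens.size(0)
        slots = []
        for i, head in enumerate(self.heads):
            q = self.queries[i].expand(B, 1, -1)    # one learned query per slot
            out, _ = head(q, tokens, tokens)        # single-head attention
            slots.append(self.proj(out.squeeze(1)))
        return torch.stack(slots, dim=1)            # (B, n_slots, d_slot)
```

The stacked slots replace the usual single pooled embedding, so downstream heads can inspect, drop, or intervene on individual concepts, which is the kind of intervention the abstract reports for SugarCrepe.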