Publications

Scaling Laws Do Not Scale

Michael Madaio

Recent work has proposed a power law relationship, referred to as ``scaling laws,'' between the performance of artificial intelligence (AI) … (see more)models and aspects of those models' design (e.g., dataset size). In other words, as the size of a dataset (or model parameters, etc) increases, the performance of a given model trained on that dataset will correspondingly increase. However, while compelling in the aggregate, this scaling law relationship overlooks the ways that metrics used to measure performance may be precarious and contested, or may not correspond with how different groups of people may perceive the quality of models' output. In this paper, we argue that as the size of datasets used to train large AI models grows, the number of distinct communities (including demographic groups) whose data is included in a given dataset is likely to grow, each of whom may have different values. As a result, there is an increased risk that communities represented in a dataset may have values or preferences not captured by (or in the worst case, at odds with) the metrics used to evaluate model performance for scaling laws. We end the paper with implications for AI scaling laws -- that models may not, in fact, continue to improve as the datasets get larger -- at least not for all people or communities impacted by those models.

2024-01-01

AIES (1) (published)

doi.org

arxiv.org

SCIsegV2: A Universal Tool for Segmentation of Intramedullary Lesions in Spinal Cord Injury

Enamundram Naga Karthik

Jan Valošek

Lynn Farner

Dario Pfyffer

Simon Schading-Sassenhausen

Anna Lebret

Gergely David

Andrew C. Smith

Kenneth A. Weber

Maryam Seif

Rhscir Network Imaging Group

Patrick Freund

Julien Cohen-Adad

2024-01-01

AMAI@MICCAI (published)

doi.org

arxiv.org

Scope Ambiguities in Large Language Models

Gaurav Kamath

Sebastian Schuster

Sowmya Vajjala

Siva Reddy

2024-01-01

Trans. Assoc. Comput. Linguistics (published)

doi.org

arxiv.org

Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Generation.

Guillaume Huguet

James Vuckovic

Kilian FATRAS

Eric Laufer

Pablo Lemos

Riashat Islam

Cheng-Hao Liu

Jarrid Rector-Brooks

Tara Akhound-Sadegh

Michael M. Bronstein

Alexander Tong

Joey Bose

2024-01-01

Neural Information Processing Systems (published)

dblp.uni-trier.de

Sharpness-Aware Minimization Scaled by Outlier Normalization for Robust DNNs on In-Memory Computing Accelerators

Sébastien Henwood

Goncalo Mordido

Yvon Savaria

Sarath Chandar

François Leduc-Primeau

Many deep neural network (DNN) models consume a significant amount of energy at inference time, in large part due to energy consumed by memo… (see more)ry access. In-memory computing addresses this problem by eliminating many memory accesses, but exposes model weights to noise and circuit variations. While several methods have been proposed to train DNNs robust to weight noise they typically require knowledge of the noise distribution, or degrade the DNN performance in noiseless setting. In this work, we first show that applying sharpness-aware training, by optimizing for both the loss value and loss sharpness, significantly improves robustness to noisy weights at inference time. Then, we propose a new adaptive sharpness-aware method that conditions the worst-case perturbation of a given weight not only on its magnitude but also on the range of the weight distribution. This is achieved by performing sharpness-aware minimization scaled by outlier normalization (SAMSON). Results on computer-vision benchmarks show that SAMSON increases model robustness to noisy weights without compromising generalization performance in noiseless regimes.

2024-01-01

IEEECONF (published)

doi.org

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

David Ifeoluwa Adelani

Hannah Liu

Xiaoyu Shen

Nikita Vassilyev

Jesujoba Oluwadara Alabi

Yanke Mao

Haonan Gao

Annie En-Shiun Lee

2024-01-01

EACL (1) (published)

doi.org

arxiv.org

Simulation-Free Schrödinger Bridges via Score and Flow Matching

Alexander Tong

Nikolay Malkin

Kilian FATRAS

Lazar Atanackovic

Yanlei Zhang

Guillaume Huguet

Guy Wolf

Yoshua Bengio

We present simulation-free score and flow matching ([SF]…

2024-01-01

AISTATS (published)

doi.org

openreview.net

Simultaneous linear connectivity of neural networks modulo permutation

Ekansh Sharma

Devin Kwok

Tom Denton

Daniel M. Roy

David Rolnick

Gintare Karolina Dziugaite

2024-01-01

ECML/PKDD (7) (published)

doi.org

arxiv.org

softmax is not enough (for sharp out-of-distribution)

Petar Veličković

Christos Perivolaropoulos

Federico Barbero

Razvan Pascanu

A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier o… (see more)f sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.

2024-01-01

arXiv.org (preprint)

doi.org

arxiv.org

SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Ankit Vani

Bac Nguyen

Samuel Lavoie

Ranjay Krishna

Aaron Courville

Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception al… (see more)lows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using SPARO, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual SPARO concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of SPARO's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.

2024-01-01

ECCV (66) (published)

doi.org

arxiv.org

Stochastic Frank-Wolfe: Unified Analysis and Zoo of Special Cases

Ruslan Nazykov

Aleksandr Shestakov

Vladimir Solodkin

Aleksandr Beznosikov

Gauthier Gidel

Alexander Gasnikov

The Conditional Gradient (or Frank-Wolfe) method is one of the most well-known methods for solving constrained optimization problems appeari… (see more)ng in various machine learning tasks. The simplicity of iteration and applicability to many practical problems helped the method to gain popularity in the community. In recent years, the Frank-Wolfe algorithm received many different extensions, including stochastic modifications with variance reduction and coordinate sampling for training of huge models or distributed variants for big data problems. In this paper, we present a unified convergence analysis of the Stochastic Frank-Wolfe method that covers a large number of particular practical cases that may have completely different nature of stochasticity, intuitions and application areas. Our analysis is based on a key parametric assumption on the variance of the stochastic gradients. But unlike most works on unified analysis of other methods, such as SGD, we do not assume an unbiasedness of the real gradient estimation. We conduct analysis for convex and non-convex problems due to the popularity of both cases in machine learning. With this general theoretical framework, we not only cover rates of many known methods, but also develop numerous new methods. This shows the flexibility of our approach in developing new algorithms based on the Conditional Gradient approach. We also demonstrate the properties of the new methods through numerical experiments.

2024-01-01

International Conference on Artificial Intelligence and Statistics (published)

proceedings.mlr.press

arxiv.org

Stochastic Simulated Quantum Annealing for Fast Solution of Combinatorial Optimization Problems

Naoya Onizawa

Ryoma Sasaki

Duckgyu Shin

Warren Gross

Takahiro Hanyu

In this paper, we introduce stochastic simulated quantum annealing (SSQA) for large-scale combinatorial optimization problems. SSQA is desig… (see more)ned based on stochastic computing and quantum Monte Carlo, which can simulate quantum annealing (QA) by using multiple replicas of spins (probabilistic bits) in classical computing. The use of stochastic computing leads to an efficient parallel spin-state update algorithm, enabling quick search for a solution around the global minimum energy. Therefore, SSQA realizes quantum-like annealing for large-scale problems and can handle fully connected models in combinatorial optimization, unlike QA. The proposed method is evaluated in MATLAB on graph isomorphism problems, which are typical combinatorial optimization problems. The proposed method achieves a convergence speed an order of magnitude faster than a conventional stochastic simulaated annealing method. Additionally, it can handle a 100-times larger problem size compared to QA and a 25-times larger problem size compared to a traditional SA method, respectively, for similar convergence probabilities.

2024-01-01

IEEE Access (published)

doi.org

arxiv.org

AI Advantage

Leveraging AI for a Sustainable Future

Mila AI Policy Fellowship

AI Advantage

Leveraging AI for a Sustainable Future

Publications

AI Advantage

Leveraging AI for a Sustainable Future

Mila AI Policy Fellowship

AI Advantage

Leveraging AI for a Sustainable Future

Popular keywords:

Publications