Publications

Evaluating Generative AI Systems is a Social Science Measurement Challenge

Hanna Wallach

Meera Desai

Nicholas Pangakis

A. Feder Cooper

Angelina Wang

Solon Barocas

Alexandra Chouldechova

Chad Atalla

Su Lin Blodgett

Emily Corvi

P. A. Dow

Jean Garcia-Gathright

Alexandra Olteanu

Stefanie Reed

Emily Sheng

Dan Vann

Jennifer Wortman Vaughan

Matthew Vogel

Hannah Washington

Abigail Z. Jacobs (and 1 more author)

Microsoft Research

Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.

2024-11-01

arXiv (published)

doi.org

arxiv.org

GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

Nizar Islah

Justine Gehring

Diganta Misra

Eilif Benjamin Muller

Irina Rish

Terry Yue Zhuo

Massimo Caccia

2024-11-01

arXiv (published)

doi.org

arxiv.org

Imagining and building wise machines: The centrality of AI metacognition

Samuel G. B. Johnson

Amir-Hossein Karimi

Yoshua Bengio

Nick Chater

Tobias Gerstenberg

Kate Larson

Sydney Levine

Melanie Mitchell

Iyad Rahwan

Bernhard Schölkopf

Igor Grossmann

2024-11-01

arXiv (published)

doi.org

arxiv.org

Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study

Doriane Olewicki

Leuson Da Silva

Suhaib Mujahid

Arezou Amini

Benjamin Mah

Marco Castelluccio

Sarra Habchi

Foutse Khomh

Bram Adams

2024-11-01

arXiv (published)

doi.org

arxiv.org

Predictive Modeling of Body Image Dissatisfaction in People With Type 1 Diabetes

Courtney South

Shahryar Ebrahimi

Anne-Sophie Brazeau

Maria Cutumisu

2024-11-01

Canadian Journal of Diabetes (published)

doi.org

A protocol for trustworthy EEG decoding with neural networks

Davide Borra

Elisa Magosso

Mirco Ravanelli

2024-11-01

Neural Networks (published)

doi.org

Soft Condorcet Optimization for Ranking of General Agents

Marc Lanctot

Kate Larson

Michael Kaisers

Quentin Berthet

Ian Gemp

Manfred Diaz

Roberto-Rafael Maura-Rivero

Yoram Bachrach

Anna Koop

Doina Precup

A common way to drive progress of AI models and agents is to compare their performance on standardized benchmarks. Comparing the performance of general agents requires aggregating their individual performances across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet's original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.

2024-11-01

arXiv (published)

doi.org

arxiv.org

SpeechBrain-MOABB: An open-source Python library for benchmarking deep neural networks applied to EEG signals

Davide Borra

Francesco Paissan

Mirco Ravanelli

2024-11-01

Computers in Biology and Medicine (published)

doi.org

Tracing Optimization for Performance Modeling and Regression Detection

Kaveh Shahedi

Heng Li

Maxime Lamothe

Foutse Khomh

Software performance modeling plays a crucial role in developing and maintaining software systems. A performance model analytically describes the relationship between the performance of a system and its runtime activities. This process typically examines various aspects of a system's runtime behavior, such as the execution frequency of functions or methods, to forecast performance metrics like program execution time. By using performance models, developers can predict expected performance and thereby effectively identify and address unexpected performance regressions when actual performance deviates from the model's predictions. One common and precise method for capturing performance behavior is software tracing, which involves instrumenting the execution of a program, either at the kernel level (e.g., system calls) or application level (e.g., function calls). However, due to the nature of tracing, it can be highly resource-intensive, making it impractical for production environments where resources are limited. In this work, we propose statistical approaches to reduce tracing overhead by identifying and excluding performance-insensitive code regions, particularly application-level functions, from tracing while still building accurate performance models that can capture performance degradations. By selecting an optimal set of functions to be traced, we can construct optimized performance models that achieve an R² score of up to 99% and, sometimes, outperform full tracing models (models using non-optimized tracing data), while significantly reducing the tracing overhead by more than 80% in most cases. Our optimized performance models can also capture performance regressions in our studied programs effectively, demonstrating their usefulness in real-world scenarios. Our approach is fully automated, making it ready to be used in production environments with minimal human effort.

2024-11-01

arXiv (published)

doi.org

arxiv.org

Unsupervised Object Discovery: A Comprehensive Survey and Unified Taxonomy

José-Fabian Villa-Vásquez

Marco Pedersoli

Unsupervised object discovery is commonly interpreted as the task of localizing and/or categorizing objects in visual data without the need for labeled examples. While current object recognition methods have proven highly effective for practical applications, the ongoing demand for annotated data in real-world scenarios drives research into unsupervised approaches. Furthermore, existing literature in object discovery is both extensive and diverse, posing a significant challenge for researchers that aim to navigate and synthesize this knowledge. Motivated by the evidenced interest in this avenue of research, and the lack of comprehensive studies that could facilitate a holistic understanding of unsupervised object discovery, this survey conducts an in-depth exploration of the existing approaches and systematically categorizes this compendium based on the tasks addressed and the families of techniques employed. Additionally, we present an overview of common datasets and metrics, highlighting the challenges of comparing methods due to varying evaluation protocols. This work intends to provide practitioners with an insightful perspective on the domain, with the hope of inspiring new ideas and fostering a deeper understanding of object discovery approaches.

2024-11-01

arXiv (published)

doi.org

arxiv.org

Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation

Shambhavi Mishra

Julio Silva-Rodríguez

Ismail Ben Ayed

Marco Pedersoli

Jose Dolz

Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we explore how to efficiently leverage class text information to mitigate these distribution drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem, which is efficiently solved with Optimal Transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple template knowledge distillation approach, which replicates multi-view contrastive learning strategies in unsupervised representation learning but without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of CLIP-OT, achieving performance gains of up to 7% over recent state-of-the-art methods, yet being computationally and memory efficient.

2024-11-01

arXiv (published)

doi.org

arxiv.org
