Publications

Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML
Prakhar Ganesh
Usman Gohar
Lu Cheng
With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in the fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and on trends that guide the selection of appropriate algorithms.
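The kind of sensitivity the abstract describes is easy to reproduce: the same model family, evaluated across random seeds and one regularization hyperparameter, yields a wide spread of fairness scores. Below is a minimal sketch using scikit-learn and fairlearn; the synthetic data, seed range, and hyperparameter grid are illustrative assumptions, not the paper's experimental setup.

# Sketch: how fairness scores vary with seeds and hyperparameters.
# The synthetic data stands in for a real benchmark dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from fairlearn.metrics import demographic_parity_difference

rng = np.random.default_rng(0)
n = 2000
sensitive = rng.integers(0, 2, n)                      # binary group membership
X = rng.normal(size=(n, 5)) + sensitive[:, None] * 0.3
y = (X[:, 0] + 0.5 * sensitive + rng.normal(size=n) > 0).astype(int)

scores = []
for seed in range(10):                                 # vary the random seed
    for C in (0.01, 0.1, 1.0, 10.0):                   # vary a hyperparameter
        X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
            X, y, sensitive, test_size=0.3, random_state=seed
        )
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
        dpd = demographic_parity_difference(
            y_te, clf.predict(X_te), sensitive_features=s_te
        )
        scores.append(dpd)

print(f"demographic parity difference: min={min(scores):.3f}, "
      f"max={max(scores):.3f}, spread={max(scores) - min(scores):.3f}")

Reporting only a single cell of this grid, rather than the spread, is exactly the evaluation choice the paper argues can manufacture an apparent winner.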
Evaluating Generative AI Systems is a Social Science Measurement Challenge
Hanna Wallach
Meera Desai
Nicholas Pangakis
A. F. Cooper
Angelina Wang
Solon Barocas
Alexandra Chouldechova
Chad Atalla
Su Lin Blodgett
Emily Corvi
P. A. Dow
Jean Garcia-Gathright
Stefanie Reed
Emily Sheng
Dan Vann
Jennifer Wortman Vaughan
Matthew Vogel
Hannah Washington
Abigail Z. Jacobs …
Microsoft Research
Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
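To make the four levels concrete, here is a small sketch that models them as plain data structures. The example concept ("toxicity") and the keyword-based instrument are invented placeholders to show how the levels chain together; they are not taken from the paper.

# Sketch: the framework's four measurement levels as dataclasses.
# The "toxicity" example below is a hypothetical illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BackgroundConcept:            # broad, often contested notion
    name: str

@dataclass
class SystematizedConcept:          # explicit working definition
    background: BackgroundConcept
    definition: str

@dataclass
class MeasurementInstrument:        # procedure that produces scores
    concept: SystematizedConcept
    measure: Callable[[str], float]

background = BackgroundConcept("toxicity")
systematized = SystematizedConcept(
    background, "output contains a slur or a threat directed at a person"
)
instrument = MeasurementInstrument(
    systematized, measure=lambda text: float("slur" in text.lower())
)

# Instance-level measurements: applying the instrument to system outputs.
outputs = ["hello there", "that word is a slur"]
measurements = [instrument.measure(o) for o in outputs]
print(measurements)  # [0.0, 1.0]

The point of the explicit middle layer is visible even in this toy: debating whether the definition captures "toxicity" (a conceptual debate) is separated from debating whether the keyword check implements the definition (an operational one).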
Towards AI-designed genomes using a variational autoencoder
N.K. Dudek
Genomes encode elaborate networks of genes whose products must seamlessly interact to support living organisms. Humans’ capacity to understand these biological systems is limited by their sheer size and complexity. In this work, we develop a proof of concept framework for training a machine learning algorithm to model bacterial genome composition. To achieve this, we create simplified representations of genomes in the form of binary vectors that indicate the encoded genes, henceforth referred to as genome vectors. A denoising variational autoencoder was trained to accept corrupted genome vectors, in which most genes had been masked, and reconstruct the original. The resulting model, DeepGenomeVector, effectively captures complex dependencies in genomic networks, as evaluated by both qualitative and quantitative metrics. An in-depth functional analysis of a generated genome vector shows that its encoded pathways are interconnected, near complete, and ecologically cohesive. On the test set, where the model’s ability to reconstruct uncorrupted genome vectors was evaluated, AUC and F1 scores of 0.98 and 0.83, respectively, support the model’s strong performance. This work showcases the power of machine learning approaches for synthetic biology and highlights the possibility that AI agents may one day be able to design genomes that animate carbon-based cells.
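The core training idea (mask most genes in a binary genome vector, then reconstruct the original) fits in a few lines of PyTorch. The sketch below is a minimal stand-in for DeepGenomeVector: vector length, layer sizes, the masking rate, and the loss weighting are illustrative assumptions, not the paper's settings.

# Sketch: a denoising VAE over binary "genome vectors". The corruption
# masks most encoded genes (sets them to 0); the model is trained to
# reconstruct the uncorrupted vector. Dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_GENES = 1000   # length of the binary genome vector (placeholder)

class DenoisingVAE(nn.Module):
    def __init__(self, n_genes=N_GENES, latent=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)
        self.logvar = nn.Linear(256, latent)
        self.dec = nn.Sequential(
            nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, n_genes)
        )

    def forward(self, x_corrupt):
        h = self.enc(x_corrupt)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparam.
        return self.dec(z), mu, logvar

model = DenoisingVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = (torch.rand(32, N_GENES) < 0.1).float()     # toy binary genome vectors
mask = (torch.rand_like(x) > 0.8).float()       # keep ~20% of genes
logits, mu, logvar = model(x * mask)

recon = F.binary_cross_entropy_with_logits(logits, x)  # target: original x
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + kl
opt.zero_grad(); loss.backward(); opt.step()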
IntentGPT: Few-shot Intent Discovery with Large Language Models
Juan A. Rodriguez
Nicholas Botzer
David Vazquez
Marco Pedersoli
Issam Hadj Laradji
In today's digitally driven world, dialogue systems play a pivotal role in enhancing user interactions, from customer service to virtual assistants. In these dialogues, it is important to identify users' goals automatically to resolve their needs promptly. This has necessitated the integration of models that perform Intent Detection. However, users' intents are diverse and dynamic, making it challenging to maintain a fixed set of predefined intents. As a result, a more practical approach is to develop a model capable of identifying new intents as they emerge. We address the challenge of Intent Discovery, an area that has drawn significant attention in recent research efforts. Existing methods need to train on a substantial amount of data to correctly identify new intents, demanding significant human effort. To overcome this, we introduce IntentGPT, a novel training-free method that effectively prompts Large Language Models (LLMs) such as GPT-4 to discover new intents with minimal labeled data. IntentGPT comprises an In-Context Prompt Generator, which generates informative prompts for In-Context Learning; an Intent Predictor for classifying and discovering user intents from utterances; and a Semantic Few-Shot Sampler that selects relevant few-shot examples and a set of known intents to be injected into the prompt. Our experiments on popular benchmarks, including CLINC and BANKING, show that IntentGPT outperforms previous methods that require extensive domain-specific data and fine-tuning.
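A schematic sketch of the loop the abstract describes: rank labeled examples by embedding similarity, inject them and the known intents into a prompt, and ask an LLM to classify the utterance or name a new intent. The `complete` stub, the embedding model, and the example intents are placeholders, not the authors' implementation.

# Sketch: an IntentGPT-style, training-free intent-discovery loop.
# `complete` is a stub for any chat-LLM call (e.g. a GPT-4 client);
# sentence-transformers is used only to rank few-shot examples.
import numpy as np
from sentence_transformers import SentenceTransformer

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

embedder = SentenceTransformer("all-MiniLM-L6-v2")
known_intents = ["check_balance", "transfer_money"]      # toy intent set
labeled = [("what's my balance?", "check_balance"),
           ("send $50 to Bob", "transfer_money")]

def discover_intent(utterance: str, k: int = 2) -> str:
    # Semantic few-shot sampling: pick the k most similar labeled examples.
    q = embedder.encode([utterance])[0]
    ex = embedder.encode([u for u, _ in labeled])
    top = np.argsort(-(ex @ q))[:k]
    shots = "\n".join(f"Utterance: {labeled[i][0]}\nIntent: {labeled[i][1]}"
                      for i in top)
    # In-context prompt: known intents + retrieved examples + query.
    prompt = (
        f"Known intents: {', '.join(known_intents)}\n\n{shots}\n\n"
        f"Utterance: {utterance}\n"
        f"Intent (reuse a known intent or name a new one):"
    )
    intent = complete(prompt).strip()
    if intent not in known_intents:      # discovery: grow the intent set
        known_intents.append(intent)
    return intent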
Towards a General Recipe for Combinatorial Optimization with Multi-Filter GNNs
Frederik Wenkel
Semih Cantürk
Stefan Horoi
Michael Perlmutter
UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs
Shenyang Huang
Farimah Poursafaei
Emanuele Rossi
EDAI Framework for Integrating Equity, Diversity, and Inclusion Throughout the Lifecycle of AI to Improve Health and Oral Health Care: Qualitative Study
Richa Shrivastava
Anita Brown-Johnson
Pascale Caidor
Claire Davies
Amal Idrissi Janati
Pascaline Kengne Talla
Sreenath Madathil
Bettina M Willie
Elham Emami
Outcomes of guidelines from health technology assessment organizations in community-based primary care: a systematic mixed studies review
Ashkan Baradaran
Raymond Tolentino
Roland Grad
Isabelle Ganache
Genevieve Gore
Pierre Pluye
Abstract 4142894: Multimorbidity Trajectories Across the Lifespan in Patients with Congenital Heart Disease
Chao Li
Aihua Liu
Solomon Bendayan
Liming Guo
Judith Therrien
Robyn Tamblyn
Jay Brophy
Ariane Marelli
Background: Benefiting from advances in medical care, patients with congenital heart disease (CHD) now survive to adulthood but face elevated risks of both cardiac and non-cardiac complications. Understanding the trajectories of comorbidity development over a patient's lifespan is a cornerstone of optimizing care and is expected to improve long-term health outcomes. Research Aim: This study aims to investigate the temporal sequences and evolution of comorbidities in CHD patients across their lifespan. We hypothesize that multimorbidity trajectories in CHD patients are linked to CHD lesion severity and age at onset of specific comorbidities. Methods: Using the Quebec CHD database, which comprises data on outpatient visits, hospitalization records, and vital status from 1983 to 2017, we designed a longitudinal cohort study evaluating the development of 39 comorbidities coded using ICD-9/10. Temporal sequences were mapped using median age of onset. Associations between disease pairs were quantified by hazard ratios from Cox proportional hazards models adjusting for age, sex, genetic syndrome, and competing risks of death, and taking into account the time-varying nature of the predictor diseases. Results: The cohort included 9,764 individuals with severe and 127,729 with non-severe CHD lesions. In severe CHD patients, most comorbidities developed between ages 25 and 40. Comorbidity progression began with childhood cardiovascular diseases, followed by systemic diseases such as diabetes, liver and kidney diseases, and advanced to heart failure and dementia in middle adulthood. In addition, mental disorders emerged in early adulthood and were associated with subsequent development of kidney diseases and dementia. Different trajectories were observed in non-severe CHD patients, with disease onsets 2-3 decades later and no differential onset between cardiovascular and systemic complications (Figure). Conclusions: Distinct multimorbidity trajectories were observed in CHD patients by CHD lesion severity. In patients with severe CHD lesions, early systemic diseases significantly influenced subsequent complications. These findings highlight the need for well-timed surveillance guidelines and interventions to improve health outcomes.
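For readers unfamiliar with Cox models over time-varying predictors, here is a sketch of the model class the Methods section describes, using lifelines' CoxTimeVaryingFitter. The long-format dataframe, its column names, and the covariates are toy assumptions to show the schema, not the Quebec CHD database.

# Sketch: hazard ratio for a disease pair with a time-varying predictor.
# One row per (patient, interval); `predictor` flips to 1 once the earlier
# disease (e.g. a mental disorder) is diagnosed, and `event` flags onset
# of the later disease (e.g. kidney disease). Toy data, assumed schema.
import pandas as pd
from lifelines import CoxTimeVaryingFitter

df = pd.DataFrame({
    "id":        [1, 1, 2, 3, 3, 4, 5, 5, 6, 7],
    "start":     [0, 5, 0, 0, 3, 0, 0, 4, 0, 0],
    "stop":      [5, 12, 8, 3, 10, 6, 4, 9, 7, 11],
    "predictor": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "age":       [30, 30, 45, 50, 50, 38, 42, 42, 55, 29],
    "sex":       [0, 0, 1, 1, 1, 0, 1, 1, 0, 0],
    "event":     [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
})

ctv = CoxTimeVaryingFitter(penalizer=0.1)   # penalizer stabilizes toy fit
ctv.fit(df, id_col="id", start_col="start", stop_col="stop",
        event_col="event")
ctv.print_summary()   # exp(coef) for `predictor` is the hazard ratio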
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni
Jonathan Colacco-Carr
Yash More
Jackie Ck Cheung
In an effort to mitigate the harms of large language models (LLMs), learning from human feedback (LHF) has been used to steer LLMs towards outputs that are intended to be both less harmful and more helpful. Despite the widespread adoption of LHF in practice, the quality of this feedback and its effectiveness as a safety mitigation technique remain unclear. This study addresses these issues by auditing the widely-used Helpful and Harmless (HH) dataset by Anthropic. Our work includes: (1) a thorough investigation of the dataset's content through both manual and automated evaluation; (2) experiments demonstrating the dataset's impact on models' safety; and (3) an analysis of the 100 most influential papers citing this dataset. Through our audit, we showcase how conceptualization failures and quality issues identified in the HH dataset can create additional harms by leading to disparate safety behaviors across demographic groups. Our findings highlight the need for more nuanced, context-sensitive approaches to safety mitigation in LLMs.
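The automated side of such an audit can start very simply: load the public HH release and scan the preferred ("chosen") transcripts for patterns of interest. The sketch below uses the dataset's public Hugging Face ID; the keyword heuristic is a deliberately crude placeholder, not the paper's evaluation pipeline.

# Sketch: a crude automated pass over the Helpful and Harmless dataset.
# "Anthropic/hh-rlhf" is the public Hugging Face release; each example
# has "chosen" and "rejected" conversation transcripts.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")

flags = Counter()
keywords = ("i can't help", "i cannot help", "as an ai")  # refusal markers
for ex in ds.select(range(1000)):        # audit a small sample
    text = ex["chosen"].lower()
    for kw in keywords:
        if kw in text:
            flags[kw] += 1

print(flags)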