Portrait of Alexandra Olteanu

Alexandra Olteanu

Associate Industry Member
Principal Researcher and founding members of the FATE Montréal Team, Microsoft Research, Montréal
Research Topics
Information Retrieval
Natural Language Processing

Publications

Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems
Emma Harvey
Emily Sheng
Su Lin Blodgett
Alexandra Chouldechova
Jean Garcia-Gathright
Hanna Wallach
To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has p… (see more)roduced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how these instruments could be improved. Via a series of semi-structured interviews with practitioners in a variety of roles in different organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms caused by LLM-based systems: (1) challenges related to using publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.
"It was 80% me, 20% AI": Seeking Authenticity in Co-Writing with Large Language Models
Angel Hsing-Chi Hwang
Q. V. Liao
Su Lin Blodgett
Adam Trischler
Evaluating Generative AI Systems is a Social Science Measurement Challenge
Hanna Wallach
Meera Desai
Nicholas Pangakis
A. F. Cooper
Angelina Wang
Solon Barocas
Alexandra Chouldechova
Chad Atalla
Su Lin Blodgett
Emily Corvi
P. A. Dow
Jean Garcia-Gathright
Stefanie Reed
Emily Sheng
Dan Vann
Jennifer Wortman Vaughan
Matthew Vogel
Hannah Washington
Abigail Z. Jacobs … (see 1 more)
Microsoft Research
Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI … (see more)(GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
"I Am the One and Only, Your Cyber BFF": Understanding the Impact of GenAI Requires Understanding the Impact of Anthropomorphic AI
Myra Cheng
Alicia DeVrio
Lisa Egede
Su Lin Blodgett
Many state-of-the-art generative AI (GenAI) systems are increasingly prone to anthropomorphic behaviors, i.e., to generating outputs that ar… (see more)e perceived to be human-like. While this has led to scholars increasingly raising concerns about possible negative impacts such anthropomorphic AI systems can give rise to, anthropomorphism in AI development, deployment, and use remains vastly overlooked, understudied, and underspecified. In this perspective, we argue that we cannot thoroughly map the social impacts of generative AI without mapping the social impacts of anthropomorphic AI, and outline a call to action.
Investigating Failures to Generalize for Coreference Resolution Models
Ian Porada
Kaheer Suleman
Adam Trischler
Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized -- i.e., how … (see more)the theoretical concept of coreference is operationalized in the dataset -- due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with existing differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This break down helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models; and, future work can explicitly account for those types of coreference that are empirically associated with poor generalization when developing models.
"One-Size-Fits-All"? Examining Expectations around What Constitute"Fair"or"Good"NLG System Behaviors
Li Lucy
Su Lin Blodgett
Milad Shokouhi
Hanna Wallach
Fairness-related assumptions about what constitute appropriate NLG system behaviors range from invariance, where systems are expected to beh… (see more)ave identically for social groups, to adaptation, where behaviors should instead vary across them. To illuminate tensions around invariance and adaptation, we conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs. Through these cases studies, we examine people's expectations of system behaviors, and surface potential caveats of these contrasting yet commonly held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; in contrast, motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around what constitute"fair"or"good"NLG system behaviors.
How different mental models of AI-based writing assistants impact writers’ interactions with them
Shalaleh Rismani
Su Lin Blodgett
Q. Vera Liao
ECBD: Evidence-Centered Benchmark Design for NLP
Yu Lu Liu
Su Lin Blodgett
Jackie Chi
Kit Cheung
Q. Vera Liao
Ziang Xiao
Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which dat… (see more)asets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark's measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices -- e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models
Emily M. Bender
Jeongrok Yu
Timnit Gebru
Seong Ug Kim
Angelina McMillan-642
Jacob Choi
Jinho D. Choi
Su Lin Blodgett
Solon Barocas
Hal Daumé III
Gilsinia Lopez
Robert Sim
Hanna Wallach. 2021
Stereotyp-657
Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based Masked Language Models (M… (see more)LMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is needed more than ever. While many studies have evaluated gender bias in English MLMs, only a few works have been conducted for the task in other languages. This paper proposes a multilingual approach to estimate gender bias in MLMs from 5 languages: Chinese, English, German, Portuguese, and Spanish. Unlike previous work, our approach does not depend on parallel corpora coupled with English to detect gender bias in other languages using multilingual lexicons. Moreover, a novel model-based method is presented to generate sentence pairs for a more robust analysis of gender bias, compared to the traditional lexicon-based method. For each language, both the lexicon-based and model-based methods are applied to create two datasets respectively, which are used to evaluate gender bias in an MLM specifically trained for that language using one existing and 3 new scoring metrics. Our results show that the previous approach is data-sensitive and not stable as it does not remove contextual dependencies irrelevant to gender. In fact, the results often flip when different scoring metrics are used on the same dataset, suggesting that gender bias should be studied on a large dataset using multiple evaluation metrics for best practice.
Responsible AI Research Needs Impact Statements Too
Michael Ekstrand
Carlos Castillo
Jina Suh
All types of research, development, and policy work can have unintended, adverse consequences - work in responsible artificial intelligence … (see more)(RAI), ethical AI, or ethics in AI is no exception.
Responsible AI Considerations in Text Summarization Research: A Review of Current Practices
Yu Lu Liu
Meng Cao
Su Lin Blodgett
Adam Trischler
AI and NLP publication venues have increasingly encouraged researchers to reflect on possible ethical considerations, adverse impacts, and o… (see more)ther responsible AI issues their work might engender. However, for specific NLP tasks our understanding of how prevalent such issues are, or when and why these issues are likely to arise, remains limited. Focusing on text summarization—a common NLP task largely overlooked by the responsible AI community—we examine research and reporting practices in the current literature. We conduct a multi-round qualitative analysis of 333 summarization papers from the ACL Anthology published between 2020–2022. We focus on how, which, and when responsible AI issues are covered, which relevant stakeholders are considered, and mismatches between stated and realized research goals. We also discuss current evaluation practices and consider how authors discuss the limitations of both prior work and their own work. Overall, we find that relatively few papers engage with possible stakeholders or contexts of use, which limits their consideration of potential downstream adverse impacts or other responsible AI issues. Based on our findings, we make recommendations on concrete practices and research directions.
Sensing Wellbeing in the Workplace, Why and For Whom? Envisioning Impacts with Organizational Stakeholders
Anna Kawakami
Shreya Chowdhary
Shamsi T. Iqbal
Q. Vera Liao
Jina Suh
Koustuv Saha
With the heightened digitization of the workplace, alongside the rise of remote and hybrid work prompted by the pandemic, there is growing c… (see more)orporate interest in using passive sensing technologies for workplace wellbeing. Existing research on these technologies often focus on understanding or improving interactions between an individual user and the technology. Workplace settings can, however, introduce a range of complexities that challenge the potential impact and in-practice desirability of wellbeing sensing technologies. Today, there is an inadequate empirical understanding of how everyday workers---including those who are impacted by, and impact the deployment of workplace technologies--envision its broader socio-ecological impacts. In this study, we conduct storyboard-driven interviews with 33 participants across three stakeholder groups: organizational governors, AI builders, and worker data subjects. Overall, our findings surface how workers envisioned wellbeing sensing technologies may lead to cascading impacts on their broader organizational culture, interpersonal relationships with colleagues, and individual day-to-day lives. Participants anticipated harms arising from ambiguity and misalignment around scaled notions of "worker wellbeing,'' underlying technical limitations to workplace-situated sensing, and assumptions regarding how social structures and relationships may shape the impacts and use of these technologies. Based on our findings, we discuss implications for designing worker-centered data-driven wellbeing technologies.