Alexandra Olteanu

The Romantic Historicism and The Rise of the Historical Novel in the 19th Century Romanian Literature

2024-12-31

Philobiblon. Transylvanian Journal of Multidisciplinary Research in the Humanities (publié)

Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey

Emily Sheng

Su Lin Blodgett

Alexandra Chouldechova

Jean Garcia-Gathright

Hanna Wallach

To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has p… (voir plus)roduced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how these instruments could be improved. Via a series of semi-structured interviews with practitioners in a variety of roles in different organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms caused by LLM-based systems: (1) challenges related to using publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.

2024-11-23

ArXiv (prépublication)

Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey

Emily Sheng

Su Lin Blodgett

Alexandra Chouldechova

Jean Garcia-Gathright

Hanna Wallach

To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has p… (voir plus)roduced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how these instruments could be improved. Via a series of semi-structured interviews with practitioners in a variety of roles in different organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms caused by LLM-based systems: (1) challenges related to using publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.

2024-11-23

ArXiv (prépublication)

Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey

Emily Sheng

Su Lin Blodgett

Alexandra Chouldechova

Jean Garcia-Gathright

Hanna Wallach

To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has p… (voir plus)roduced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how these instruments could be improved. Via a series of semi-structured interviews with practitioners in a variety of roles in different organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms caused by LLM-based systems: (1) challenges related to using publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.

2024-11-23

ArXiv (prépublication)

"It was 80% me, 20% AI": Seeking Authenticity in Co-Writing with Large Language Models

Angel Hsing-Chi Hwang

Q. V. Liao

Su Lin Blodgett

Adam Trischler

Given the rising proliferation and diversity of AI writing assistance tools, especially those powered by large language models (LLMs), both … (voir plus)writers and readers may have concerns about the impact of these tools on the authenticity of writing work. We examine whether and how writers want to preserve their authentic voice when co-writing with AI tools and whether personalization of AI writing support could help achieve this goal. We conducted semi-structured interviews with 19 professional writers, during which they co-wrote with both personalized and non-personalized AI writing-support tools. We supplemented writers' perspectives with opinions from 30 avid readers about the written work co-produced with AI collected through an online survey. Our findings illuminate conceptions of authenticity in human-AI co-creation, which focus more on the process and experience of constructing creators' authentic selves. While writers reacted positively to personalized AI writing tools, they believed the form of personalization needs to target writers' growth and go beyond the phase of text production. Overall, readers' responses showed less concern about human-AI co-writing. Readers could not distinguish AI-assisted work, personalized or not, from writers' solo-written work and showed positive attitudes toward writers experimenting with new technology for creative writing.

2024-11-20

ArXiv (prépublication)

Evaluating Generative AI Systems is a Social Science Measurement Challenge

Hanna Wallach

Meera Desai

Nicholas Pangakis

A. F. Cooper

Angelina Wang

Solon Barocas

Alexandra Chouldechova

Chad Atalla

Su Lin Blodgett

Emily Corvi

P. A. Dow

Jean Garcia-Gathright

Stefanie Reed

Emily Sheng

Dan Vann

Jennifer Wortman Vaughan

Matthew Vogel

Hannah Washington

Abigail Z. Jacobs … (voir 1 de plus)

Microsoft Research

Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI … (voir plus)(GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.

2024-11-17

ArXiv (prépublication)

Evaluating Generative AI Systems is a Social Science Measurement Challenge

Hanna Wallach

Meera Desai

Nicholas Pangakis

A. Feder Cooper

Angelina Wang

Solon Barocas

Alexandra Chouldechova

Chad Atalla

Su Lin Blodgett

Emily Corvi

P. A. Dow

Jean Garcia-Gathright

Stefanie Reed

Emily Sheng

Dan Vann

Jennifer Wortman Vaughan

Matthew Vogel

Hannah Washington

Abigail Z. Jacobs … (voir 1 de plus)

Microsoft Research

Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI … (voir plus)(GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.

2024-11-01

arXiv (publié)

"I Am the One and Only, Your Cyber BFF": Understanding the Impact of GenAI Requires Understanding the Impact of Anthropomorphic AI

Myra Cheng

Alicia DeVrio

Lisa Egede

Su Lin Blodgett

Many state-of-the-art generative AI (GenAI) systems are increasingly prone to anthropomorphic behaviors, i.e., to generating outputs that ar… (voir plus)e perceived to be human-like. While this has led to scholars increasingly raising concerns about possible negative impacts such anthropomorphic AI systems can give rise to, anthropomorphism in AI development, deployment, and use remains vastly overlooked, understudied, and underspecified. In this perspective, we argue that we cannot thoroughly map the social impacts of generative AI without mapping the social impacts of anthropomorphic AI, and outline a call to action.

2024-10-11

ArXiv (prépublication)

"I Am the One and Only, Your Cyber BFF": Understanding the Impact of GenAI Requires Understanding the Impact of Anthropomorphic AI

Myra Cheng

Alicia DeVrio

Lisa Egede

Su Lin Blodgett

Many state-of-the-art generative AI (GenAI) systems are increasingly prone to anthropomorphic behaviors, i.e., to generating outputs that ar… (voir plus)e perceived to be human-like. While this has led to scholars increasingly raising concerns about possible negative impacts such anthropomorphic AI systems can give rise to, anthropomorphism in AI development, deployment, and use remains vastly overlooked, understudied, and underspecified. In this perspective, we argue that we cannot thoroughly map the social impacts of generative AI without mapping the social impacts of anthropomorphic AI, and outline a call to action.

2024-10-11

ArXiv (prépublication)

Investigating Failures to Generalize for Coreference Resolution Models

Ian Porada

Kaheer Suleman

Adam Trischler

Jackie Cheung

Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized -- i.e., how … (voir plus)the theoretical concept of coreference is operationalized in the dataset -- due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with existing differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This break down helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models; and, future work can explicitly account for those types of coreference that are empirically associated with poor generalization when developing models.

2024-08-01

Findings of the Association for Computational Linguistics ACL 2024 (publié)

"One-Size-Fits-All"? Examining Expectations around What Constitute"Fair"or"Good"NLG System Behaviors

Li Lucy

Su Lin Blodgett

Milad Shokouhi

Hanna Wallach

Fairness-related assumptions about what constitute appropriate NLG system behaviors range from invariance, where systems are expected to beh… (voir plus)ave identically for social groups, to adaptation, where behaviors should instead vary across them. To illuminate tensions around invariance and adaptation, we conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs. Through these cases studies, we examine people's expectations of system behaviors, and surface potential caveats of these contrasting yet commonly held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; in contrast, motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around what constitute"fair"or"good"NLG system behaviors.

2024-06-01

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (publié)