Alexandra Olteanu

Nicholas J Pangakis

Stefanie Reed

Emily Sheng

Dan Vann

Jennifer Wortman Vaughan

Matthew Vogel

Hannah Washington

Abigail Z. Jacobs

The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been descri… (see more)bed as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI systems. This framework has two important implications: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating validity.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Position: Evaluating Generative AI Systems is a Social Science Measurement Challenge

Hanna Wallach

Meera Desai

A. Feder Cooper

Angelina Wang

Chad Atalla

Solon Barocas

Su Lin Blodgett

Alexandra Chouldechova

Emily Corvi

P. A. Dow

Jean Garcia-Gathright

Nicholas Pangakis

Stefanie Reed

Emily Sheng

Dan Vann

Jennifer Wortman Vaughan

Matthew Vogel

Hannah Washington

Abigail Z. Jacobs

The measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult, leading to what has been described as"a… (see more) tangle of sloppy tests [and] apples-to-oranges comparisons"(Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI. This framework has two important implications for designing and evaluating evaluations: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor

Su Lin Blodgett

Agathe Balayn

Angelina Wang

Fernando Diaz

F. Calmon

Margaret Mitchell

Michael Ekstrand

Reuben Daniel Binns

Solon Barocas

In AI research and practice, rigor remains largely understood in terms of methodological rigor -- such as whether mathematical, statistical,… (see more) or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about AI capabilities. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception -- in addition to a more expansive understanding of (1) methodological rigor -- should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also aim to provide useful language and a framework for much-needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders.

2025-06-17

ArXiv (preprint)

Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor

Su Lin Blodgett

Agathe Balayn

Angelina Wang

Fernando Diaz

F. Calmon

Margaret Mitchell

Michael Ekstrand

Reuben Daniel Binns

Solon Barocas

In AI research and practice, rigor remains largely understood in terms of methodological rigor -- such as whether mathematical, statistical,… (see more) or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about AI capabilities. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception -- in addition to a more expansive understanding of (1) methodological rigor -- should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also aim to provide useful language and a framework for much-needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders.

2025-06-17

ArXiv (preprint)

Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey

Emily Sheng

Su Lin Blodgett

Alexandra Chouldechova

Jean Garcia-Gathright

Hanna Wallach

The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language mo… (see more)del (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments - even useful instruments - are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.

2025-06-04

ArXiv (preprint)

"It was 80% me, 20% AI": Seeking Authenticity in Co-Writing with Large Language Models

Angel Hsing-Chi Hwang

Q. Vera Liao

Su Lin Blodgett

Adam Trischler

2025-05-02

Proceedings of the ACM on Human-Computer Interaction (published)

AI Automatons: AI Systems Intended to Imitate Humans

Solon Barocas

Su Lin Blodgett

Lisa Egede

Alicia DeVrio

Myra Cheng

There is a growing proliferation of AI systems designed to mimic people's behavior, work, abilities, likenesses, or humanness -- systems we … (see more)dub AI automatons. Individuals, groups, or generic humans are being simulated to produce creative work in their styles, to respond to surveys in their places, to probe how they would use a new system before deployment, to provide users with assistance and companionship, and to anticipate their possible future behavior and interactions with others, just to name a few applications. The research, design, deployment, and availability of such AI systems have, however, also prompted growing concerns about a wide range of possible legal, ethical, and other social impacts. To both 1) facilitate productive discussions about whether, when, and how to design and deploy such systems, and 2) chart the current landscape of existing and prospective AI automatons, we need to tease apart determinant design axes and considerations that can aid our understanding of whether and how various design choices along these axes could mitigate -- or instead exacerbate -- potential adverse impacts that the development and use of AI automatons could give rise to. In this paper, through a synthesis of related literature and extensive examples of existing AI systems intended to mimic humans, we develop a conceptual framework to help foreground key axes of design variations and provide analytical scaffolding to foster greater recognition of the design choices available to developers, as well as the possible ethical implications these choices might have.

2025-03-04

ArXiv (preprint)

AI Automatons: AI Systems Intended to Imitate Humans

Solon Barocas

Su Lin Blodgett

Lisa Egede

Alicia DeVrio

Myra Cheng

There is a growing proliferation of AI systems designed to mimic people's behavior, work, abilities, likenesses, or humanness -- systems we … (see more)dub AI automatons. Individuals, groups, or generic humans are being simulated to produce creative work in their styles, to respond to surveys in their places, to probe how they would use a new system before deployment, to provide users with assistance and companionship, and to anticipate their possible future behavior and interactions with others, just to name a few applications. The research, design, deployment, and availability of such AI systems have, however, also prompted growing concerns about a wide range of possible legal, ethical, and other social impacts. To both 1) facilitate productive discussions about whether, when, and how to design and deploy such systems, and 2) chart the current landscape of existing and prospective AI automatons, we need to tease apart determinant design axes and considerations that can aid our understanding of whether and how various design choices along these axes could mitigate -- or instead exacerbate -- potential adverse impacts that the development and use of AI automatons could give rise to. In this paper, through a synthesis of related literature and extensive examples of existing AI systems intended to mimic humans, we develop a conceptual framework to help foreground key axes of design variations and provide analytical scaffolding to foster greater recognition of the design choices available to developers, as well as the possible ethical implications these choices might have.

2025-03-01

arXiv (published)

The In-Situ Effect of Offensive Ads on Search Engine Users

Elad Yom-Tov

Liat Levontin

2025-02-25

ACM Transactions on Information Systems (published)

Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems

Myra Cheng

Su Lin Blodgett

Alicia DeVrio

Lisa Egede

As text generation systems' outputs are increasingly anthropomorphic -- perceived as human-like -- scholars have also increasingly raised co… (see more)ncerns about how such outputs can lead to harmful outcomes, such as users over-relying or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and a crowdsourcing study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.

2025-02-19

ArXiv (preprint)

Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems

Myra Cheng

Su Lin Blodgett

Alicia DeVrio

Lisa Egede

As text generation systems' outputs are increasingly anthropomorphic -- perceived as human-like -- scholars have also raised increasing conc… (see more)erns about how such outputs can lead to harmful outcomes, such as users over-relying or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and a crowdsourced study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.

2025-02-19

ArXiv (preprint)

A Taxonomy of Linguistic Expressions That Contribute To Anthropomorphism of Language Technologies

Alicia DeVrio

Myra Cheng

Lisa Egede

Su Lin Blodgett

Recent attention to anthropomorphism -- the attribution of human-like qualities to non-human objects or entities -- of language technologies… (see more) like LLMs has sparked renewed discussions about potential negative impacts of anthropomorphism. To productively discuss the impacts of this anthropomorphism and in what contexts it is appropriate, we need a shared vocabulary for the vast variety of ways that language can be anthropomorphic. In this work, we draw on existing literature and analyze empirical cases of user interactions with language technologies to develop a taxonomy of textual expressions that can contribute to anthropomorphism. We highlight challenges and tensions involved in understanding linguistic anthropomorphism, such as how all language is fundamentally human and how efforts to characterize and shift perceptions of humanness in machines can also dehumanize certain humans. We discuss ways that our taxonomy supports more precise and effective discussions of and decisions about anthropomorphism of language technologies.

2025-02-14

ArXiv (preprint)