Alexandra Olteanu

Investigating Failures to Generalize for Coreference Resolution Models

Kaheer Suleman

Adam Trischler

Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized -- i.e., how the theoretical concept of coreference is operationalized in the dataset -- due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with existing differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This breakdown helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models, and future work can explicitly account for those types of coreference that are empirically associated with poor generalization when developing models.

2024-08-01

Findings of the Association for Computational Linguistics ACL 2024 (published)

doi.org

arxiv.org

"One-Size-Fits-All"? Examining Expectations around What Constitute "Fair" or "Good" NLG System Behaviors

Li Lucy

Su Lin Blodgett

Milad Shokouhi

Hanna Wallach

Alexandra Olteanu

Fairness-related assumptions about what constitute appropriate NLG system behaviors range from invariance, where systems are expected to behave identically for social groups, to adaptation, where behaviors should instead vary across them. To illuminate tensions around invariance and adaptation, we conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs. Through these case studies, we examine people's expectations of system behaviors, and surface potential caveats of these contrasting yet commonly held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; in contrast, motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around what constitute "fair" or "good" NLG system behaviors.

2024-06-01

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (published)

doi.org

arxiv.org

How different mental models of AI-based writing assistants impact writers’ interactions with them

Shalaleh Rismani

Su Lin Blodgett

Alexandra Olteanu

Q. Vera Liao

AJung Moon

2024-05-11

Proceedings of the Third Workshop on Intelligent and Interactive Writing Assistants (published)

doi.org

ECBD: Evidence-Centered Benchmark Design for NLP

Yu Lu Liu

Su Lin Blodgett

Jackie Chi Kit Cheung

Q. Vera Liao

Alexandra Olteanu

Ziang Xiao

Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which datasets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark's measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices -- e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.

2024-01-01

ACL (1) (published)

doi.org

arxiv.org

What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models

Emily M. Bender

Jeongrok Yu

Timnit Gebru

Seong Ug Kim

Angelina McMillan-Major

Jacob Choi

Jinho D. Choi

Su Lin Blodgett

Solon Barocas

Hal Daumé III

Gilsinia Lopez

Alexandra Olteanu

Robert Sim

Hanna Wallach

Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based Masked Language Models (MLMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is needed more than ever. While many studies have evaluated gender bias in English MLMs, only a few works have been conducted for the task in other languages. This paper proposes a multilingual approach to estimate gender bias in MLMs from 5 languages: Chinese, English, German, Portuguese, and Spanish. Unlike previous work, our approach does not depend on parallel corpora coupled with English to detect gender bias in other languages using multilingual lexicons. Moreover, a novel model-based method is presented to generate sentence pairs for a more robust analysis of gender bias, compared to the traditional lexicon-based method. For each language, both the lexicon-based and model-based methods are applied to create two datasets respectively, which are used to evaluate gender bias in an MLM specifically trained for that language using one existing and 3 new scoring metrics. Our results show that the previous approach is data-sensitive and not stable as it does not remove contextual dependencies irrelevant to gender. In fact, the results often flip when different scoring metrics are used on the same dataset, suggesting that gender bias should be studied on a large dataset using multiple evaluation metrics for best practice.

2024-01-01

Inf. (published)

doi.org

arxiv.org

Responsible AI Research Needs Impact Statements Too

Alexandra Olteanu

Michael Ekstrand

Carlos Castillo

Jina Suh

All types of research, development, and policy work can have unintended, adverse consequences - work in responsible artificial intelligence (RAI), ethical AI, or ethics in AI is no exception.

2023-11-20

ArXiv (preprint)

doi.org

arxiv.org

Responsible AI Considerations in Text Summarization Research: A Review of Current Practices

Yu Lu Liu

Meng Cao

Su Lin Blodgett

Jackie Cheung

Alexandra Olteanu

Adam Trischler

AI and NLP publication venues have increasingly encouraged researchers to reflect on possible ethical considerations, adverse impacts, and other responsible AI issues their work might engender. However, for specific NLP tasks our understanding of how prevalent such issues are, or when and why these issues are likely to arise, remains limited. Focusing on text summarization, a common NLP task largely overlooked by the responsible AI community, we examine research and reporting practices in the current literature. We conduct a multi-round qualitative analysis of 333 summarization papers from the ACL Anthology published between 2020 and 2022. We focus on how, which, and when responsible AI issues are covered, which relevant stakeholders are considered, and mismatches between stated and realized research goals. We also discuss current evaluation practices and consider how authors discuss the limitations of both prior work and their own work. Overall, we find that relatively few papers engage with possible stakeholders or contexts of use, which limits their consideration of potential downstream adverse impacts or other responsible AI issues. Based on our findings, we make recommendations on concrete practices and research directions.

2023-10-07

EMNLP/2023/Conference (published)

openreview.net

Sensing Wellbeing in the Workplace, Why and For Whom? Envisioning Impacts with Organizational Stakeholders

Anna Kawakami

Shreya Chowdhary

Shamsi T. Iqbal

Q. Vera Liao

Alexandra Olteanu

Jina Suh

Koustuv Saha

With the heightened digitization of the workplace, alongside the rise of remote and hybrid work prompted by the pandemic, there is growing corporate interest in using passive sensing technologies for workplace wellbeing. Existing research on these technologies often focuses on understanding or improving interactions between an individual user and the technology. Workplace settings can, however, introduce a range of complexities that challenge the potential impact and in-practice desirability of wellbeing sensing technologies. Today, there is an inadequate empirical understanding of how everyday workers -- including those who are impacted by, and impact the deployment of, workplace technologies -- envision their broader socio-ecological impacts. In this study, we conduct storyboard-driven interviews with 33 participants across three stakeholder groups: organizational governors, AI builders, and worker data subjects. Overall, our findings surface how workers envisioned wellbeing sensing technologies may lead to cascading impacts on their broader organizational culture, interpersonal relationships with colleagues, and individual day-to-day lives. Participants anticipated harms arising from ambiguity and misalignment around scaled notions of "worker wellbeing," underlying technical limitations to workplace-situated sensing, and assumptions regarding how social structures and relationships may shape the impacts and use of these technologies. Based on our findings, we discuss implications for designing worker-centered data-driven wellbeing technologies.

2023-10-04

Proceedings of the ACM on Human-Computer Interaction (published)

doi.org

arxiv.org

FairPrism: Evaluating Fairness-Related Harms in Text Generation

Eve Fleisig

Aubrie Amstutz

Chad Atalla

Su Lin Blodgett

Hal Daumé III

Alexandra Olteanu

Emily Sheng

Dan Vann

Hanna Wallach

It is critical to measure and mitigate fairness-related harms caused by AI text generation systems, including stereotyping and demeaning harms. To that end, we introduce FairPrism, a dataset of 5,000 examples of AI-generated English text with detailed human annotations covering a diverse set of harms relating to gender and sexuality. FairPrism aims to address several limitations of existing datasets for measuring and mitigating fairness-related harms, including improved transparency, clearer specification of dataset coverage, and accounting for annotator disagreement and harms that are context-dependent. FairPrism’s annotations include the extent of stereotyping and demeaning harms, the demographic groups targeted, and appropriateness for different applications. The annotations also include specific harms that occur in interactive contexts and harms that raise normative concerns when the “speaker” is an AI system. Due to its precision and granularity, FairPrism can be used to diagnose (1) the types of fairness-related harms that AI text generation systems cause, and (2) the potential limitations of mitigation methods, both of which we illustrate through case studies. Finally, the process we followed to develop FairPrism offers a recipe for building improved datasets for measuring and mitigating harms caused by AI systems.

2023-07-01

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

doi.org

AHA!: Facilitating AI Impact Assessment by Generating Examples of Harms

Zana Buçinca

Chau Minh Pham

Maurice Jakesch

Marco Túlio Ribeiro

Alexandra Olteanu

Saleema Amershi

While demands for change and accountability for harmful AI consequences mount, foreseeing the downstream effects of deploying AI systems remains a challenging task. We developed AHA! (Anticipating Harms of AI), a generative framework to assist AI practitioners and decision-makers in anticipating potential harms and unintended consequences of AI systems prior to development or deployment. Given an AI deployment scenario, AHA! generates descriptions of possible harms for different stakeholders. To do so, AHA! systematically considers the interplay between common problematic AI behaviors as well as their potential impacts on different stakeholders, and narrates these conditions through vignettes. These vignettes are then filled in with descriptions of possible harms by prompting crowd workers and large language models. By examining 4113 harms surfaced by AHA! for five different AI deployment scenarios, we found that AHA! generates meaningful examples of harms, with different problematic AI behaviors resulting in different types of harms. Prompting both crowds and a large language model with the vignettes resulted in more diverse examples of harms than those generated by either the crowd or the model alone. To gauge AHA!'s potential practical utility, we also conducted semi-structured interviews with responsible AI professionals (N=9). Participants found AHA!'s systematic approach to surfacing harms important for ethical reflection and discovered meaningful stakeholders and harms they believed they would not have thought of otherwise. Participants, however, differed in their opinions about whether AHA! should be used upfront or as a secondary check and noted that AHA! may shift harm anticipation from an ideation problem to a potentially demanding review problem. Drawing on our results, we discuss design implications of building tools to help practitioners envision possible harms.

2023-06-05

ArXiv (preprint)

doi.org

arxiv.org

Can Workers Meaningfully Consent to Workplace Wellbeing Technologies?

Shreya Chowdhary

Anna Kawakami

Jina Suh

Mary L Gray

Alexandra Olteanu

Koustuv Saha

2023-01-01

FAccT (published)

doi.org

arxiv.org

Human-Centered Responsible Artificial Intelligence: Current & Future Trends

Mohammad Tahaei

Marios Constantinides

Daniele Quercia

Sean Kennedy

Michael Muller

Simone Stumpf

Q. Vera Liao

Ricardo Baeza-Yates

Lora Aroyo

Jess Holbrook

Ewa Luger

Michael Madaio

Ilana Golbin Blumenfeld

Maria De-Arteaga

Jessica Vitak

Alexandra Olteanu

2023-01-01

CHI Extended Abstracts (published)

doi.org

arxiv.org

Alexandra Olteanu

Publications