Portrait de Alexandra Olteanu

Alexandra Olteanu

Membre industriel associé
Chercheuse principale et co-fondatrice de l'équipe FATE, apprentissage profond et automatisé, Microsoft Research, Montréal
Sujets de recherche
Recherche d'information
Traitement du langage naturel

Publications

"I Am the One and Only, Your Cyber BFF": Understanding the Impact of GenAI Requires Understanding the Impact of Anthropomorphic AI
Myra Cheng
Alicia DeVrio
Lisa Egede
Su Lin Blodgett
Many state-of-the-art generative AI (GenAI) systems are increasingly prone to anthropomorphic behaviors, i.e., to generating outputs that ar… (voir plus)e perceived to be human-like. While this has led to scholars increasingly raising concerns about possible negative impacts such anthropomorphic AI systems can give rise to, anthropomorphism in AI development, deployment, and use remains vastly overlooked, understudied, and underspecified. In this perspective, we argue that we cannot thoroughly map the social impacts of generative AI without mapping the social impacts of anthropomorphic AI, and outline a call to action.
Investigating Failures to Generalize for Coreference Resolution Models
Ian Porada
Kaheer Suleman
Adam Trischler
Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized -- i.e., how … (voir plus)the theoretical concept of coreference is operationalized in the dataset -- due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with existing differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This break down helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models; and, future work can explicitly account for those types of coreference that are empirically associated with poor generalization when developing models.
"One-Size-Fits-All"? Examining Expectations around What Constitute"Fair"or"Good"NLG System Behaviors
Li Lucy
Su Lin Blodgett
Milad Shokouhi
Hanna Wallach
Fairness-related assumptions about what constitute appropriate NLG system behaviors range from invariance, where systems are expected to beh… (voir plus)ave identically for social groups, to adaptation, where behaviors should instead vary across them. To illuminate tensions around invariance and adaptation, we conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs. Through these cases studies, we examine people's expectations of system behaviors, and surface potential caveats of these contrasting yet commonly held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; in contrast, motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around what constitute"fair"or"good"NLG system behaviors.
How different mental models of AI-based writing assistants impact writers’ interactions with them
Shalaleh Rismani
Su Lin Blodgett
Q. Vera Liao
ECBD: Evidence-Centered Benchmark Design for NLP
Yu Lu Liu
Su Lin Blodgett
Jackie Chi
Q. Vera Liao
Kit Cheung
Ziang Xiao
Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which dat… (voir plus)asets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark's measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices -- e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models
Emily M. Bender
Jeongrok Yu
Timnit Gebru
Seong Ug Kim
Angelina McMillan-642
Jacob Choi
Jinho D. Choi
Su Lin Blodgett
Solon Barocas
Hal Daumé III
Gilsinia Lopez
Robert Sim
Hanna Wallach. 2021
Stereotyp-657
Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based Masked Language Models (M… (voir plus)LMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is needed more than ever. While many studies have evaluated gender bias in English MLMs, only a few works have been conducted for the task in other languages. This paper proposes a multilingual approach to estimate gender bias in MLMs from 5 languages: Chinese, English, German, Portuguese, and Spanish. Unlike previous work, our approach does not depend on parallel corpora coupled with English to detect gender bias in other languages using multilingual lexicons. Moreover, a novel model-based method is presented to generate sentence pairs for a more robust analysis of gender bias, compared to the traditional lexicon-based method. For each language, both the lexicon-based and model-based methods are applied to create two datasets respectively, which are used to evaluate gender bias in an MLM specifically trained for that language using one existing and 3 new scoring metrics. Our results show that the previous approach is data-sensitive and not stable as it does not remove contextual dependencies irrelevant to gender. In fact, the results often flip when different scoring metrics are used on the same dataset, suggesting that gender bias should be studied on a large dataset using multiple evaluation metrics for best practice.
Responsible AI Research Needs Impact Statements Too
Michael Ekstrand
Carlos Castillo
Jina Suh
All types of research, development, and policy work can have unintended, adverse consequences - work in responsible artificial intelligence … (voir plus)(RAI), ethical AI, or ethics in AI is no exception.
Responsible AI Considerations in Text Summarization Research: A Review of Current Practices
Yu Lu Liu
Meng Cao
Su Lin Blodgett
Adam Trischler
AI and NLP publication venues have increasingly encouraged researchers to reflect on possible ethical considerations, adverse impacts, and o… (voir plus)ther responsible AI issues their work might engender. However, for specific NLP tasks our understanding of how prevalent such issues are, or when and why these issues are likely to arise, remains limited. Focusing on text summarization—a common NLP task largely overlooked by the responsible AI community—we examine research and reporting practices in the current literature. We conduct a multi-round qualitative analysis of 333 summarization papers from the ACL Anthology published between 2020–2022. We focus on how, which, and when responsible AI issues are covered, which relevant stakeholders are considered, and mismatches between stated and realized research goals. We also discuss current evaluation practices and consider how authors discuss the limitations of both prior work and their own work. Overall, we find that relatively few papers engage with possible stakeholders or contexts of use, which limits their consideration of potential downstream adverse impacts or other responsible AI issues. Based on our findings, we make recommendations on concrete practices and research directions.
Sensing Wellbeing in the Workplace, Why and For Whom? Envisioning Impacts with Organizational Stakeholders
Anna Kawakami
Shreya Chowdhary
Shamsi T. Iqbal
Q. Vera Liao
Jina Suh
Koustuv Saha
With the heightened digitization of the workplace, alongside the rise of remote and hybrid work prompted by the pandemic, there is growing c… (voir plus)orporate interest in using passive sensing technologies for workplace wellbeing. Existing research on these technologies often focus on understanding or improving interactions between an individual user and the technology. Workplace settings can, however, introduce a range of complexities that challenge the potential impact and in-practice desirability of wellbeing sensing technologies. Today, there is an inadequate empirical understanding of how everyday workers---including those who are impacted by, and impact the deployment of workplace technologies--envision its broader socio-ecological impacts. In this study, we conduct storyboard-driven interviews with 33 participants across three stakeholder groups: organizational governors, AI builders, and worker data subjects. Overall, our findings surface how workers envisioned wellbeing sensing technologies may lead to cascading impacts on their broader organizational culture, interpersonal relationships with colleagues, and individual day-to-day lives. Participants anticipated harms arising from ambiguity and misalignment around scaled notions of "worker wellbeing,'' underlying technical limitations to workplace-situated sensing, and assumptions regarding how social structures and relationships may shape the impacts and use of these technologies. Based on our findings, we discuss implications for designing worker-centered data-driven wellbeing technologies.
FairPrism: Evaluating Fairness-Related Harms in Text Generation
Eve Fleisig
Aubrie Amstutz
Chad Atalla
Su Lin Blodgett
Hal Daumé III
Emily Sheng
Dan Vann
Hanna Wallach
It is critical to measure and mitigate fairness-related harms caused by AI text generation systems, including stereotyping and demeaning har… (voir plus)ms. To that end, we introduce FairPrism, a dataset of 5,000 examples of AI-generated English text with detailed human annotations covering a diverse set of harms relating to gender and sexuality. FairPrism aims to address several limitations of existing datasets for measuring and mitigating fairness-related harms, including improved transparency, clearer specification of dataset coverage, and accounting for annotator disagreement and harms that are context-dependent. FairPrism’s annotations include the extent of stereotyping and demeaning harms, the demographic groups targeted, and appropriateness for different applications. The annotations also include specific harms that occur in interactive contexts and harms that raise normative concerns when the “speaker” is an AI system. Due to its precision and granularity, FairPrism can be used to diagnose (1) the types of fairness-related harms that AI text generation systems cause, and (2) the potential limitations of mitigation methods, both of which we illustrate through case studies. Finally, the process we followed to develop FairPrism offers a recipe for building improved datasets for measuring and mitigating harms caused by AI systems.
AHA!: Facilitating AI Impact Assessment by Generating Examples of Harms
Zana Buçinca
Chau Minh Pham
Maurice Jakesch
Marco Túlio Ribeiro
Saleema Amershi
While demands for change and accountability for harmful AI consequences mount, foreseeing the downstream effects of deploying AI systems rem… (voir plus)ains a challenging task. We developed AHA! (Anticipating Harms of AI), a generative framework to assist AI practitioners and decision-makers in anticipating potential harms and unintended consequences of AI systems prior to development or deployment. Given an AI deployment scenario, AHA! generates descriptions of possible harms for different stakeholders. To do so, AHA! systematically considers the interplay between common problematic AI behaviors as well as their potential impacts on different stakeholders, and narrates these conditions through vignettes. These vignettes are then filled in with descriptions of possible harms by prompting crowd workers and large language models. By examining 4113 harms surfaced by AHA! for five different AI deployment scenarios, we found that AHA! generates meaningful examples of harms, with different problematic AI behaviors resulting in different types of harms. Prompting both crowds and a large language model with the vignettes resulted in more diverse examples of harms than those generated by either the crowd or the model alone. To gauge AHA!'s potential practical utility, we also conducted semi-structured interviews with responsible AI professionals (N=9). Participants found AHA!'s systematic approach to surfacing harms important for ethical reflection and discovered meaningful stakeholders and harms they believed they would not have thought of otherwise. Participants, however, differed in their opinions about whether AHA! should be used upfront or as a secondary-check and noted that AHA! may shift harm anticipation from an ideation problem to a potentially demanding review problem. Drawing on our results, we discuss design implications of building tools to help practitioners envision possible harms.
Can Workers Meaningfully Consent to Workplace Wellbeing Technologies?
Shreya Chowdhary
Anna Kawakami
Jina Suh
Mary L Gray
Koustuv Saha