
Su Lin Blodgett

Independent Visiting Researcher
Research Topics
AI Ethics
Human-Centered AI
Responsible AI
Natural Language Processing

Publications

ECBD: Evidence-Centered Benchmark Design for NLP
Jackie Chi Kit Cheung
Q. Vera Liao
A.R. Olteanu
Ziang Xiao
Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which datasets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark's measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices -- e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
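The framework's core requirement, that every design choice be described, justified, and supported by evidence, can be pictured as a simple record kept per module. The sketch below is a minimal illustration of that idea only; the module name and all field names are hypothetical placeholders, not ECBD's actual five modules.

```python
# Minimal sketch: one record per benchmark design decision, requiring a
# description, a justification, and supporting evidence. All names here
# are hypothetical illustrations, not ECBD's actual module structure.
from dataclasses import dataclass

@dataclass
class DesignDecision:
    description: str    # what was decided (e.g., which datasets to include)
    justification: str  # why this choice serves the capability of interest
    evidence: str       # how the choice is supported (pilots, references, ...)

benchmark_design = {
    "capability specification": DesignDecision(  # hypothetical module name
        description="Measure reading comprehension over yes/no questions.",
        justification="Matches the benchmark's stated measurement goal.",
        evidence="Pilot annotation study on a sample of items.",
    ),
    # ... one entry per design module, filled in the same way
}

# Flag any design choice left undescribed, unjustified, or unsupported.
for module, decision in benchmark_design.items():
    for field_name in ("description", "justification", "evidence"):
        assert getattr(decision, field_name), f"{module}: missing {field_name}"
```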
What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models
Jeongrok Yu
Seong Ug Kim
Jacob Choi
Jinho D. Choi
Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based Masked Language Models (MLMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is needed more than ever. While many studies have evaluated gender bias in English MLMs, only a few works have been conducted for the task in other languages. This paper proposes a multilingual approach to estimate gender bias in MLMs from 5 languages: Chinese, English, German, Portuguese, and Spanish. Unlike previous work, our approach does not depend on parallel corpora coupled with English to detect gender bias in other languages using multilingual lexicons. Moreover, a novel model-based method is presented to generate sentence pairs for a more robust analysis of gender bias, compared to the traditional lexicon-based method. For each language, both the lexicon-based and model-based methods are applied to create two datasets respectively, which are used to evaluate gender bias in an MLM specifically trained for that language using one existing and 3 new scoring metrics. Our results show that the previous approach is data-sensitive and not stable as it does not remove contextual dependencies irrelevant to gender. In fact, the results often flip when different scoring metrics are used on the same dataset, suggesting that gender bias should be studied on a large dataset using multiple evaluation metrics for best practice.
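To make the lexicon-based idea concrete, the sketch below probes a masked language model for the probabilities it assigns to gendered fill-ins of a single template. The checkpoint, template, and word pair are assumptions chosen for illustration; the paper's actual datasets, languages, and scoring metrics are more elaborate than this.

```python
# Illustrative sketch of lexicon-based gender bias probing in an MLM:
# compare the model's probabilities for gendered fill-ins of one template.
# Model name, template, and lexicon are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # assumption: any MLM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

template = f"{tokenizer.mask_token} works as a nurse."
inputs = tokenizer(template, return_tensors="pt")
# Position of the [MASK] token in the input sequence.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits
probs = logits[0, mask_pos].softmax(dim=-1)

# Compare the probability mass the model puts on a tiny gendered lexicon.
for word in ("he", "she"):
    word_id = tokenizer.convert_tokens_to_ids(word)
    print(f"P({word!r}) = {probs[word_id].item():.4f}")
```

A large gap between the two probabilities for role-word templates like this one is the kind of signal the lexicon-based method aggregates; the abstract's caution applies, since such scores can flip depending on the metric and data used.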
FairPrism: Evaluating Fairness-Related Harms in Text Generation
Eve Fleisig
Aubrie Amstutz
Chad Atalla
Hal Daumé III
A.R. Olteanu
Emily Sheng
Dan Vann
Hanna Wallach
It is critical to measure and mitigate fairness-related harms caused by AI text generation systems, including stereotyping and demeaning harms. To that end, we introduce FairPrism, a dataset of 5,000 examples of AI-generated English text with detailed human annotations covering a diverse set of harms relating to gender and sexuality. FairPrism aims to address several limitations of existing datasets for measuring and mitigating fairness-related harms, including improved transparency, clearer specification of dataset coverage, and accounting for annotator disagreement and harms that are context-dependent. FairPrism's annotations include the extent of stereotyping and demeaning harms, the demographic groups targeted, and appropriateness for different applications. The annotations also include specific harms that occur in interactive contexts and harms that raise normative concerns when the "speaker" is an AI system. Due to its precision and granularity, FairPrism can be used to diagnose (1) the types of fairness-related harms that AI text generation systems cause, and (2) the potential limitations of mitigation methods, both of which we illustrate through case studies. Finally, the process we followed to develop FairPrism offers a recipe for building improved datasets for measuring and mitigating harms caused by AI systems.
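The annotation dimensions described above can be read as a per-example record. Below is a minimal sketch of such a record; every field name is hypothetical and merely mirrors the harm dimensions listed in the abstract, not FairPrism's actual schema.

```python
# Hypothetical sketch of a FairPrism-style annotated example. Field names
# mirror the harm dimensions in the abstract; they are NOT the real schema.
from dataclasses import dataclass, field

@dataclass
class AnnotatedGeneration:
    text: str                   # the AI-generated English text being annotated
    stereotyping_harm: int      # annotated extent, e.g. 0 (none) to 2 (severe)
    demeaning_harm: int         # annotated extent on the same scale
    targeted_groups: list = field(default_factory=list)   # demographic groups
    appropriate_for: list = field(default_factory=list)   # application contexts

example = AnnotatedGeneration(
    text="<generated output>",
    stereotyping_harm=1,
    demeaning_harm=0,
    targeted_groups=["gender: women"],
    appropriate_for=["creative fiction"],
)

# Diagnosis-style query: does this example carry any annotated harm?
print(max(example.stereotyping_harm, example.demeaning_harm) > 0)
```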
Responsible AI Considerations in Text Summarization Research: A Review of Current Practices
Meng Cao
Jackie CK Cheung
A.R. Olteanu
Adam Trischler