
Fernando Diaz

Affiliate Member
Associate Professor, Carnegie Mellon University, School of Computer Science, Language Technologies Institute
Adjunct Professor, McGill University, School of Computer Science
Research Scientist, Google Pittsburgh
Research Topics
Information Retrieval
Recommender Systems

Biography

Fernando Diaz is an associate professor in the School of Computer Science at Carnegie Mellon University. He is also a research scientist at Google (Pittsburgh) and an associate member of the School of Computer Science at McGill University.

His primary research interest is information retrieval, the formal study of finding small fragments of information within large collections of data. The most familiar example of information retrieval is web search, where users look through a collection of web pages for one or a few relevant pages. Information retrieval goes well beyond this, however, and includes, for example, cross-lingual search, personalization, desktop search, and interactive retrieval. Over the course of his work, Fernando Diaz has explored distributed approaches to web-scale information retrieval, interactive and faceted retrieval, temporal models built from news and query data, cross-lingual information retrieval, graph-based retrieval methods, and mining information across multiple corpora.

In his dissertation, he studied the relationship between document clustering and document scoring for retrieval, using machine learning and statistical methods. In doing so, he developed an algorithm that lets a retrieval system evaluate and tune itself, substantially improving retrieval performance across a variety of corpora.

Current Students

PhD - McGill
Principal supervisor:

Publications

A Survey of Diversification Techniques in Search and Recommendation
Haolun Wu
Yansen Zhang
Chen Ma
Fuyuan Lyu
Bowei He
Bhaskar Mitra
Diversifying search results is an important research topic in retrieval systems in order to satisfy both the various interests of customers and the equal market exposure of providers. There has been growing attention on diversity-aware research in recent years, accompanied by a proliferation of literature on methods to promote diversity in search and recommendation. However, diversity-aware studies in retrieval systems lack a systematic organization and are rather fragmented. In this survey, we are the first to propose a unified taxonomy for classifying the metrics and approaches of diversification in both search and recommendation, which are two of the most extensively researched fields of retrieval systems. We begin the survey with a brief discussion of why diversity is important in retrieval systems, followed by a summary of the various diversity concerns in search and recommendation, highlighting their relationship and differences. For the survey's main body, we present a unified taxonomy of diversification metrics and approaches in retrieval systems, from both the search and recommendation perspectives. In the later part of the survey, we discuss the open research questions of diversity-aware research in search and recommendation in an effort to inspire future innovations and encourage the implementation of diversity in real-world systems.
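As a concrete illustration of the kind of metrics such a taxonomy covers, the sketch below computes two common diversity measures for a single ranked list: intra-list diversity (average pairwise dissimilarity of item embeddings) and category coverage. The item vectors, genre labels, and function names are hypothetical toy choices, not artifacts of the survey itself.

```python
# Illustrative sketch only: two common diversity metrics of the kind such
# surveys classify (intra-list diversity and category coverage). The item
# vectors and category labels below are hypothetical toy data.
from itertools import combinations

import numpy as np


def intra_list_diversity(item_vectors: np.ndarray) -> float:
    """Average pairwise cosine distance between the items in one ranked list."""
    dists = []
    for i, j in combinations(range(len(item_vectors)), 2):
        a, b = item_vectors[i], item_vectors[j]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        dists.append(1.0 - cos)
    return float(np.mean(dists))


def category_coverage(item_categories: list[str], all_categories: set[str]) -> float:
    """Fraction of known categories represented in the ranked list."""
    return len(set(item_categories)) / len(all_categories)


# Toy example: three recommended items with 2-d embeddings and genre labels.
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(intra_list_diversity(vectors))                                # higher = more diverse
print(category_coverage(["pop", "pop", "jazz"], {"pop", "jazz", "rock"}))
```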
Group Membership Bias
Ali Vardasbi
Maarten de Rijke
Mostafa Dehghani
Global AI Cultures
Rida Qadri
Arjun Subramonian
Sunipa Dev
Georgina Emma Born
Mary L. Gray
Jessica Quaye
Rachel Bergmann
Fairness Through Domain Awareness: Mitigating Popularity Bias For Music Discovery
Rebecca Salganik
As online music platforms grow, music recommender systems play a vital role in helping users navigate and discover content within their vast musical databases. At odds with this larger goal is the presence of popularity bias, which causes algorithmic systems to favor mainstream content over potentially more relevant but niche items. In this work we explore the intrinsic relationship between music discovery and popularity bias. To mitigate this issue, we propose a domain-aware, individual-fairness-based approach that addresses popularity bias in graph neural network (GNN) based recommender systems. Our approach uses individual fairness to reflect a ground-truth listening experience, i.e., if two songs sound similar, this similarity should be reflected in their representations. In doing so, we facilitate meaningful music discovery that is robust to popularity bias and grounded in the music domain. We apply our BOOST methodology to two discovery-based tasks, performing recommendations at both the playlist level and the user level. Then, we ground our evaluation in the cold-start setting, showing that our approach outperforms existing fairness benchmarks in both performance and recommendation of lesser-known content. Finally, our analysis explains why our proposed methodology is a novel and promising approach to mitigating popularity bias and improving the discovery of new and niche content in music recommender systems.
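The abstract's individual-fairness idea (similar-sounding songs should have similar representations) can be sketched as a simple training penalty. The code below is a rough, hypothetical illustration of that general idea, not the paper's BOOST method; all tensors, dimensions, and names are placeholders.

```python
# Minimal sketch of an individual-fairness penalty in the spirit described in
# the abstract: if two songs are close in a "ground truth" audio space, their
# learned embeddings should also be close. This is NOT the paper's BOOST
# method; the tensors and dimensions below are hypothetical placeholders.
import torch


def individual_fairness_penalty(embeddings: torch.Tensor,
                                audio_features: torch.Tensor) -> torch.Tensor:
    """Penalize pairs whose embedding distance exceeds their audio distance."""
    emb_dist = torch.cdist(embeddings, embeddings, p=2)            # learned space
    audio_dist = torch.cdist(audio_features, audio_features, p=2)  # "ground truth" space
    # Hinge: only pairs farther apart in the learned space than in the audio
    # space contribute, so similar-sounding songs are pulled together.
    return torch.clamp(emb_dist - audio_dist, min=0.0).mean()


# Toy usage: 4 songs, 8-d learned embeddings, 5-d audio descriptors.
emb = torch.randn(4, 8, requires_grad=True)
audio = torch.randn(4, 5)
loss = individual_fairness_penalty(emb, audio)
loss.backward()
```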
Scaling Laws Do Not Scale
Michael Madaio
Recent work has proposed a power-law relationship, referred to as "scaling laws," between the performance of artificial intelligence (AI) models and aspects of those models' design (e.g., dataset size). In other words, as the size of a dataset (or model parameters, etc.) increases, the performance of a given model trained on that dataset will correspondingly increase. However, while compelling in the aggregate, this scaling-law relationship overlooks the ways that metrics used to measure performance may be precarious and contested, or may not correspond with how different groups of people perceive the quality of models' output. In this paper, we argue that as the size of datasets used to train large AI models grows, the number of distinct communities (including demographic groups) whose data is included in a given dataset is likely to grow, each of whom may have different values. As a result, there is an increased risk that communities represented in a dataset may have values or preferences not captured by (or, in the worst case, at odds with) the metrics used to evaluate model performance for scaling laws. We end the paper with implications for AI scaling laws: models may not, in fact, continue to improve as datasets get larger, at least not for all people or communities impacted by those models.
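For readers unfamiliar with the functional form being critiqued, scaling-law work typically fits aggregate error as a power law of dataset size, roughly error(N) ≈ a·N^(-b) + c. The snippet below fits that generic form to synthetic data; the numbers are invented for illustration and are not results from the paper.

```python
# Illustration of the generic power-law form that "scaling laws" posit:
# error(N) ≈ a * N**(-b) + c, where N is dataset size. The synthetic data and
# constants are invented purely to show the functional form the paper
# critiques; they are not results from the paper.
import numpy as np
from scipy.optimize import curve_fit


def power_law(n, a, b, c):
    return a * n ** (-b) + c


# Synthetic "aggregate error vs. dataset size" observations.
sizes = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
errors = power_law(sizes, a=5.0, b=0.3, c=0.05) + np.random.normal(0, 0.002, 5)

params, _ = curve_fit(power_law, sizes, errors, p0=[1.0, 0.5, 0.0])
a, b, c = params
print(f"fitted error(N) ~ {a:.2f} * N^-{b:.2f} + {c:.3f}")
# The paper's point: even if this aggregate fit looks good, particular
# communities represented in the data may see no such improvement as N grows.
```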
Best-Case Retrieval Evaluation: Improving the Sensitivity of Reciprocal Rank with Lexicographic Precision
Across a variety of ranking tasks, researchers use reciprocal rank to measure the effectiveness for users interested in exactly one relevant item. Despite its widespread use, evidence suggests that reciprocal rank is brittle when discriminating between systems. This brittleness, in turn, is compounded in modern evaluation settings where current, high-precision systems may be difficult to distinguish. We address the lack of sensitivity of reciprocal rank by introducing and connecting it to the concept of best-case retrieval, an evaluation method focusing on assessing the quality of a ranking for the most satisfied possible user across possible recall requirements. This perspective allows us to generalize reciprocal rank and define a new preference-based evaluation we call lexicographic precision or lexiprecision. By mathematical construction, we ensure that lexiprecision preserves differences detected by reciprocal rank, while empirically improving sensitivity and robustness across a broad set of retrieval and recommendation tasks.
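Reciprocal rank itself is standard: the inverse of the rank of the first relevant item. The sketch below computes it and adds a rough lexicographic tie-breaking comparison suggested by the abstract's description; the comparison is an illustrative guess at the preference idea, not the paper's exact definition of lexiprecision.

```python
# Reciprocal rank is standard: 1 / (position of the first relevant item).
# The lexicographic comparison below is only a rough sketch of the preference
# idea suggested by the abstract (comparing rankings by the positions of their
# relevant items, best position first); it is not the paper's exact definition.
def reciprocal_rank(relevance: list[int]) -> float:
    """relevance[i] = 1 if the item at rank i+1 is relevant, else 0."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def relevant_positions(relevance: list[int]) -> list[int]:
    return [rank for rank, rel in enumerate(relevance, start=1) if rel]


def lexicographic_preference(run_a: list[int], run_b: list[int]) -> int:
    """Return -1 if run_a is preferred, 1 if run_b is, 0 if tied (sketch only)."""
    pos_a, pos_b = relevant_positions(run_a), relevant_positions(run_b)
    if pos_a < pos_b:   # Python compares lists lexicographically
        return -1
    if pos_a > pos_b:
        return 1
    return 0


# Both runs place their first relevant item at rank 2, so reciprocal rank ties ...
run_a = [0, 1, 1, 0]
run_b = [0, 1, 0, 1]
print(reciprocal_rank(run_a), reciprocal_rank(run_b))   # 0.5 0.5
# ... but the sketched lexicographic preference breaks the tie in favor of run_a.
print(lexicographic_preference(run_a, run_b))           # -1
```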
Commonality in Recommender Systems: Evaluating Recommender Systems to Enhance Cultural Citizenship
Andres Ferraro
Gustavo Ferreira
Georgina Born
Recall, Robustness, and Lexicographic Evaluation
Bhaskar Mitra
Preference-Based Offline Evaluation
C. Clarke
Negar Arabzadeh
A core step in production model research and development involves the offline evaluation of a system before production deployment. Traditional offline evaluation of search, recommender, and other systems involves gathering item relevance labels from human editors. These labels can then be used to assess system performance using offline evaluation metrics. Unfortunately, this approach does not work when evaluating highly effective ranking systems, such as those emerging from the advances in machine learning. Recent work demonstrates that moving away from pointwise item and metric evaluation can be a more effective approach to the offline evaluation of systems. This tutorial, intended for both researchers and practitioners, reviews early work in preference-based evaluation and covers recent developments in detail.
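A generic illustration of the preference-based idea: rather than scoring each system against pointwise relevance labels, assessors state which of two systems they prefer on each query, and the evaluation aggregates win rates. The helper and judgment data below are hypothetical and not drawn from the tutorial.

```python
# Generic illustration of preference-based offline evaluation: instead of
# scoring each system with pointwise relevance labels, assessors state which
# of two systems' results they prefer per query, and we aggregate win rates.
# The judgment data below is a hypothetical placeholder, not from the tutorial.
from collections import Counter


def preference_win_rate(judgments: list[str]) -> dict[str, float]:
    """judgments[i] in {"A", "B", "tie"} is the assessor's preference on query i."""
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        return {"A": 0.5, "B": 0.5}
    return {"A": counts["A"] / decided, "B": counts["B"] / decided}


# Toy set of per-query preferences between system A and system B.
print(preference_win_rate(["A", "A", "B", "tie", "A"]))  # {'A': 0.75, 'B': 0.25}
```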
Recall as a Measure of Ranking Robustness
Bhaskar Mitra
A Survey of Diversification Metrics and Approaches in Retrieval Systems: From the Perspective of Search and Recommendation
Haolun Wu
Yansen Zhang
Chen Ma
Fuyuan Lyu
Diversifying search results is an important research topic in retrieval systems in order to satisfy both the various interests of customers and the equal market exposure of providers. There has been growing attention on diversity-aware research in recent years, accompanied by a proliferation of literature on methods to promote diversity in search and recommendation. However, diversity-aware studies in retrieval systems lack a systematic organization and are rather fragmented. In this survey, we are the first to propose a unified taxonomy for classifying the metrics and approaches of diversification in both search and recommendation, which are two of the most extensively researched fields of retrieval systems. We begin the survey with a brief discussion of why diversity is important in retrieval systems …
Striving for data-model efficiency: Identifying data externalities on group performance
Esther Rolf
Ben Packer
Alex Beutel