
Fernando Diaz

Affiliate Member
Associate Professor, Carnegie Mellon University, School of Computer Science, Language Technologies Institute
Adjunct Professor, McGill University, School of Computer Science
Research Scientist, Google Pittsburgh
Research Topics
Information Retrieval
Recommender Systems

Biography

Fernando Diaz is an associate professor at Carnegie Mellon University's School of Computer Science, a research scientist at Google Pittsburgh, and an adjunct professor in McGill University’s School of Computer Science.

Diaz’s expertise lies in the formal study of searching large data sets for small fragments of information. His interests include distributed approaches to web search, interactive and faceted retrieval, mining temporal patterns from news and query logs, cross-lingual information retrieval, and graph-based methods.

Diaz’s primary research interest is information retrieval, i.e., the formal study of searching large collections of data for small bits of information. The most familiar form of information retrieval is web search, where users search a collection of webpages for one or a few relevant webpages. However, information retrieval goes far beyond web search to include processes like cross-lingual retrieval, personalization, desktop search, and interactive retrieval.

Diaz’s research experience includes distributed information retrieval approaches to web searching, interactive and faceted retrieval, mining of temporal patterns from news and query logs, cross-lingual information retrieval, graph-based retrieval methods, and exploiting information from multiple corpora.

For his PhD research, Diaz studied the relationship between document clustering and document scoring for retrieval using methods from machine learning and statistics. As a result, he developed an algorithm for system self-assessment and self-tuning that significantly improves the performance of retrieval algorithms across a variety of corpora.

Current Students

PhD - McGill University
Principal supervisor:

Publications

A Survey of Diversification Techniques in Search and Recommendation
Haolun Wu
Yansen Zhang
Chen Ma
Fuyuan Lyu
Bowei He
Bhaskar Mitra
Diversifying search results is an important research topic in retrieval systems in order to satisfy both the various interests of customers and the equal market exposure of providers. There has been growing attention on diversity-aware research during recent years, accompanied by a proliferation of literature on methods to promote diversity in search and recommendation. However, diversity-aware studies in retrieval systems lack a systematic organization and are rather fragmented. In this survey, we are the first to propose a unified taxonomy for classifying the metrics and approaches of diversification in both search and recommendation, which are two of the most extensively researched fields of retrieval systems. We begin the survey with a brief discussion of why diversity is important in retrieval systems, followed by a summary of the various diversity concerns in search and recommendation, highlighting their relationship and differences. For the survey’s main body, we present a unified taxonomy of diversification metrics and approaches in retrieval systems, from both the search and recommendation perspectives. In the later part of the survey, we discuss the open research questions of diversity-aware research in search and recommendation in an effort to inspire future innovations and encourage the implementation of diversity in real-world systems.
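
To make the notion of a diversification metric concrete, the following is a minimal, illustrative sketch in Python (not taken from the survey itself) of intra-list diversity, one commonly used metric: the average pairwise dissimilarity among the items in a single result or recommendation list. The item embeddings below are hypothetical.

# Illustrative sketch, not from the paper: intra-list diversity (ILD),
# the average pairwise (1 - cosine similarity) among items in one list.
import itertools
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def intra_list_diversity(item_vectors):
    """Average pairwise dissimilarity over all item pairs in the list."""
    pairs = list(itertools.combinations(item_vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# Toy example: three items with made-up 3-dimensional content embeddings.
ranked_items = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(intra_list_diversity(ranked_items))  # higher value = more diverse list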
Group Membership Bias
Ali Vardasbi
Maarten de Rijke
Mostafa Dehghani
Global AI Cultures
Rida Qadri
Arjun Subramonian
Sunipa Dev
Georgina Emma Born
Mary L. Gray
Jessica Quaye
Rachel Bergmann
Fairness Through Domain Awareness: Mitigating Popularity Bias For Music Discovery
Rebecca Salganik
As online music platforms grow, music recommender systems play a vital role in helping users navigate and discover content within their vast musical databases. At odds with this larger goal is the presence of popularity bias, which causes algorithmic systems to favor mainstream content over potentially more relevant but niche items. In this work we explore the intrinsic relationship between music discovery and popularity bias. To mitigate this issue we propose a domain-aware, individual fairness-based approach which addresses popularity bias in graph neural network (GNN)-based recommender systems. Our approach uses individual fairness to reflect a ground-truth listening experience, i.e., if two songs sound similar, this similarity should be reflected in their representations. In doing so, we facilitate meaningful music discovery that is robust to popularity bias and grounded in the music domain. We apply our BOOST methodology to two discovery-based tasks, performing recommendations at both the playlist level and user level. Then, we ground our evaluation in the cold-start setting, showing that our approach outperforms existing fairness benchmarks in both performance and recommendation of lesser-known content. Finally, our analysis explains why our proposed methodology is a novel and promising approach to mitigating popularity bias and improving the discovery of new and niche content in music recommender systems.
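
As a rough illustration of the individual-fairness idea described in the abstract (if two songs sound similar, their learned representations should also be close), the sketch below shows one possible penalty term of that general form. It is an assumption-laden toy example, not the paper's BOOST methodology; the audio-similarity matrix and embeddings are made up.

# Hypothetical individual-fairness-style penalty: pairs of songs with high
# audio similarity are penalized for having distant learned embeddings.
import numpy as np

def individual_fairness_penalty(embeddings, audio_similarity):
    """Average of audio_similarity[i, j] * squared embedding distance over pairs."""
    n = embeddings.shape[0]
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            emb_dist = np.linalg.norm(embeddings[i] - embeddings[j])
            penalty += audio_similarity[i, j] * emb_dist ** 2
    return penalty / (n * (n - 1) / 2)

# Toy example: 4 songs, 8-dimensional embeddings, random symmetric similarities.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
sim = rng.uniform(size=(4, 4))
sim = (sim + sim.T) / 2
print(individual_fairness_penalty(emb, sim))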
Scaling Laws Do Not Scale
Michael Madaio
Recent work has proposed a power law relationship, referred to as "scaling laws," between the performance of artificial intelligence (AI) models and aspects of those models' design (e.g., dataset size). In other words, as the size of a dataset (or model parameters, etc.) increases, the performance of a given model trained on that dataset will correspondingly increase. However, while compelling in the aggregate, this scaling law relationship overlooks the ways that metrics used to measure performance may be precarious and contested, or may not correspond with how different groups of people may perceive the quality of models' output. In this paper, we argue that as the size of datasets used to train large AI models grows, the number of distinct communities (including demographic groups) whose data is included in a given dataset is likely to grow, each of whom may have different values. As a result, there is an increased risk that communities represented in a dataset may have values or preferences not captured by (or in the worst case, at odds with) the metrics used to evaluate model performance for scaling laws. We end the paper with implications for AI scaling laws -- that models may not, in fact, continue to improve as the datasets get larger -- at least not for all people or communities impacted by those models.
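
For readers unfamiliar with the scaling-law form the paper critiques, the snippet below sketches the standard power-law relationship (aggregate error decaying as a power of dataset size) and a log-log fit of its exponent. The curve and constants are synthetic and only illustrate the aggregate relationship whose limits the paper discusses.

# Minimal sketch of the generic power-law scaling form err(N) ~ a * N**(-alpha),
# fit by linear regression in log-log space. Data points are synthetic.
import numpy as np

dataset_sizes = np.array([1e4, 1e5, 1e6, 1e7, 1e8])
errors = 2.5 * dataset_sizes ** -0.095           # synthetic aggregate error curve

# Fit log(err) = log(a) - alpha * log(N)
slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(errors), 1)
alpha, a = -slope, np.exp(intercept)
print(f"alpha ~= {alpha:.3f}, a ~= {a:.3f}")     # recovers the exponent used above

# The paper's point: this aggregate curve can keep improving while performance
# for particular communities represented in the data need not follow the trend.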
Best-Case Retrieval Evaluation: Improving the Sensitivity of Reciprocal Rank with Lexicographic Precision
Across a variety of ranking tasks, researchers use reciprocal rank to measure the effectiveness for users interested in exactly one relevant item. Despite its widespread use, evidence suggests that reciprocal rank is brittle when discriminating between systems. This brittleness, in turn, is compounded in modern evaluation settings where current, high-precision systems may be difficult to distinguish. We address the lack of sensitivity of reciprocal rank by introducing and connecting it to the concept of best-case retrieval, an evaluation method focusing on assessing the quality of a ranking for the most satisfied possible user across possible recall requirements. This perspective allows us to generalize reciprocal rank and define a new preference-based evaluation we call lexicographic precision or lexiprecision. By mathematical construction, we ensure that lexiprecision preserves differences detected by reciprocal rank, while empirically improving sensitivity and robustness across a broad set of retrieval and recommendation tasks.
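
For context, the snippet below sketches reciprocal rank, the standard measure the paper starts from, along with a toy case of the brittleness mentioned in the abstract (two systems that reciprocal rank cannot tell apart). Lexiprecision itself is defined in the paper and is not reproduced here.

# Reciprocal rank: 1 / (rank of the first relevant item), 0 if none retrieved.
def reciprocal_rank(ranked_ids, relevant_ids):
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Both systems place their first relevant item at rank 2, so reciprocal rank
# scores them identically, even though system B retrieves more relevant items.
system_a = ["d7", "d3", "d9", "d1"]
system_b = ["d7", "d3", "d4", "d5"]
relevant = {"d3", "d4", "d5"}
print(reciprocal_rank(system_a, relevant), reciprocal_rank(system_b, relevant))  # 0.5 0.5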
Commonality in Recommender Systems: Evaluating Recommender Systems to Enhance Cultural Citizenship
Andres Ferraro
Gustavo Ferreira
Georgina Born
Recall, Robustness, and Lexicographic Evaluation
Bhaskar Mitra
Preference-Based Offline Evaluation
C. Clarke
Negar Arabzadeh
A core step in production model research and development involves the offline evaluation of a system before production deployment. Traditional offline evaluation of search, recommender, and other systems involves gathering item relevance labels from human editors. These labels can then be used to assess system performance using offline evaluation metrics. Unfortunately, this approach does not work when evaluating highly effective ranking systems, such as those emerging from the advances in machine learning. Recent work demonstrates that moving away from pointwise item and metric evaluation can be a more effective approach to the offline evaluation of systems. This tutorial, intended for both researchers and practitioners, reviews early work in preference-based evaluation and covers recent developments in detail.
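
As a simple, hypothetical illustration of moving from pointwise relevance labels to preferences (not one of the tutorial's specific methods), the sketch below scores a ranking by the fraction of judged pairwise preferences it respects; all document identifiers are made up.

# Score a ranking against pairwise preference judgments (preferred, other).
def preference_agreement(ranking, preferences):
    """Fraction of judged preferences that the ranking places in the judged order."""
    position = {doc: rank for rank, doc in enumerate(ranking)}
    respected = sum(
        1 for preferred, other in preferences
        if position.get(preferred, len(ranking)) < position.get(other, len(ranking))
    )
    return respected / len(preferences)

# Editors judged d2 preferred to d5, and d1 preferred to d4; this ranking respects both.
ranking = ["d2", "d1", "d5", "d4"]
preferences = [("d2", "d5"), ("d1", "d4")]
print(preference_agreement(ranking, preferences))  # 1.0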
Recall as a Measure of Ranking Robustness
Bhaskar Mitra
A Survey of Diversification Metrics and Approaches in Retrieval Systems: From the Perspective of Search and Recommendation
Haolun Wu
Yansen Zhang
Chen Ma
Fuyuan Lyu
Diversifying search results is an important research topic in retrieval systems in order to satisfy both the various interests of customers and the equal market exposure of providers. There has been growing attention on diversity-aware research during recent years, accompanied by a proliferation of literature on methods to promote diversity in search and recommendation. However, diversity-aware studies in retrieval systems lack a systematic organization and are rather fragmented. In this survey, we are the first to propose a unified taxonomy for classifying the metrics and approaches of diversification in both search and recommendation, which are two of the most extensively researched fields of retrieval systems. We begin the survey with a brief discussion of why diversity is important in retrieval systems
Striving for data-model efficiency: Identifying data externalities on group performance
Esther Rolf
Ben Packer
Alex Beutel