Jonathan Colaço Carr

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Benjamin Van Roy

Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the \textit{Markov decision contest} as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.

2026-05-28

arXiv (preprint)

doi.org

arxiv.org

Learning from Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr

Prakash Panangaden

Doina Precup

Benjamin Van Roy

Agents that can beat or tie any other under a model of pairwise preference have strong guarantees for both user satisfaction and overall social welfare. However, searching for these agents in long-term decision problems is not computationally tractable with current approaches, which require the size of an agent's policy to increase with the problem length. We introduce the \textit{Markov decision contest}, a model of learning from general preferences in long-term (infinite-horizon) decision problems. Within this model, we prove that agents only need a stationary Markov policy in order to be optimal (that is, to beat or tie any agent with a history-dependent policy); that the problem of finding an optimal policy is in P; and that a simple iterative algorithm (which we call Hedged Policy Iteration) converges to an optimal policy at a sublinear rate. In a suite of high-dimensional experiments, we demonstrate that Hedged Policy Iteration scales well to function approximation. Lastly, we present a close approximation of Hedged Policy Iteration, called HPI-Clip, which matches the performance of Proximal Policy Optimization on reward-based tasks and outperforms it on tasks with non-transitive preferences. These results show that learning from pairwise preferences in long-term decision problems can be far more tractable than what is known from prior work.

2025-12-31

International Conference on Machine Learning (Accept (regular))

openreview.net
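The abstract above refers to agents that "beat or tie any other" and to non-transitive preferences. As a rough illustration only, the sketch below uses a one-step rock-paper-scissors preference table; it is not the Markov decision contest or Hedged Policy Iteration, and the table `PREF` and the helper names are assumptions made for this example. It shows that no single option beats or ties every other, while the uniform mixture does.

```python
# Toy one-step illustration of non-transitive pairwise preferences.
# Assumption: PREF[i][j] is the probability that option i is preferred to option j.
OPTIONS = ["rock", "paper", "scissors"]
PREF = [
    [0.5, 0.0, 1.0],  # rock: loses to paper, beats scissors
    [1.0, 0.5, 0.0],  # paper: beats rock, loses to scissors
    [0.0, 1.0, 0.5],  # scissors: loses to rock, beats paper
]


def win_probability(p, q):
    """Probability that a draw from mixture p is preferred to a draw from mixture q."""
    n = len(PREF)
    return sum(p[i] * q[j] * PREF[i][j] for i in range(n) for j in range(n))


def beats_or_ties_all(p, tol=1e-9):
    """Check p against every pure option.

    win_probability is linear in q, so beating or tying every pure option
    implies beating or tying every mixture as well.
    """
    n = len(PREF)
    for j in range(n):
        q = [1.0 if k == j else 0.0 for k in range(n)]
        if win_probability(p, q) < 0.5 - tol:
            return False
    return True


uniform = [1 / 3, 1 / 3, 1 / 3]
pure_rock = [1.0, 0.0, 0.0]

print(beats_or_ties_all(uniform))    # the uniform mixture ties every option
print(beats_or_ties_all(pure_rock))  # rock alone loses to paper
```

In this toy case the preferences form a cycle, so no scalar ranking of the three options reproduces them, which is why the check is stated in terms of pairwise win probabilities rather than rewards.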

Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset

Khaoula Chehbouni

Jonathan Colaço Carr

Yash More

Jackie CK Cheung

Golnoosh Farnadi

In an effort to mitigate the harms of large language models (LLMs), learning from human feedback (LHF) has been used to steer LLMs towards outputs that are intended to be both less harmful and more helpful. Despite the widespread adoption of LHF in practice, the quality of this feedback and its effectiveness as a safety mitigation technique remain unclear. This study addresses these issues by auditing the widely-used Helpful and Harmless (HH) dataset by Anthropic. Our work includes: (1) a thorough investigation of the dataset's content through both manual and automated evaluation; (2) experiments demonstrating the dataset's impact on models' safety; and (3) an analysis of the 100 most influential papers citing this dataset. Through our audit, we showcase how conceptualization failures and quality issues identified in the HH dataset can create additional harms by leading to disparate safety behaviors across demographic groups. Our findings highlight the need for more nuanced, context-sensitive approaches to safety mitigation in LLMs.

2024-11-11

arXiv (preprint)

doi.org

arxiv.org

Conditions on Preference Relations that Guarantee the Existence of Optimal Policies

Jonathan Colaço Carr

Prakash Panangaden

Doina Precup

Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. We show that a decision-making problem can have optimal policies -- that are characterized by recursive optimality equations -- even when no reward function can express the learning goal. These findings underline the need to explore preference-based learning strategies which do not assume that preferences are generated by reward.

2024-04-17

Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (published)

doi.org

proceedings.mlr.press
