Prakash Panangaden

Maîtrise recherche - McGill

Co-superviseur⋅e :

Publications

Quantitative Equational Logic

Giorgio Bacci

Radu Mardare

Gordon Plotkin

We develop a quantitative analogue of equational reasoning, which we call quantitative equational logic. The quantitative equations use, ins… (voir plus)tead of classical equality, quantitative equalities, which are equalities indexed with nonnegative reals. Thus, s = ε t means that “ s and t are points in a metric space and their distance is less than ε”. Quantitative equalities will be used to encode behavioural distances, with ε being an upper bound on the measure of dissimilarity between two terms. We develop the metatheory of this subject. We define a notion of quantitative algebra, which is the quantitative analogue of universal algebra. We prove a completeness theorem for quantitative equational logic, and we show that we obtain monads on suitable categories of metric spaces. We present a set of examples where the free algebra of a quantitative equational theory corresponds to some well-known structure. These examples are: Hausdorff metrics from quantitative semilattices; p -Wasserstein metrics (hence also the Kantorovich metric), and the total variation metric.

2026-05-28

Journal of the ACM (publié)

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Benjamin Van Roy

Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise prefer… (voir plus)ences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the \textit{Markov decision contest} as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.

2026-05-28

arXiv (prépublication)

arxiv.org

Learning from Pairwise Preferences in Long-Term Decision Problems

Benjamin Van Roy

Agents that can beat or tie any other under a model of pairwise preference have strong guarantees for both user satisfaction and overall soc… (voir plus)ial welfare. However, searching for these agents in long-term decision problems is not computationally tractable with current approaches, which require the size of an agent's policy to increase with the problem length. We introduce the \textit{Markov decision contest}, a model of learning from general preferences in long-term (infinite-horizon) decision problems. Within this model, we prove that agents only need a stationary Markov policy in order to be optimal (that is, to beat or tie any agent with a history-dependent policy); that the problem of finding an optimal policy is in P; and that a simple iterative algorithm (which we call Hedged Policy Iteration) converges to an optimal policy at a sublinear rate. In a suite of high-dimensional experiments, we demonstrate that Hedged Policy Iteration scales well to function approximation. Lastly, we present a near approximation of Hedged Policy Iteration, called HPI-Clip, which both matches the performance of Proximal Policy Optimization on reward-based tasks while also outperforming it on tasks with non-transitive preferences. These results show that learning from pairwise preferences in long-term decision problems can be far more tractable than what is known from prior work.

2025-12-31

International Conference on Machine Learning (Accept (regular))

openreview.net

Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning

Samuel Garcin

Trevor McInroe

Pablo Samuel Castro

Christopher G. Lucas

David Abel

Stefano V Albrecht

2025-01-21

ICLR.cc/2025/Conference (poster)

openreview.net

Conditions on Preference Relations that Guarantee the Existence of Optimal Policies

Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactiv… (voir plus)e learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. We show that a decision-making problem can have optimal policies -- that are characterized by recursive optimality equations -- even when no reward function can express the learning goal. These findings underline the need to explore preference-based learning strategies which do not assume that preferences are generated by reward.

2024-04-17

Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (publié)

proceedings.mlr.press

Polynomial Lawvere Logic

Giorgio Bacci

Radu Mardare

Gordon D. Plotkin

2024-02-04

ArXiv (prépublication)

arxiv.org

Optimal Approximate Minimization of One-Letter Weighted Finite Automata

Clara Lacroce

Borja Balle

Guillaume Rabusseau

In this paper, we study the approximate minimization problem of weighted finite automata (WFAs): to compute the best possible approximation … (voir plus)of a WFA given a bound on the number of states. By reformulating the problem in terms of Hankel matrices, we leverage classical results on the approximation of Hankel operators, namely the celebrated Adamyan-Arov-Krein (AAK) theory. We solve the optimal spectral-norm approximate minimization problem for irredundant WFAs with real weights, defined over a one-letter alphabet. We present a theoretical analysis based on AAK theory, and bounds on the quality of the approximation in the spectral norm and

2023-12-31

Mathematical Structures in Computer Science (publié)

openreview.net

Policy Gradient Methods in the Presence of Symmetries and State Abstractions

Sahand Rezaei-Shoshtari

Rosie Zhao

David Meger

Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In th… (voir plus)is paper, we study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction.

2023-12-31

Journal of Machine Learning Research (publié)

arxiv.org

Sum and Tensor of Quantitative Effects

Giorgio Bacci

Radu Mardare

Gordon Plotkin

2023-12-31

Log. Methods Comput. Sci. (publié)