Portrait de Prakash Panangaden

Prakash Panangaden

Membre académique principal
Sujets de recherche
Apprentissage par renforcement
Modèles probabilistes
Raisonnement
Théorie de l'apprentissage automatique
Théorie de l'information quantique

Biographie

Prakash Panangaden a étudié la physique à l'Indian Institute of Technology de Kanpur, en Inde. Il a obtenu une maîtrise en physique de l'Université de Chicago, où il a étudié l'émission stimulée des trous noirs. Il a ensuite obtenu un doctorat en physique de l'Université du Wisconsin-Milwaukee, dans lequel il s’est penché sur la théorie quantique des champs dans un espace-temps courbe.

Il a été professeur adjoint d'informatique à l'Université Cornell, où il a principalement travaillé sur la sémantique des langages de programmation concurrents. Depuis 1990, il travaillait à l'Université McGill. Au cours des 25 dernières années, il s'est intéressé à de nombreux aspects des processus de Markov : équivalence des processus, caractérisation logique, approximation et métrique.

Récemment, il a travaillé sur l'utilisation des métriques pour améliorer l'apprentissage des représentations. Il a également publié des articles sur la physique, l'information quantique et les mathématiques pures. Il est membre de la Société royale du Canada et de l'Association for Computing Machinery (ACM).

Étudiants actuels

Maîtrise recherche - McGill
Co-superviseur⋅e :

Publications

Quantitative Equational Logic
Giorgio Bacci
Radu Mardare
Gordon Plotkin
We develop a quantitative analogue of equational reasoning, which we call quantitative equational logic. The quantitative equations use, ins… (voir plus)tead of classical equality, quantitative equalities, which are equalities indexed with nonnegative reals. Thus, s = ε t means that “ s and t are points in a metric space and their distance is less than ε”. Quantitative equalities will be used to encode behavioural distances, with ε being an upper bound on the measure of dissimilarity between two terms. We develop the metatheory of this subject. We define a notion of quantitative algebra, which is the quantitative analogue of universal algebra. We prove a completeness theorem for quantitative equational logic, and we show that we obtain monads on suitable categories of metric spaces. We present a set of examples where the free algebra of a quantitative equational theory corresponds to some well-known structure. These examples are: Hausdorff metrics from quantitative semilattices; p -Wasserstein metrics (hence also the Kantorovich metric), and the total variation metric.
Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems
Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise prefer… (voir plus)ences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the \textit{Markov decision contest} as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.
Learning from Pairwise Preferences in Long-Term Decision Problems
Agents that can beat or tie any other under a model of pairwise preference have strong guarantees for both user satisfaction and overall soc… (voir plus)ial welfare. However, searching for these agents in long-term decision problems is not computationally tractable with current approaches, which require the size of an agent's policy to increase with the problem length. We introduce the \textit{Markov decision contest}, a model of learning from general preferences in long-term (infinite-horizon) decision problems. Within this model, we prove that agents only need a stationary Markov policy in order to be optimal (that is, to beat or tie any agent with a history-dependent policy); that the problem of finding an optimal policy is in P; and that a simple iterative algorithm (which we call Hedged Policy Iteration) converges to an optimal policy at a sublinear rate. In a suite of high-dimensional experiments, we demonstrate that Hedged Policy Iteration scales well to function approximation. Lastly, we present a near approximation of Hedged Policy Iteration, called HPI-Clip, which both matches the performance of Proximal Policy Optimization on reward-based tasks while also outperforming it on tasks with non-transitive preferences. These results show that learning from pairwise preferences in long-term decision problems can be far more tractable than what is known from prior work.
Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning
Trevor McInroe
Christopher G. Lucas
David Abel
Stefano V Albrecht
Conditions on Preference Relations that Guarantee the Existence of Optimal Policies
Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactiv… (voir plus)e learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. We show that a decision-making problem can have optimal policies -- that are characterized by recursive optimality equations -- even when no reward function can express the learning goal. These findings underline the need to explore preference-based learning strategies which do not assume that preferences are generated by reward.
Polynomial Lawvere Logic
Giorgio Bacci
Radu Mardare
Gordon D. Plotkin
Optimal Approximate Minimization of One-Letter Weighted Finite Automata
In this paper, we study the approximate minimization problem of weighted finite automata (WFAs): to compute the best possible approximation … (voir plus)of a WFA given a bound on the number of states. By reformulating the problem in terms of Hankel matrices, we leverage classical results on the approximation of Hankel operators, namely the celebrated Adamyan-Arov-Krein (AAK) theory. We solve the optimal spectral-norm approximate minimization problem for irredundant WFAs with real weights, defined over a one-letter alphabet. We present a theoretical analysis based on AAK theory, and bounds on the quality of the approximation in the spectral norm and
Policy Gradient Methods in the Presence of Symmetries and State Abstractions
Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In th… (voir plus)is paper, we study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction.
Sum and Tensor of Quantitative Effects
Giorgio Bacci
Radu Mardare
Gordon Plotkin
Behavioural pseudometrics for continuous-time diffusions
Propositional Logics for the Lawvere Quantale
Giorgio Bacci
Radu Mardare
Gordon Plotkin
Behavioural equivalences for continuous-time Markov processes