Maximilian Puelma Touzel

Performance-gated deliberation: A context-adapted strategy in which urgency is opportunity cost

Paul Cisek

Finding the right amount of deliberation, between insufficient and excessive, is a hard decision making problem that depends on the value we… (see more) place on our time. Average-reward, putatively encoded by tonic dopamine, serves in existing reinforcement learning theory as the opportunity cost of time, including deliberation time. Importantly, this cost can itself vary with the environmental context and is not trivial to estimate. Here, we propose how the opportunity cost of deliberation can be estimated adaptively on multiple timescales to account for non-stationary contextual factors. We use it in a simple decision-making heuristic based on average-reward reinforcement learning (AR-RL) that we call Performance-Gated Deliberation (PGD). We propose PGD as a strategy used by animals wherein deliberation cost is implemented directly as urgency, a previously characterized neural signal effectively controlling the speed of the decision-making process. We show PGD outperforms AR-RL solutions in explaining behaviour and urgency of non-human primates in a context-varying random walk prediction task and is consistent with relative performance and urgency in a context-varying random dot motion task. We make readily testable predictions for both neural activity and behaviour.

2022-04-30

PLoS Comput. Biol. (published)

doi.org

Continual Learning In Environments With Polynomial Mixing Times

Matthew Riemer

Sharath Chandra Raparthy

Ignacio Cases

Gopeshh Subbaraj

Maximilian Puelma Touzel

Irina Rish

The mixing time of the Markov chain induced by a policy limits performance in real-world continual learning scenarios. Yet, the effect of mi… (see more)xing times on learning in continual reinforcement learning (RL) remains underexplored. In this paper, we characterize problems that are of long-term interest to the development of continual RL, which we call scalable MDPs, through the lens of mixing times. In particular, we theoretically establish that scalable MDPs have mixing times that scale polynomially with the size of the problem. We go on to demonstrate that polynomial mixing times present significant difficulties for existing approaches, which suffer from myopic bias and stale bootstrapped estimates. To validate our theory, we study the empirical scaling behavior of mixing times with respect to the number of tasks and task duration for high performing policies deployed across multiple Atari games. Our analysis demonstrates both that polynomial mixing times do emerge in practice and how their existence may lead to unstable learning behavior like catastrophic forgetting in continual learning settings.

2021-12-31

Advances in Neural Information Processing Systems 35 (NeurIPS 2022) (published)

doi.org

openreview.net

Summarizing Societies: Agent Abstraction in Multi-Agent Reinforcement Learning

Maximilian Puelma Touzel

Matthew D Riemer

Rupali Bhati

Irina Rish

Agents cannot make sense of many-agent societies through direct consideration of small-scale, low-level agent identities, but instead must r… (see more)ecognize emergent collective identities. Here, we take a first step towards a framework for recognizing this structure in large groups of low-level agents so that they can be modeled as a much smaller number of high-level agents—a process that we call agent abstraction. We illustrate this process by extending bisimulation metrics for state abstraction in reinforcement learning to the setting of multi-agent reinforcement learning and analyze a straightforward, if crude, abstraction based on experienced joint actions. It addresses non-stationarity due to other learning agents by improving minimax regret by a intuitive factor. To test if this compression factor provides signal for higher-level agency, we applied it to a large dataset of human play of the popular social dilemma game Diplomacy. We find that it correlates strongly with the degree of ground-truth abstraction of low-level units into the human players.

2021-12-31

(published)

openreview.net

Non-normal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics

Giancarlo Kerg

Kyle Goyette

Maximilian Puelma Touzel

A recent strategy to circumvent the exploding and vanishing gradient problem in RNNs, and to allow the stable propagation of signals over lo… (see more)ng time scales, is to constrain recurrent connectivity matrices to be orthogonal or unitary. This ensures eigenvalues with unit norm and thus stable dynamics and training. However this comes at the cost of reduced expressivity due to the limited variety of orthogonal transformations. We propose a novel connectivity structure based on the Schur decomposition and a splitting of the Schur form into normal and non-normal parts. This allows to parametrize matrices with unit-norm eigenspectra without orthogonality constraints on eigenbases. The resulting architecture ensures access to a larger space of spectrally constrained matrices, of which orthogonal matrices are a subset. This crucial difference retains the stability advantages and training speed of orthogonal RNNs while enhancing expressivity, especially on tasks that require computations over ongoing input sequences.

2019-09-05

NeurIPS.cc/2019/Reproducibility_Challenge (unknown)

doi.org

openreview.net

Mila on Udemy

AI Policy Fellowship Publications

Mila Ventures Launchpad

Maximilian Puelma Touzel

Publications

Mila on Udemy

AI Policy Fellowship Publications

Mila Ventures Launchpad

Popular keywords:

Maximilian Puelma Touzel

Publications