This program is designed to provide decision-makers, policymakers and professional working in policy with a foundational understanding of AI technology.
We use cookies to analyze the browsing and usage of our website and to personalize your experience. You can disable these technologies at any time, but this may limit certain functionalities of the site. Read our Privacy Policy for more information.
Setting cookies
You can enable and disable the types of cookies you wish to accept. However certain choices you make could affect the services offered on our sites (e.g. suggestions, personalised ads, etc.).
Essential cookies
These cookies are necessary for the operation of the site and cannot be deactivated. (Still active)
Analytics cookies
Do you accept the use of cookies to measure the audience of our sites?
Multimedia Player
Do you accept the use of cookies to display and allow you to watch the video content hosted by our partners (YouTube, etc.)?
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) through predictive and adapt… (see more)ive directions of (functional) policy ascent.
Leveraging the connection between policy iteration and policy gradient methods, we view policy optimization algorithms as iteratively solving a sequence of surrogate objectives, local lower bounds on the original objective. We define optimism as predictive modelling of the future behavior of a policy, and hindsight adaptation as taking immediate and anticipatory corrective actions to mitigate accumulating errors from overshooting predictions or delayed responses to change.
We use this shared lens to jointly express other well-known algorithms, including model-based policy improvement based on forward search, and optimistic meta-learning algorithms.
We show connections with Anderson acceleration, Nesterov's accelerated gradient, extra-gradient methods, and linear extrapolation in the update rule.
We analyze properties of the formulation, design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootst… (see more)rapping, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)—a policy-dependent model—and linearly combining them with instantaneous rewards.
We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the ?-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge—with a parameter ? capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an ??-discounted SF model makes more efficient use of sampled experience, compared to either extreme, i.e. bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.