
Homayoun Honari

Collaborating researcher - Université de Montréal
Research Topics
AGI (Artificial General Intelligence)
Brain-inspired AI
Causality
Causality-Inspired Methods
Cognition
Consciousness
Generalization
Machine Learning Theory
Reasoning
Reinforcement Learning
Representation Learning
Robotics

Publications

Align and Filter: Improving Performance in Asynchronous On-Policy RL
Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: policy lag, the mismatch between the behavior policy generating the data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use these findings to propose total Variation-based Advantage-aligned Constrained policy Optimization as a practical approach to mitigating policy lag. We empirically validate our method and show that it offers better robustness to policy lag on classic RL tasks and on a modern RL-for-LLM math reasoning task.
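To make the notion of policy lag concrete, here is a minimal sketch (not the paper's method) of measuring lag as the total variation distance between the behavior policy that generated a batch and the current learning policy, and filtering stale samples before an update. The function names, the `max_lag` threshold, and the batch layout are illustrative assumptions.

```python
import torch

def total_variation_lag(behavior_probs: torch.Tensor,
                        learner_probs: torch.Tensor) -> torch.Tensor:
    """Per-state total variation distance between the behavior policy
    (which generated the data) and the current learning policy.

    behavior_probs, learner_probs: [batch, num_actions] action distributions.
    Returns: [batch] total variation distances in [0, 1].
    """
    return 0.5 * (behavior_probs - learner_probs).abs().sum(dim=-1)

def filter_stale_samples(batch: dict, behavior_probs: torch.Tensor,
                         learner_probs: torch.Tensor, max_lag: float = 0.2):
    """Drop samples whose policy lag exceeds a threshold before the update.

    `batch` is any dict of [batch, ...] tensors (observations, actions,
    advantages, ...); `max_lag` is an illustrative cutoff, not a value
    from the paper.
    """
    lag = total_variation_lag(behavior_probs, learner_probs)
    keep = lag <= max_lag
    return {k: v[keep] for k, v in batch.items()}, lag
```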
Training PPO-Clip with Parallelized Data Generation: A Case of Fixed-Point Convergence
In recent years, with the increase in GPU compute power, parallelized data collection has become the dominant approach for training reinforcement learning (RL) agents. Proximal Policy Optimization (PPO) is one of the most widely used on-policy methods for training RL agents. In this paper, we focus on the training behavior of PPO-Clip as the number of parallel environments grows. In particular, we show that as we increase the amount of data used to train PPO-Clip, the optimized policy converges to a fixed distribution. We use this result to study the behavior of PPO-Clip in two case studies: the effect of changing the minibatch size, and the effect of increasing the number of parallel environments versus increasing the rollout length. The experiments show that high-return PPO runs exhibit slower convergence to the fixed distribution and larger consecutive KL divergence changes. Our results aim to offer a better understanding of how PPO's performance can be predicted as the number of parallel environments scales.
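For reference, below is a minimal sketch of the standard PPO-Clip surrogate loss applied to a batch collected from parallel environments, the setting this abstract studies. The function name and the default `clip_eps` value are illustrative assumptions, not details taken from the paper.

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO-Clip surrogate loss over a batch of transitions
    (batch size = num_parallel_envs * rollout_length).

    log_probs_new:  log pi_theta(a|s) under the current learning policy.
    log_probs_old:  log pi_old(a|s) under the policy that collected the data.
    advantages:     advantage estimates for the sampled actions.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the clipped surrogate, i.e. minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```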