Editor’s Note: This post was originally published on the Google AI Blog on November 17, 2021 and is based on this paper, which received an Outstanding Paper Award at NeurIPS 2021. This work was done in collaboration with Max Schwarzer, Aaron Courville and Marc G. Bellemare of Mila.
Reinforcement learning (RL) is an area of machine learning that focuses on learning from experiences to solve decision-making tasks. The field of RL has made significant progress resulting in impressive empirical results on complex tasks, such as playing video games, flying stratospheric balloons, and designing hardware chips. Yet it is becoming increasingly apparent that the current standards for empirical evaluation might give a false sense of fast scientific progress while slowing it down.
To that end, we present here “Deep RL at the Edge of the Statistical Precipice”, which won the Outstanding Paper Award at NeurIPS 2021. We discuss how statistical uncertainty of results should be considered, especially when using only a few training runs, to ensure reliable evaluation in deep RL. Specifically, the predominant practice of reporting point estimates ignores this uncertainty and hinders the reproducibility of results. Similarly, tables with per-task scores, as are commonly reported, can be overwhelming beyond a few tasks and often omit standard deviations. Furthermore, simple performance metrics like the mean can be dominated by a few outlier tasks, while the median score would remain unaffected even if up to half of the tasks had performance scores of zero. Thus, to increase the field’s confidence in reported results with a handful of runs, we propose various statistical tools, including stratified bootstrap confidence intervals, performance profiles, and better metrics, such as interquartile mean and probability of improvement. To help researchers incorporate these tools, we also released an easy-to-use Python library RLiable with a quickstart colab.
Statistical Uncertainty in RL Evaluation
Empirical research in RL relies on evaluating performance on a diverse suite of tasks such as Atari 2600 video games to assess progress. Published results on deep RL benchmarks typically compare point estimates of the mean and median scores aggregated across tasks. These scores are typically relative to some defined baseline and optimal performance (e.g., random agent and “average” human performance on Atari games, respectively) to make scores comparable across different tasks.
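For instance, here is a minimal sketch of how such score normalization works; the raw, random, and human scores below are hypothetical, not taken from any published benchmark:

```python
def normalized_score(raw_score, random_score, human_score):
    """Normalize a raw game score so that 0 corresponds to a random agent
    and 1 corresponds to average human performance."""
    return (raw_score - random_score) / (human_score - random_score)

# Hypothetical example for a single Atari game.
print(normalized_score(raw_score=3500.0, random_score=250.0, human_score=7000.0))  # ~0.48
```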
Most RL experiments have randomness in the scores obtained from different training runs, so reporting only point estimates does not reveal whether similar results would be obtained with new independent runs. A small number of training runs, coupled with the high variability in the performance of deep RL algorithms, often leads to large statistical uncertainty in such point estimates.
As benchmarks become increasingly complex, evaluating more than a few runs will be progressively demanding due to the greater compute and data needed to solve such tasks. For example, five runs on 50 Atari games for 200 million frames takes 1000+ GPU days. Thus, evaluating more runs is not a feasible solution for reducing statistical uncertainty on computationally demanding benchmarks. While prior work has recommended statistical significance tests as a solution, such tests are dichotomous in nature (either “significant” or “not significant”), so they often lack the granularity needed to yield meaningful insights and are widely misinterpreted.
Tools for Reliable Evaluation
Any aggregate metric based on a finite number of runs is a random variable. To take this into account, we advocate for reporting stratified bootstrap confidence intervals (CIs), which predict the likely values of aggregate metrics if the same experiment were repeated with different runs. These CIs allow us to understand the statistical uncertainty and reproducibility of results. Such CIs use scores from runs combined across tasks: for example, evaluating three runs each on Atari 100k, which contains 26 tasks, results in 78 sample scores for uncertainty estimation.
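To make the procedure concrete, here is a minimal sketch of a stratified bootstrap CI in plain NumPy (not the RLiable implementation); the `scores` array is hypothetical and the aggregate here is simply the mean:

```python
import numpy as np

def stratified_bootstrap_ci(scores, aggregate_fn=np.mean, num_resamples=2000,
                            confidence=0.95, seed=0):
    """Stratified bootstrap CI: resample runs independently within each task,
    recompute the aggregate metric, and take percentiles of the resamples.

    Args:
      scores: array of shape (num_runs, num_tasks) with normalized scores.
    """
    rng = np.random.default_rng(seed)
    num_runs, num_tasks = scores.shape
    estimates = []
    for _ in range(num_resamples):
        # Resample run indices with replacement, separately for each task (stratum).
        idx = rng.integers(num_runs, size=(num_runs, num_tasks))
        resampled = np.take_along_axis(scores, idx, axis=0)
        estimates.append(aggregate_fn(resampled))
    alpha = (1.0 - confidence) / 2.0
    return np.quantile(estimates, [alpha, 1.0 - alpha])

# Hypothetical example: 3 runs on the 26 Atari 100k games.
scores = np.random.default_rng(1).uniform(0.0, 1.5, size=(3, 26))
print(stratified_bootstrap_ci(scores))  # lower and upper bounds of the 95% CI
```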
Deep RL algorithms often perform better on some tasks and training runs than on others, but aggregate performance metrics can conceal this variability, as shown below.
Instead, we recommend performance profiles, which are commonly used for comparing solve times of optimization software. These profiles plot the score distribution across all runs and tasks, with uncertainty estimated using stratified bootstrap confidence bands: they show the fraction of runs across all tasks that obtain a score above a threshold (τ), as a function of that threshold.
Such profiles allow for qualitative comparisons at a glance. For example, if the curve for one algorithm lies above that of another, the first algorithm is better than the second. We can also read off any score percentile: e.g., the profiles intersect y = 0.5 (dotted line above) at the median score. Furthermore, the area under the profile corresponds to the mean score.
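As a rough illustration (again, not the RLiable implementation), the point estimate of a performance profile is simply the fraction of pooled (run, task) scores above each threshold τ; the confidence band around it would come from the same stratified bootstrap sketched above. The scores used here are hypothetical:

```python
import numpy as np

def performance_profile(scores, thresholds):
    """Fraction of runs, pooled across tasks, with normalized score above each threshold.

    Args:
      scores: array of shape (num_runs, num_tasks).
      thresholds: 1D array of score thresholds (tau).
    Returns:
      1D array, same length as thresholds, with values in [0, 1].
    """
    flat = scores.reshape(-1)  # pool all (run, task) scores
    return np.array([np.mean(flat > tau) for tau in thresholds])

# Hypothetical usage: read the median off the profile where it crosses 0.5.
scores = np.random.default_rng(2).uniform(0.0, 2.0, size=(5, 26))
taus = np.linspace(0.0, 2.0, 101)
profile = performance_profile(scores, taus)
approx_median = taus[np.argmin(np.abs(profile - 0.5))]
```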
While performance profiles are useful for qualitative comparisons, one algorithm rarely outperforms another on all tasks, so their profiles often intersect and finer quantitative comparisons require aggregate performance metrics. However, existing metrics have limitations: (1) a single high-performing task may dominate the task mean score, while (2) the task median is unaffected by zero scores on nearly half of the tasks and requires a large number of training runs to achieve small statistical uncertainty. To address these limitations, we recommend two alternatives based on robust statistics: the interquartile mean (IQM) and the optimality gap, both of which can be read as areas under the performance profile below.
As an alternative to the median and mean, the IQM corresponds to the mean score of the middle 50% of the runs combined across all tasks. It is more robust to outliers than the mean, a better indicator of overall performance than the median, and results in smaller CIs, so fewer runs are needed to claim improvements. As another alternative to the mean, the optimality gap measures how far an algorithm is from optimal performance.
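A minimal sketch of both metrics on pooled run × task scores is shown below; it assumes scores are normalized so that 1.0 is the target "optimal" performance, and RLiable's exact defaults may differ:

```python
import numpy as np
from scipy import stats

def interquartile_mean(scores):
    """Mean of the middle 50% of scores pooled across runs and tasks."""
    return stats.trim_mean(scores.reshape(-1), proportiontocut=0.25)

def optimality_gap(scores, gamma=1.0):
    """Average amount by which scores fall short of a target score gamma;
    scores above gamma are treated as having reached the target."""
    return gamma - np.mean(np.minimum(scores.reshape(-1), gamma))

# Hypothetical normalized scores: 10 runs on 26 tasks.
scores = np.random.default_rng(3).uniform(0.0, 2.0, size=(10, 26))
print(interquartile_mean(scores), optimality_gap(scores))
```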
For directly comparing two algorithms, another metric to consider is the average probability of improvement, which describes how likely it is that an algorithm improves over a baseline, regardless of the size of the improvement. This metric is computed using the Mann-Whitney U-statistic, averaged across tasks.
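As a minimal sketch using SciPy's Mann-Whitney U test (the run scores here are hypothetical): for each task, P(X > Y) is the U statistic divided by the number of run pairs, and the per-task probabilities are then averaged:

```python
import numpy as np
from scipy import stats

def average_probability_of_improvement(scores_x, scores_y):
    """Average over tasks of P(a run of algorithm X beats a run of algorithm Y).

    Args:
      scores_x: array of shape (num_runs_x, num_tasks).
      scores_y: array of shape (num_runs_y, num_tasks).
    """
    num_tasks = scores_x.shape[1]
    probs = []
    for task in range(num_tasks):
        x, y = scores_x[:, task], scores_y[:, task]
        # The U statistic counts run pairs where x beats y (ties count 1/2).
        u, _ = stats.mannwhitneyu(x, y, alternative="two-sided")
        probs.append(u / (len(x) * len(y)))
    return float(np.mean(probs))

# Hypothetical example: algorithm X vs. baseline Y, 5 runs each on 16 tasks.
rng = np.random.default_rng(4)
p = average_probability_of_improvement(rng.normal(1.1, 0.3, (5, 16)),
                                        rng.normal(1.0, 0.3, (5, 16)))
```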
Re-evaluating Evaluation
Using the above tools for evaluation, we revisit performance evaluations of existing algorithms on widely used RL benchmarks, revealing inconsistencies in prior evaluation. For example, in the Arcade Learning Environment (ALE), a widely recognized RL benchmark, the performance ranking of algorithms changes depending on the choice of aggregate metric. Since performance profiles capture the full picture, they often illustrate why such inconsistencies exist.
On DM Control, a popular continuous control benchmark, there are large overlaps in 95% CIs of mean normalized scores for most algorithms.
Finally, on Procgen, a benchmark for evaluating generalization in RL, the average probability of improvement shows that some claimed improvements are only 50-70% likely, suggesting that some reported improvements could be spurious.
Conclusion
Our findings on widely used deep RL benchmarks show that statistical issues can have a large influence on previously reported results. In this work, we take a fresh look at evaluation to improve the interpretation of reported results and standardize experimental reporting. We’d like to emphasize the importance of published papers providing results for all runs to allow for future statistical analyses. To build confidence in your results, please check out our open-source library RLiable and the quickstart colab.
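For reference, here is a minimal usage sketch along the lines of the library's quickstart; the score array is hypothetical and exact function names or arguments may differ across library versions:

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Hypothetical normalized scores: algorithm name -> (num_runs x num_tasks) array.
score_dict = {"MyAgent": np.random.default_rng(0).uniform(0.0, 2.0, size=(5, 26))}

# Point estimates and 95% stratified bootstrap CIs for IQM and optimality gap.
aggregate_fn = lambda scores: np.array([
    metrics.aggregate_iqm(scores),
    metrics.aggregate_optimality_gap(scores),
])
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_fn, reps=2000)
```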
Acknowledgments
This work was done in collaboration with Max Schwarzer, Aaron Courville and Marc G. Bellemare. We’d like to thank Tom Small for an animated figure used in this post. We are also grateful for feedback from several members of the Google Research, Brain Team and DeepMind.