Joshua Romoff

CA2: Code-Aware Agent for Automated Game Testing

Vincent Martineau

Automated game testing is important for verifying game functionality, but it remains a costly and time-consuming process. Manual testing oft… (voir plus)en misses edge cases, and current automated methods struggle to provide full code coverage. Prior work has explored reinforcement learning (RL) for game testing, but without leveraging internal code signals such as the call stack. We present Code Aware Agent (CA2), which uses call stack information to learn effective testing strategies. The agent receives the current function call trace along with the game state and learns to reach specific target functions. We instrument two types of environments, 1) State-based and 2) Image-based, with support for efficient call stack extraction. Through experimental evaluation, we find that CA2 achieves consistent improvement over the non-code aware baselines, which does not leverage call stack information. Our results show that incorporating code signals like the call stack enables more effective and targeted game testing.

2026-05-12

arXiv (prépublication)

doi.org

arxiv.org

Improving Intrinsic Exploration by Creating Stationary Objectives

Roger Creus Castanyer

Joshua Romoff

Glen Berseth

2024-01-15

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

Minimax Exploiter: A Data Efficient Approach for Competitive Self-Play

Gabriel Robert

Recent advances in Competitive Self-Play (CSP) have achieved, or even surpassed, human level performance in complex game environments such a… (voir plus)s Dota 2 and StarCraft II using Distributed Multi-Agent Reinforcement Learning (MARL). One core component of these methods relies on creating a pool of learning agents -- consisting of the Main Agent, past versions of this agent, and Exploiter Agents -- where Exploiter Agents learn counter-strategies to the Main Agents. A key drawback of these approaches is the large computational cost and physical time that is required to train the system, making them impractical to deploy in highly iterative real-life settings such as video game productions. In this paper, we propose the Minimax Exploiter, a game theoretic approach to exploiting Main Agents that leverages knowledge of its opponents, leading to significant increases in data efficiency. We validate our approach in a diversity of settings, including simple turn based games, the arcade learning environment, and For Honor, a modern video game. The Minimax Exploiter consistently outperforms strong baselines, demonstrating improved stability and data efficiency, leading to a robust CSP-MARL method that is both flexible and easy to deploy.

2023-12-31

AAMAS (publié)

doi.org

arxiv.org

Direct Behavior Specification via Constrained Reinforcement Learning

Christopher Pal

The standard formulation of Reinforcement Learning lacks a practical way of specifying what are admissible and forbidden behaviors. Most oft… (voir plus)en, practitioners go about the task of behavior specification by manually engineering the reward function, a counter-intuitive process that requires several iterations and is prone to reward hacking by the agent. In this work, we argue that constrained RL, which has almost exclusively been used for safe RL, also has the potential to significantly reduce the amount of work spent for reward specification in applied RL projects. To this end, we propose to specify behavioral preferences in the CMDP framework and to use Lagrangian methods to automatically weigh each of these behavioral constraints. Specifically, we investigate how CMDPs can be adapted to solve goal-based tasks while adhering to several constraints simultaneously. We evaluate this framework on a set of continuous control tasks relevant to the application of Reinforcement Learning for NPC design in video games.

2022-07-18

International Conference on Machine Learning (Accept for Short Presentation)

doi.org

proceedings.mlr.press

TDprop: Does Adaptive Optimization With Jacobi Preconditioning Help Temporal Difference Learning?

We investigate whether Jacobi preconditioning, accounting for the bootstrap term in temporal difference (TD) learning, can help boost perfor… (voir plus)mance of adaptive optimizers. Our method, TDprop, computes a per-parameter learning rate based on the diagonal preconditioning of the TD update rule. We show how this can be used in both n-step returns and TD(λ). Our theoretical findings demonstrate that including this additional preconditioning information is comparable to normal semi-gradient TD if the optimal learning rate is found for both via a hyperparameter search. This matches our experimental results. In Deep RL experiments using Expected SARSA, TDprop meets or exceeds the performance of Adam in all tested games under near-optimal learning rates, but a well-tuned SGD can yield similar performance in most settings. Our findings suggest that Jacobi preconditioning may improve upon Adam in Deep RL, but despite incorporating additional information from the TD bootstrap term, may not always be better than SGD. Moreover, they suggest that more theoretical investigations are needed to understand adaptive optimizers under optimal hyperparameter regimes in TD learning: simpler methods may, surprisingly, be theoretically comparable after a hyperparameter search.

2021-05-02

Adaptive Agents and Multi-Agent Systems (publié)

doi.org

Randomized Value Functions via Multiplicative Normalizing Flows

Randomized value functions offer a promising approach towards the challenge of efficient exploration in complex environments with high dimen… (voir plus)sional state and action spaces. Unlike traditional point estimate methods, randomized value functions maintain a posterior distribution over action-space values. This prevents the agent's behavior policy from prematurely exploiting early estimates and falling into local optima. In this work, we leverage recent advances in variational Bayesian neural networks and combine these with traditional Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) to achieve randomized value functions for high-dimensional domains. In particular, we augment DQN and DDPG with multiplicative normalizing flows in order to track a rich approximate posterior distribution over the parameters of the value function. This allows the agent to perform approximate Thompson sampling in a computationally efficient manner via stochastic gradient methods. We demonstrate the benefits of our approach through an empirical comparison in high dimensional environments.

2020-08-05

Conference on Uncertainty in Artificial Intelligence (publié)

doi.org

proceedings.mlr.press

TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?

We investigate whether Jacobi preconditioning, accounting for the bootstrap term in temporal difference (TD) learning, can help boost perfor… (voir plus)mance of adaptive optimizers. Our method, TDprop, computes a per parameter learning rate based on the diagonal preconditioning of the TD update rule. We show how this can be used in both

2019-12-31

arXiv (prépublication)

doi.org

arxiv.org

Separating value functions across time-scales

Emma Brunskill

Yann Ollivier

In many finite horizon episodic reinforcement learning (RL) settings, it is desirable to optimize for the undiscounted return - in settings … (voir plus)like Atari, for instance, the goal is to collect the most points while staying alive in the long run. Yet, it may be difficult (or even intractable) mathematically to learn with this target. As such, temporal discounting is often applied to optimize over a shorter effective planning horizon. This comes at the risk of potentially biasing the optimization target away from the undiscounted goal. In settings where this bias is unacceptable - where the system must optimize for longer horizons at higher discounts - the target of the value function approximator may increase in variance leading to difficulties in learning. We present an extension of temporal difference (TD) learning, which we call TD(

2019-02-04

ArXiv (prépublication)

doi.org

proceedings.mlr.press

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Joshua Romoff

Publications

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Mots-clés populaires:

Joshua Romoff

Publications