Amir-massoud Farahmand

Igor Gilitschenski

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcem… (see more)ent learning. The MuZero loss, which penalizes a model’s value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

Calibrated Value-Aware Model Learning with Probabilistic Environment Models

Claas Voelcker

Anastasiia Pedan

Arash Ahmadian

Igor Gilitschenski

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcem… (see more)ent learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics

Tyler Kastner

Mark Rowland

Yunhao Tang

Murat A Erdogdu

We study the problem of distributional reinforcement learning using categorical parametrisations and a KL divergence loss. Previous work ana… (see more)lyzing categorical distributional RL has done so using a Cramér distance-based loss, simplifying the analysis but creating a theory-practice gap. We introduce a preconditioned version of the algorithm, and prove that it is guaranteed to converge. We further derive the asymptotic variance of the categorical estimates under different learning rate regimes, and compare to that of classical reinforcement learning. We finally empirically validate our theoretical results and perform an empirical investigation into the relative strengths of using KL losses, and derive a number of actionable insights for practitioners.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics

Tyler Kastner

Mark Rowland

Yunhao Tang

Murat A Erdogdu

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

Avery Ma

Yangchen Pan

Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this… (see more), the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt’s topic. We also introduce ManyHarm, a dataset of harmful question–answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

Avery Ma

Yangchen Pan

Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences… (see more). To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

Relative Entropy Pathwise Policy Optimization

Claas Voelcker

Axel Brunnbauer

Marcel Hussing

Michal Nauman

Pieter Abbeel

Eric R. Eaton

Radu Grosu

Igor Gilitschenski

Score-function policy gradients have delivered strong results in game-playing, robotics and language-model fine-tuning. Yet its high-varianc… (see more)e often undermines training stability. On the other hand, pathwise policy gradients alleviate the training variance, but are reliable only when driven by an accurate action-conditioned value function which is notoriously hard to train without relying on past off-policy data. In this paper, we discuss how to construct a value-gradient driven, on-policy algorithm that allow training Q-value models purely from on-policy data, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to balance stochastic policies for exploration with constrained policy updates for stable training, and evaluate important architectural components that facilitate accurate value function learning. Building on these insights, we propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. We demonstrate that REPPO provides strong empirical performance at decreased sample requirements, wall-clock time, memory footprint as well as high hyperparameter robustness in a set of experiments on two standard GPU-parallelized benchmarks.

2025-07-15

ArXiv (preprint)

Relative Entropy Pathwise Policy Optimization

Claas Voelcker

Axel Brunnbauer

Marcel Hussing

Michal Nauman

Pieter Abbeel

Eric R. Eaton

Radu Grosu

Igor Gilitschenski

Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet… (see more) their high variance often undermines training stability. Using pathwise policy gradients, i.e. computing a derivative by differentiating the objective function, alleviates the variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that combines the stability of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to state-of-the-art on two standard GPU-parallelized benchmarks, REPPO provides strong empirical performance at superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.

2025-07-15

ArXiv (preprint)

Calibrated Value-Aware Model Learning with Stochastic Environment Models

Claas Voelcker

Anastasiia Pedan

Arash Ahmadian

Igor Gilitschenski

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcem… (see more)ent learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.

2025-05-28

ArXiv (preprint)

Calibrated Value-Aware Model Learning with Probabilistic Environment Models

Claas Voelcker

Anastasiia Pedan

Arash Ahmadian

Igor Gilitschenski

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcem… (see more)ent learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.

2025-05-28

ArXiv (preprint)

A Truncated Newton Method for Optimal Transport

Mete Kemertas

Allan D. Jepson

2025-04-02

ArXiv (preprint)

A Truncated Newton Method for Optimal Transport

Mete Kemertas

Allan D. Jepson

2025-04-02

ArXiv (preprint)