Vineet Jain

From Static Policies to Adaptive Priors in Offline Reinforcement Learning

Offline reinforcement learning (RL) has traditionally focused on learning policies for direct deployment under conservative objectives, where uncertainty outside the offline dataset is treated pessimistically to ensure robustness. We argue that this formulation becomes incomplete when an offline-trained policy is subsequently updated through online interaction, as increasingly occurs in modern intelligent systems through test-time adaptation and online fine-tuning. This position paper argues that, in such settings, the objective of offline RL should extend beyond immediate deployment and instead prioritize learning *adaptive policy priors*: policies that preserve the capacity to improve during subsequent interaction through memory, exploration, and self-correction. We formalize this perspective as *adaptive offline reinforcement learning* (AORL), distinguish it from offline-to-online RL, and explain why adaptability becomes important under distributional shift, limited dataset coverage, and changing test-time conditions. We further discuss Bayesian offline RL as one principled direction for constructing adaptive policy priors by preserving epistemic uncertainty over plausible environments. Finally, we outline connections, open challenges, and research directions for treating offline RL as preparation for future experience rather than as a static deployment problem.

2026-05-24

DEMO @ International Conference on Machine Learning (poster)

openreview.net

A Comedy of Estimators: On KL Regularization in RL Training of LLMs

Vedant Shah

Johan Obando-Ceron

Vineet Jain

Brian Bartoldson

Bhavya Kailkhura

Nikolay Malkin

The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite its wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimator configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning Qwen2.5-7B, Llama-3.1-8B-Instruct and Qwen3-4B-Instruct-2507 with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) using estimator configurations resulting in unbiased gradients leads to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance resulting from different KL configurations in off-policy settings and observe that KL regularization can help stabilize off-policy RL training resulting from asynchronous setups.
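As a rough illustration of what "estimating the reverse KL from on-policy samples" can look like, the sketch below compares three sample-based estimators that are commonly referred to as k1, k2 and k3. The abstract does not say which estimators or configurations the paper studies, so the estimator choice, the function names and the toy categorical distributions here are all assumptions made for illustration. The sketch also only estimates the value of the divergence; it does not address the gradient bias that the paper analyzes.

```python
import math
import random


def exact_reverse_kl(pi, ref):
    """Exact KL(pi || ref) for two categorical distributions."""
    return sum(p * math.log(p / q) for p, q in zip(pi, ref) if p > 0)


def kl_estimates(pi, ref, n_samples, seed=0):
    """Monte Carlo estimates of KL(pi || ref) from samples x ~ pi.

    With r = ref(x) / pi(x):
      k1 = -log r            (unbiased, can be negative per sample)
      k2 = 0.5 * (log r)^2   (biased, non-negative)
      k3 = (r - 1) - log r   (unbiased, non-negative)
    These names are a common convention, not taken from the abstract.
    """
    rng = random.Random(seed)
    samples = rng.choices(range(len(pi)), weights=pi, k=n_samples)
    k1 = k2 = k3 = 0.0
    for x in samples:
        log_r = math.log(ref[x]) - math.log(pi[x])
        k1 += -log_r
        k2 += 0.5 * log_r ** 2
        k3 += math.expm1(log_r) - log_r  # expm1(log r) == r - 1
    return {"k1": k1 / n_samples, "k2": k2 / n_samples, "k3": k3 / n_samples}


if __name__ == "__main__":
    # Hypothetical toy policies over three actions.
    pi = [0.5, 0.3, 0.2]
    ref = [0.4, 0.4, 0.2]
    print("exact:", exact_reverse_kl(pi, ref))
    print("estimates:", kl_estimates(pi, ref, n_samples=50_000))
```

On a toy problem like this the exact divergence is available for comparison; for LLM policies it is not, which is why sample-based estimators are used in the first place.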

2025-12-25

arXiv (preprint)

doi.org

arxiv.org

Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

2025-12-03

arXiv (preprint)

doi.org

openreview.net

Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models

Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling. Existing steering methods suffer from inaccurate value estimation, especially at high noise levels, which biases guidance. Moreover, information from past runs is not reused to improve sample quality, resulting in inefficient use of compute. Inspired by the success of Monte Carlo Tree Search, we address these limitations by casting inference-time alignment as a search problem that reuses past computations. We introduce a tree-based approach that samples from the reward-aligned target density by propagating terminal rewards back through the diffusion chain and iteratively refining value estimates with each additional generation. Our proposed method, Diffusion Tree Sampling (DTS), produces asymptotically exact samples from the target distribution in the limit of infinite rollouts, and its greedy variant, Diffusion Tree Search (DTS

2025-12-02

Neural Information Processing Systems (Accept (poster))

doi.org

openreview.net

Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models

Johan Samir Obando Ceron

Yoshua Bengio

Brian R. Bartoldson

Bhavya Kailkhura

Guillaume Lajoie

Glen Berseth

Nikolay Malkin

Moksh J. Jain

Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. RSA exploits the rich information embedded in the reasoning chains -- not just the final answers -- and enables bootstrapping from partially correct intermediate steps within different chains of thought. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), while outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further demonstrate that training the model to combine solutions via a novel aggregation-aware reinforcement learning approach yields significant performance gains. Code available at https://github.com/HyperPotatoNeo/RSA.
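The population-refinement loop described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation (see the linked repository for that). The function names, the uniform random choice of subsets, the fixed population size and the toy median aggregator are all assumptions. In the method as described, the aggregation step would be an LLM call that reads several candidate reasoning chains and writes an improved one.

```python
import random
import statistics
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def recursive_self_aggregation(
    population: Sequence[T],
    aggregate: Callable[[List[T]], T],
    subset_size: int,
    num_steps: int,
    seed: int = 0,
) -> List[T]:
    """Repeatedly replace the population with aggregated candidates.

    Each step draws one random subset per population slot (an assumption)
    and aggregates it, so the population size stays constant.
    """
    rng = random.Random(seed)
    pool = list(population)
    for _ in range(num_steps):
        pool = [aggregate(rng.sample(pool, subset_size)) for _ in range(len(pool))]
    return pool


def toy_aggregate(candidates: List[float]) -> float:
    """Hypothetical stand-in for the LLM aggregation call: take the median."""
    return statistics.median(candidates)


if __name__ == "__main__":
    # Toy "candidate solutions": noisy numeric guesses instead of reasoning chains.
    initial = [3.0, 9.5, 4.2, 4.0, 3.9, 12.0, 4.1, 4.3]
    refined = recursive_self_aggregation(initial, toy_aggregate, subset_size=3, num_steps=4)
    print(refined)
```

The parallel dimension is the population size and the sequential dimension is the number of steps, which is how the loop combines both kinds of scaling.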

2025-09-29

arXiv (preprint)

doi.org

arxiv.org

Sampling from Energy-based Policies using Diffusion

Vineet Jain

Tara Akhound-Sadegh

Siamak Ravanbakhsh

2025-05-08

rl-conference.cc/RLC/2025/Conference (accepted)

doi.org

openreview.net

Learning to Reach Goals via Diffusion

Vineet Jain

Siamak Ravanbakhsh

We present a novel perspective on goal-conditioned reinforcement learning by framing it within the context of denoising diffusion models. Analogous to the diffusion process, where Gaussian noise is used to create random trajectories that walk away from the data manifold, we construct trajectories that move away from potential goal states. We then learn a goal-conditioned policy to reverse these deviations, analogous to the score function. This approach, which we call Merlin, can reach specified goals from arbitrary initial states without learning a separate value function. In contrast to recent works utilizing diffusion models in offline RL, Merlin stands out as the first method to perform diffusion in the state space, requiring only one "denoising" iteration per environment step. We experimentally validate our approach in various offline goal-reaching tasks, demonstrating substantial performance enhancements compared to state-of-the-art methods while improving computational efficiency over other diffusion-based RL methods by an order of magnitude. Our results suggest that this perspective on diffusion for RL is a simple and scalable approach for sequential decision making.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

On Diffusion Modeling for Anomaly Detection

Known for their impressive performance in generative modeling, diffusion models are attractive candidates for density-based anomaly detection. This paper investigates different variations of diffusion modeling for unsupervised and semi-supervised anomaly detection. In particular, we find that Denoising Diffusion Probabilistic Models (DDPM) are performant on anomaly detection benchmarks yet computationally expensive. By simplifying DDPM in application to anomaly detection, we are naturally led to an alternative approach called Diffusion Time Estimation (DTE). DTE estimates the distribution over diffusion time for a given input and uses the mode or mean of this distribution as the anomaly score. We derive an analytical form for this density and leverage a deep neural network to improve inference efficiency. Through empirical evaluations on the ADBench benchmark, we demonstrate that all diffusion-based anomaly detection methods perform competitively for both semi-supervised and unsupervised settings. Notably, DTE achieves orders of magnitude faster inference time than DDPM, while outperforming it on this benchmark. These results establish diffusion-based anomaly detection as a scalable alternative to traditional methods and recent deep-learning techniques for standard unsupervised and semi-supervised anomaly detection settings.

2024-05-06

International Conference on Learning Representations (Accept (spotlight))

doi.org

openreview.net

EqR: Equivariant Representations for Data-Efficient Reinforcement Learning

Arnab Kumar Mondal

Vineet Jain

Kaleem Siddiqi

Siamak Ravanbakhsh

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

proceedings.mlr.press