Moksh Jain

Navigating ternary doping in Li-ion cathodes with closed-loop multi-objective Bayesian optimization

Nooshin Zeinali Galabi

Cheng-Hao Liu

Marc Kamel

Shipeng Jia

Eric McCalla

To further improve secondary battery materials, we are increasingly exploring highly complex composition spaces in attempts to optimize mult… (see more)iple properties simultaneously. While our past work has done this in systematic manners using high-throughput experimentation, the exponential increase in the search space with triple doping makes grid search prohibitively expensive. Here, we demonstrate a closed-loop, multi-objective machine learning approach to guide the high-throughput workflow to efficiently navigate a space with approximately 14 million unique combinations. The test system is LiCoPO4 which we have previously explored using systematic codoping that was effective in optimizing one property only: energy density. To learn multiple electrochemical metrics, we first pretrain a set transformer on the public Materials Project database as a feature extractor, then attach a multi-task Gaussian process head and finetune the entire model on our high-throughput data. Through 3 rounds of active learning, we demonstrate that with a very small number of samples (as few as 125 random compositions and 63 predicted) we are able to simultaneously optimize four key electrochemical properties. Relative to the undoped system, the best composition raises our composite figure of merit by up to five times. This establishes an end-to-end workflow for accelerated battery materials design to be used in the rapidly growing field of autonomous materials discovery.

2026-02-11

Advances in Materials (published)

doi.org

A Comedy of Estimators: On KL Regularization in RL Training of LLMs

Vedant Shah

Johan Obando-Ceron

Vineet Jain

Brian Bartoldson

Bhavya Kailkhura

Nikolay Malkin

The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). T… (see more)he RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite its wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimators configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning \texttt{Qwen2.5-7B}, \texttt{Llama-3.1-8B-Instruct} and \texttt{Qwen3-4B-Instruct-2507} with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) using estimator configurations resulting in unbiased gradients leads to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance resulting from different KL configurations in off-policy settings and observe that KL regularization can help stabilize off-policy RL training resulting from asynchronous setups.

2025-12-25

ArXiv (preprint)

doi.org

arxiv.org

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Brian Bartoldson

James Diffenderfer

Tal Ben-Nun

Johan Obando-Ceron

Bhavya Kailkhura

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, on-policy algorithms used for post… (see more)-training are not naturally robust to a diversified content of experience replay buffers, which asynchronous off-policy actors can efficiently populate in parallel to training. We propose efficiently learning on such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy TB objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B, finding TBA offers speed and performance boosts over strong baselines like Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups (

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Solving Bayesian Inverse Problems with Diffusion Priors and Off-Policy RL

Laurence Perreault-Levasseur

Yoshua Bengio

Glen Berseth

Nikolay Malkin

This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (R… (see more)L) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision, and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.

2025-03-05

ICLR.cc/2025/Workshop/DeLTa (poster)

doi.org

openreview.net

Action Abstractions for Amortized Sampling

Lena Nehale Ezzine

Nikolay Malkin

As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignm… (see more)ent and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which take many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and `chunking' them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency performance in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.

2025-01-21

International Conference on Learning Representations (poster)

doi.org

openreview.net

Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning

Juho Lee

Sung Ju Hwang

Nikolay Malkin

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of lar… (see more)ge language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

2025-01-21

International Conference on Learning Representations (poster)

doi.org

openreview.net

Path-filtering in path-integral simulations of open quantum systems using GFlowNets

Jeremy Lackman-Mincoff

Moksh Jain

Nikolay Malkin

Yoshua Bengio

Lena Simine

An important class of methods for modeling dynamics in open quantum systems is based on the well-known influence functional (IF) approach to… (see more) solving path-integral equations of motion. Within this paradigm, path-filtering schemes based on the removal of IF elements that fall below a certain threshold aim to reduce the effort needed to calculate and store the influence functional, making very challenging simulations possible. A filtering protocol of this type is considered acceptable as long as the simulation remains mathematically stable. This, however, does not guarantee that the approximated dynamics preserve the physics of the simulated process. In this paper, we explore the possibility of training Generative Flow Networks (GFlowNets) to produce filtering protocols while optimizing for mathematical stability and for physical accuracy. Trained using the trajectory balance objective, the model produces sets of paths to be added to a truncated initial set; it is rewarded if the combined set of paths gives rise to solutions in which the trace of the density matrix is conserved, the populations remain real, and the dynamics approach the exact reference. Using a simple two-level system coupled to a dissipative reservoir, we perform proof-of-concept simulations and demonstrate the elegant and surprising filtering solutions proposed by the GFlowNet.

2024-10-07

Journal of Chemical Physics (published)

doi.org

Amortizing intractable inference in diffusion models for vision, language, and control

Nikolay Malkin

Diffusion models have emerged as effective distribution estimators in vision, language, and reinforcement learning, but their use as priors … (see more)in downstream tasks poses an intractable posterior inference problem. This paper studies amortized sampling of the posterior over data,

2024-09-24

Neural Information Processing Systems (poster)

doi.org

openreview.net

Towards DNA-Encoded Library Generation with GFlowNets

Michał Koziarski

Mohammed Abukalam

Vedant Shah

Louis Vaillancourt

Doris Alexandra Schuetz

Moksh Jain

Almer Van Der Sloot

Mathieu Bourgey

Anne Marinier

Yoshua Bengio

2024-03-03

GEM @ International Conference on Learning Representations (poster)

doi.org

openreview.net

Amortizing Intractable Inference in Large Language Models

Edward J. Hu

Nikolay Malkin

Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This l… (see more)imits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.

2024-01-15

ICLR.cc/2024/Conference (oral)

doi.org

openreview.net

PhyloGFN: Phylogenetic Inference with Generative Flow Networks

Mingyang Zhou

Nikolay Malkin

Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history a… (see more)nd numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.

2024-01-15

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

Pre-Training and Fine-Tuning Generative Flow Networks

Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects fr… (see more)om a given unnormalized reward distribution. They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, as they are typically trained from a given extrinsic reward function, it remains an important open challenge about how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks. Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. By framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcomes, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model can allow for a direct extraction of a policy capable of sampling from any new reward functions in downstream tasks. Nonetheless, adapting OC-GFN on a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor enabling efficient fine-tuning. Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN, and its ability to swiftly adapt to downstream tasks and discover modes more efficiently. This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.

2024-01-15

ICLR.cc/2024/Conference (spotlight)

doi.org

openreview.net

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Publications

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Popular keywords:

Moksh Jain

Publications