Minsu Kim

Generative Recursive Reasoning Models

Mingyu Jo

We introduce Generative Recursive reAsoning Models (GRAM), a recursion-based generative model that is effective for complex planning and reasoning problems. GRAM reformulates recent latent recursive architectures as a stochastic generative process with probabilistic latent transitions, enabling efficient and stable computation entirely in latent space without relying on token-level sequences as in chain-of-thought (CoT) prompting. We optimize this generative recursion via amortized variational inference, allowing the model to represent and explore multiple plausible latent trajectories conditioned on the input. This formulation supports both conditional reasoning through

2026-03-04

RSI @ International Conference on Learning Representations (poster)

openreview.net

Synthesizable Molecular Generation via Soft-constrained GFlowNets with Rich Chemical Priors

Louis Vaillancourt

Yves V. Brun

Yoshua Bengio

Alex Hernández-García

2026-02-03

arXiv (preprint)

doi.org

openreview.net

Improved Off-policy Reinforcement Learning in Biological Sequence Design

Alex Hernández-García

Jinkyoo Park

Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search,

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung YUN

Pierre-Luc St-Charles

Jinkyoo Park

Yoshua Bengio

Minsu Kim

2025-09-25

arXiv (preprint)

doi.org

arxiv.org

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Brian Bartoldson

James Diffenderfer

Tal Ben-Nun

Johan Obando-Ceron

Bhavya Kailkhura

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, on-policy algorithms used for post-training are not naturally robust to a diversified content of experience replay buffers, which asynchronous off-policy actors can efficiently populate in parallel to training. We propose efficiently learning on such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy TB objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B, finding TBA offers speed and performance boosts over strong baselines like Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups (

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Adaptive Inference-Time Scaling via Cyclic Diffusion Search

Gyubin Lee

Truong Nhat Nguyen Bao

Jaesik Yoon

Dongwoo Lee

Minsu Kim

Yoshua Bengio

Sungjin Ahn

Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. However, most inference-time scaling methods rely on fixed denoising schedules, limiting their ability to adaptively allocate computation based on instance difficulty or task-specific demands. We introduce the challenge of adaptive inference-time scaling (dynamically adjusting computational effort during inference) and propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination. It comprises three components: Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time. Experiments show that ABCD improves performance across diverse tasks while maintaining computational efficiency.

2025-05-19

arXiv (preprint)

doi.org

openreview.net

Self-Evolving Curriculum for LLM Reasoning

Nicolas Gontier

Ehsan Kamalloo

2025-05-19

arXiv (preprint)

doi.org

arxiv.org

Latent Veracity Inference for Identifying Errors in Stepwise Reasoning

Minsu Kim

Jean-Pierre R. Falet

Oliver E. Richardson

Xiaoyin Chen

Moksh J. Jain

Sungjin Ahn

Sungsoo Ahn

Yoshua Bengio

Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.

2025-05-16

arXiv (preprint)

doi.org

arxiv.org

Outsourced Diffusion Sampling: Efficient Posterior Inference in Latent Spaces of Generative Models

Nikolay Malkin

Any well-behaved generative model over a variable …

2025-04-30

International Conference on Machine Learning (poster)

doi.org

proceedings.mlr.press

Adaptive Teachers for Amortized Samplers

Sungsoo Ahn

Jinkyoo Park

Nikolay Malkin

Yoshua Bengio

Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnormalized density where exact sampling is intractable. When sampling is implemented as a sequential decision-making process, reinforcement learning (RL) methods, such as generative flow networks, can be used to train the sampling policy. Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration. We propose to use an adaptive training distribution (the teacher) to guide the training of the primary amortized sampler (the student). The teacher, an auxiliary behavior model, is trained to sample high-loss regions of the student and can generalize across unexplored modes, thereby enhancing mode coverage by providing an efficient training curriculum. We validate the effectiveness of this approach in a synthetic environment designed to present an exploration challenge, two diffusion-based sampling tasks, and four biochemical discovery tasks, demonstrating its ability to improve sample efficiency and mode coverage. Source code is available at https://github.com/alstn12088/adaptive-teacher.

2025-04-22

International Conference on Learning Representations (poster)

doi.org

openreview.net

Offline Model-Based Optimization: Comprehensive Review

Jiayao Gu

Zixuan Liu

Can Chen

2025-03-20

arXiv (preprint)

doi.org

arxiv.org

Solving Bayesian Inverse Problems with Diffusion Priors and Off-Policy RL

Laurence Perreault-Levasseur

Yoshua Bengio

Glen Berseth

Nikolay Malkin

This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.

2025-03-05

ICLR.cc/2025/Workshop/DeLTa (poster)

doi.org

openreview.net
