Minsu Kim

Improved Off-policy Reinforcement Learning in Biological Sequence Design

Jinkyoo Park

Designing biological sequences with desired properties is a significant challenge due to the combinatorially vast search space and the high cost of evaluating each candidate sequence. To address these challenges, reinforcement learning (RL) methods, such as GFlowNets, utilize proxy models for rapid reward evaluation and annotated data for policy training. Although these approaches have shown promise in generating diverse and novel sequences, the limited training data relative to the vast search space often leads to the misspecification of proxy for out-of-distribution inputs. We introduce

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

arxiv.org

Improved Off-policy Reinforcement Learning in Biological Sequence Design

Alex Hernandez-Garcia

Jinkyoo Park

Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search,

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Outsourced Diffusion Sampling: Efficient Posterior Inference in Latent Spaces of Generative Models

Any well-behaved generative model over a variable …

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models

Any well-behaved generative model over a variable …

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

openreview.net

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung YUN

Pierre-Luc St-Charles

Jinkyoo Park

Yoshua Bengio

Minsu Kim

We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods -- including GFlowNets, PPO, and REINFORCE -- by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than

2025-09-26

ArXiv (preprint)

arxiv.org

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung YUN

Pierre-Luc St-Charles

Jinkyoo Park

Yoshua Bengio

Minsu Kim

2025-09-26

ArXiv (preprint)

doi.org

arxiv.org

Adaptive Cyclic Diffusion for Inference Scaling

Gyubin Lee

Truong Nhat Nguyen Bao

Jaesik Yoon

Dongwoo Lee

Minsu Kim

Yoshua Bengio

Sungjin Ahn

2025-05-20

ArXiv (preprint)

arxiv.org

Adaptive Inference-Time Scaling via Cyclic Diffusion Search

Gyubin Lee

Truong Nhat Nguyen Bao

Jaesik Yoon

Dongwoo Lee

Minsu Kim

Yoshua Bengio

Sungjin Ahn

Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. However, most inference-time scaling methods rely on fixed denoising schedules, limiting their ability to adaptively allocate computation based on instance difficulty or task-specific demands. We introduce the challenge of adaptive inference-time scaling (dynamically adjusting computational effort during inference) and propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination. It comprises three components: Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time. Experiments show that ABCD improves performance across diverse tasks while maintaining computational efficiency.

2025-05-20

ArXiv (preprint)

arxiv.org

Self-Evolving Curriculum for LLM Reasoning

Nicolas Gontier

Ehsan Kamalloo

Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains (planning, inductive reasoning, and mathematics), our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.

2025-05-20

ArXiv (preprint)

doi.org

arxiv.org

Self-Evolving Curriculum for LLM Reasoning

Alex Piché

Nicolas Gontier