Dinghuai Zhang

Self-Evolving Curriculum for LLM Reasoning

Nicolas Gontier

Ehsan Kamalloo

Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains (planning, inductive reasoning, and mathematics), our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.

2025-05-20

ArXiv (preprint)
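
The curriculum policy described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `CurriculumBandit`, the softmax selection rule, the step size `alpha`, and the temperature are all assumptions made for illustration. The abstract only states that each category is a bandit arm, that the reward is the absolute advantage, and that the policy is updated with TD(0).

```python
import math
import random


class CurriculumBandit:
    """Toy non-stationary multi-armed bandit over problem categories (hypothetical)."""

    def __init__(self, categories, alpha=0.5, temperature=1.0, seed=0):
        self.q = {c: 0.0 for c in categories}  # estimated learning gain per arm
        self.alpha = alpha                      # TD(0) step size (assumed value)
        self.temperature = temperature          # softmax temperature (assumed value)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over the arm values; this selection rule is an assumption.
        top = max(self.q.values())
        weights = {c: math.exp((v - top) / self.temperature) for c, v in self.q.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def select(self):
        # Sample one category according to the current softmax probabilities.
        probs = self.probabilities()
        cats = list(probs)
        return self.rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

    def update(self, category, advantages):
        # Reward: mean absolute advantage of the rollouts drawn from this category.
        reward = sum(abs(a) for a in advantages) / len(advantages)
        # TD(0) for a bandit (no successor state): Q <- Q + alpha * (r - Q).
        self.q[category] += self.alpha * (reward - self.q[category])
        return reward


bandit = CurriculumBandit(["easy", "medium", "hard"])
# Rollouts that all succeed (or all fail) have zero advantage and carry no signal.
bandit.update("easy", [0.0, 0.0, 0.0, 0.0])
# Mixed outcomes give a large absolute advantage, so the arm's value rises.
bandit.update("medium", [0.5, -0.5, 0.5, -0.5])
next_category = bandit.select()
```

The constant step size keeps recent rewards weighted more heavily than old ones, which is what lets the arm values track a learner whose competence changes during training.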

Self-Evolving Curriculum for LLM Reasoning

Alex Piché

Nicolas Gontier

Ehsan Kamalloo

2025-05-20

ArXiv (preprint)

Self-Evolving Curriculum for LLM Reasoning

Nicolas Gontier

Ehsan Kamalloo

2025-05-01

arXiv (published)

Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

Tim Z. Xiao

Weiyang Liu

While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models on some reward functions that are either designed by experts or learned from small-scale datasets. Existing methods for finetuning diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample with the unnormalized density of a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

Tim Z. Xiao

Weiyang Liu

While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. In response to this challenge, we take inspiration from recent successes in generative flow networks (GFlowNets) and propose a reinforcement learning method for diffusion model finetuning, dubbed Nabla-GFlowNet (abbreviated as

2024-12-10

ArXiv (preprint)

Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

Tim Z. Xiao

Weiyang Liu

While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models on some reward functions that are either designed by experts or learned from small-scale datasets. Existing methods for finetuning diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample with the unnormalized density of a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as

2024-12-10

ArXiv (preprint)

EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow Matching and Co-Evolutionary Dynamics

Yang Liu

Odin Zhang

Kevin K Yang

Shuangjia Zheng

2024-10-13

NeurIPS.cc/2024/Workshop/AIDrugX (poster)

openreview.net

Baking Symmetry into GFlowNets

George Ma

Emmanuel Bengio

GFlowNets have exhibited promising performance in generating diverse candidates with high rewards. These networks generate objects incrementally and aim to learn a policy that assigns probability of sampling objects in proportion to rewards. However, the current training pipelines of GFlowNets do not consider the presence of isomorphic actions, which are actions resulting in symmetric or isomorphic states. This lack of symmetry increases the number of samples required for training GFlowNets and can result in inefficient and potentially incorrect flow functions. As a consequence, the reward and diversity of the generated objects decrease. In this study, our objective is to integrate symmetries into GFlowNets by identifying equivalent actions during the generation process. Experimental results using synthetic data demonstrate the promising performance of our proposed approaches.

2024-06-08

ArXiv (preprint)
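
The notion of isomorphic actions in the abstract above can be made concrete with a toy sketch. It is not the paper's method: the function names are hypothetical, and the sorted degree sequence is used as a stand-in for a real isomorphism test. Degree sequences are only a necessary condition for isomorphism, which happens to suffice for the four-node trees in this example.

```python
from collections import defaultdict


def degree_signature(edges, n_nodes):
    """Sorted degree sequence: a cheap graph invariant (assumed adequate for this toy)."""
    deg = [0] * n_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return tuple(sorted(deg))


def group_equivalent_actions(edges, n_nodes):
    """Group 'attach a new node to node u' actions by the signature of the resulting graph."""
    groups = defaultdict(list)
    new_node = n_nodes
    for u in range(n_nodes):
        child_edges = edges + [(u, new_node)]
        groups[degree_signature(child_edges, n_nodes + 1)].append(u)
    return dict(groups)


# Path graph 0 - 1 - 2: attaching to either end yields the same graph up to relabeling.
path = [(0, 1), (1, 2)]
groups = group_equivalent_actions(path, 3)

# Under a uniform policy over the three raw actions, the probability of reaching
# each class of next states is the sum over its equivalent actions.
class_probs = {sig: len(actions) / 3 for sig, actions in groups.items()}
```

Here attaching to node 0 or node 2 both produce a four-node path, so those two actions fall in one group, while attaching to node 1 produces a star. Treating the first two as unrelated actions would understate the probability of reaching the path.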