Publications

Accelerating Training with Neuron Interaction and Nowcasting Networks

Boris Knyazev

Abhinav Moudgil

Guillaume Lajoie

Eugene Belilovsky

Simon Lacoste-Julien

Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. A simpler recently proposed approach to accelerate training is to use Adam for most of the optimization steps and periodically, only every few steps, nowcast (predict future) parameters. We improve this approach with Neuron interaction and Nowcasting (NiNo) networks. NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters by learning in a supervised way from a set of training trajectories over multiple tasks. We show that in some networks, such as Transformers, neuron connectivity is non-trivial. By accurately modeling neuron connectivity, we allow NiNo to accelerate Adam training by up to 50% in vision and language tasks.

2025-01-22

ICLR.cc/2025/Conference (poster)

Action abstractions for amortized sampling

Oussama Boussif

Lena Nehale Ezzine

Joseph D Viviano

Michał Koziarski

Moksh J. Jain

Nikolay Malkin

Emmanuel Bengio

Rim Assouel

2025-01-22

ICLR.cc/2025/Conference (poster)

AdaFisher: Adaptive Second Order Optimization via Fisher Information

Damien Martins Gomes

Yanlei Zhang

Eugene Belilovsky

Guy Wolf

Mahdi S. Hosseini

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by applying diagonal matrix preconditioning to the stochastic gradient during training. Despite the widespread use of first-order methods, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts, e.g. Adam and SGD. However, their practicality in training DNNs is still limited due to increased per-iteration computations and suboptimal accuracy compared to first-order methods. We present AdaFisher, an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in a second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification and language modelling, and stands out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code is available at https://github.com/AtlasAnalyticsLab/AdaFisher

2025-01-22

ICLR.cc/2025/Conference (poster)

Adaptive teachers for amortized samplers

Minsu Kim

Sanghyeok Choi

Taeyoung Yun

Emmanuel Bengio

Leo Feng

Jarrid Rector-Brooks

Sungsoo Ahn

Jinkyoo Park

Nikolay Malkin

Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnormalized density where exact sampling is intractable. When sampling is implemented as a sequential decision-making process, reinforcement learning (RL) methods, such as generative flow networks, can be used to train the sampling policy. Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration. We propose to use an adaptive training distribution (the Teacher) to guide the training of the primary amortized sampler (the Student) by prioritizing high-loss regions. The Teacher, an auxiliary behavior model, is trained to sample high-error regions of the Student and can generalize across unexplored modes, thereby enhancing mode coverage by providing an efficient training curriculum. We validate the effectiveness of this approach in a synthetic environment designed to present an exploration challenge, two diffusion-based sampling tasks, and four biochemical discovery tasks, demonstrating its ability to improve sample efficiency and mode coverage.

2025-01-22

ICLR.cc/2025/Conference (poster)

Advantage Alignment Algorithms

Juan Agustin Duque

Milad Aghajohari

Tim Cooijmans

Razvan Ciuca

Tianyu Zhang

Gauthier Gidel

Aaron Courville

2025-01-22

ICLR.cc/2025/Conference (oral)

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang

Jinyu Xiang

Zhaoyang Yu

Fengwei Teng

Xiong-Hui Chen

Jiaqi Chen

Mingchen Zhuge

Xin Cheng

Sirui Hong

Jinlin Wang

Bingnan Zheng

Bang Liu

Yuyu Luo

Chenglin Wu

Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFLOW, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFLOW's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFLOW enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code is available at https://github.com/geekan/MetaGPT.

2025-01-22

ICLR.cc/2025/Conference (oral)

Ant Colony Sampling with GFlowNets for Combinatorial Optimization

Minsu Kim

Sanghyeok Choi

Jiwoo Son

Hyeonah Kim

Jinkyoo Park

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

AssembleFlow: Rigid Flow Matching with Inertial Frames for Molecular Assembly

Hongyu Guo

Shengchao Liu

Molecular assembly, where a cluster of rigid molecules aggregates into strongly correlated forms, is fundamental to determining the properties of materials. However, traditional numerical methods for simulating this process are computationally expensive, and existing generative models for material generation overlook the rigidity inherent in molecular structures, leading to unwanted distortions and invalid internal structures in molecules. To address this, we introduce AssembleFlow. AssembleFlow leverages inertial frames to establish reference coordinate systems at the molecular level for tracking the orientation and motion of molecules within the cluster. It further decomposes molecular

2025-01-22

ICLR.cc/2025/Conference (poster)

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch

Shengyi Huang

Sophie Xhonneux

Arian Hosseini

Rishabh Agarwal

Aaron Courville

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.

2025-01-22

ICLR.cc/2025/Conference (poster)