Publications

Pitfalls of Evidence-Based AI Policy

Stephen Casper

Dylan Hadfield-Menell

Nations across the world are working to govern AI. However, from a technical perspective, the best way to do this is not yet clear. Meanwhil… (see more)e, recent debates over AI regulation have led to calls for “evidence-based AI policy” which emphasize holding regulatory action to a high evidentiary standard. Evidence is of irreplaceable value to policymaking. However, holding regulatory action to too high an evidentiary standard can lead to systematic neglect of certain risks. In historical policy debates (e.g., over tobacco ca. 1965 and fossil fuels ca. 1990) “evidence-based policy” rhetoric is also a well-precedented strategy to downplay the urgency of action, delay regulation, and protect industry interests. Here, we argue that if the goal is evidence-based AI policy, the first regulatory objective must be to actively facilitate the process of identifying, studying, and deliberating about AI risks. We discuss a set of 16 regulatory goals to facilitate this and show that the EU, UK, USA, Brazil, Canada, and China all have substantial opportunities to adopt further evidence-seeking policies.

2025-01-23

ICLR.cc/2025/BlogPosts (accepted)

openreview.net

Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection

Yun Zhu

Jia-Chen Gu

Caitlin Sikora

Ho Ko

Yinxiao Liu

Chu-Cheng Lin

Lei Shu

Liangchen Luo

Lei Meng

Bang Liu

Jindong Chen

Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external context… (see more)s. However, the input length grows linearly in the number of retrieved documents, causing a dramatic increase in latency. In this paper, we propose a novel paradigm named Sparse RAG, which seeks to cut computation costs through sparsity. Speciﬁcally, Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced by long-range attention of retrieved documents. Then, LLMs selectively decode the output by only attending to highly relevant caches auto-regressively, which are chosen via prompting LLMs with special control tokens. It is notable that Sparse RAG combines the assessment of each individual document and the generation of the response into a single process. The designed sparse mechanism in a RAG system can facilitate the reduction of the number of documents loaded during decoding for accelerating the inference of the RAG system. Additionally, ﬁltering out undesirable contexts enhances the model’s focus on relevant context, inherently improving its generation quality. Evaluation results on four datasets show that Sparse RAG can be used to strike an optimal balance between generation quality and computational efﬁciency, demonstrating its generalizability across tasks.

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

Accelerating neural network training: An analysis of the AlgoPerf competition

Priya Kasimbeg

Frank Schneider

Runa Eschenhagen

Juhan Bae

Chandramouli Shama Sastry

Mark Saroufim

BOYUAN FENG

Less Wright

Edward Z. Yang

Zachary Nado

Sourabh Medapati

Philipp Hennig

Michael Rabbat

George E. Dahl

The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by i… (see more)mproving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Accelerating neural network training: An analysis of the AlgoPerf competition

Priya Kasimbeg

Frank Schneider

Runa Eschenhagen

Juhan Bae

Chandramouli Shama Sastry

Mark Saroufim

BOYUAN FENG

Less Wright

Edward Z. Yang

Zachary Nado

Sourabh Medapati

Philipp Hennig

Michael Rabbat

George E. Dahl

The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by i… (see more)mproving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

Accelerating Training with Neuron Interaction and Nowcasting Networks

Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However,… (see more) learnable update rules can be costly and unstable to train and use. A simpler recently proposed approach to accelerate training is to use Adam for most of the optimization steps and periodically, only every few steps, nowcast (predict future) parameters. We improve this approach by Neuron interaction and Nowcasting (NiNo) networks. NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters by learning in a supervised way from a set of training trajectories over multiple tasks. We show that in some networks, such as Transformers, neuron connectivity is non-trivial. By accurately modeling neuron connectivity, we allow NiNo to accelerate Adam training by up to 50\% in vision and language tasks.

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Action abstractions for amortized sampling

Oussama Boussif

Lena Nehale Ezzine

Joseph D Viviano

Michał Koziarski

Moksh J. Jain

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

AdaFisher: Adaptive Second Order Optimization via Fisher Information

Damien MARTINS GOMES

Yanlei Zhang

Eugene Belilovsky

Guy Wolf

Mahdi S. Hosseini

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limi… (see more)ted curvature information by employing the diagonal matrix preconditioning of the stochastic gradient during the training. Despite their widespread, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts e.g. Adam and SGD. However, their practicality in training DNNs are still limited due to increased per-iteration computations and suboptimal accuracy compared to the first order methods. We present AdaFisher--an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification, language modelling and stand out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code available from \href{https://github.com/AtlasAnalyticsLab/AdaFisher}{https://github.com/AtlasAnalyticsLab/AdaFisher}

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Adaptive teachers for amortized samplers

Minsu Kim

Sungsoo Ahn

Jinkyoo Park

Nikolay Malkin

Yoshua Bengio

Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnorma… (see more)lized density where exact sampling is intractable. When sampling is implemented as a sequential decision-making process, reinforcement learning (RL) methods, such as generative flow networks, can be used to train the sampling policy. Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration. We propose to use an adaptive training distribution (the Teacher) to guide the training of the primary amortized sampler (the Student) by prioritizing high-loss regions. The Teacher, an auxiliary behavior model, is trained to sample high-error regions of the Student and can generalize across unexplored modes, thereby enhancing mode coverage by providing an efficient training curriculum. We validate the effectiveness of this approach in a synthetic environment designed to present an exploration challenge, two diffusion-based sampling tasks, and four biochemical discovery tasks demonstrating its ability to improve sample efficiency and mode coverage.

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Advantage Alignment Algorithms

Juan Agustin Duque

Milad Aghajohari

Tim Cooijmans

2025-01-22

ICLR.cc/2025/Conference (oral)

openreview.net

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang

Jinyu Xiang

Zhaoyang Yu

Fengwei Teng

Xiong-Hui Chen

Jiaqi Chen

Mingchen Zhuge

Xin Cheng

Sirui Hong

Jinlin Wang

Bingnan Zheng

Bang Liu

Yuyu Luo

Chenglin Wu

Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing … (see more)agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFLOW, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFLOW's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFLOW enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code is available at https://github.com/geekan/MetaGPT.

2025-01-22

ICLR.cc/2025/Conference (oral)

openreview.net

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang

Jinyu Xiang

Zhaoyang Yu

Fengwei Teng

Xiong-Hui Chen

Jiaqi Chen

Mingchen Zhuge

Xin Cheng

Sirui Hong

Jinlin Wang

Bingnan Zheng

Bang Liu

Yuyu Luo

Chenglin Wu

Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing … (see more)agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFLOW, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFLOW's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFLOW enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code is available at https://github.com/geekan/MetaGPT.

2025-01-22

ICLR.cc/2025/Conference (oral)

openreview.net