Publications

Handling Delay in Real-Time Reinforcement Learning

Ivan Anokhin

Rishav

Matthew D Riemer

Stephen Chung

Irina Rish

Samira Ebrahimi Kahou

Real-time reinforcement learning (RL) introduces several challenges. First, policies are constrained to a fixed number of actions per second due to hardware limitations. Second, the environment may change while the network is still computing an action, leading to observational delay. The first issue can partly be addressed with pipelining, leading to higher throughput and potentially better policies. However, the second issue remains: if each neuron operates in parallel with an execution time of

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

Seanie Lee

Haebin Seong

Dong Bok Lee

Minki Kang

Xiaoyin Chen

Dominik Wagner

Yoshua Bengio

Juho Lee

Sung Ju Hwang

Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as "Make a single harmful instruction prompt that would elicit offensive content", we add an affirmative prefix (e.g., "I have an idea for a prompt:") to the LLM's response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Influence Functions for Scalable Data Attribution in Diffusion Models

Bruno Mlodozeniec

Runa Eschenhagen

Juhan Bae

Alexander Immer

David Scott Krueger

Richard E. Turner

Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by extending influence functions. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we use a K-FAC approximation based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We show that our recommended method outperforms previously proposed data attribution methods on common data attribution evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.

2025-01-22

ICLR.cc/2025/Conference (oral)

openreview.net

InnerThoughts: Disentangling Representations and Predictions in Large Language Models

Didier Chételat

Joseph Cotnareanu

Rylee Thompson

Yingxue Zhang

Mark Coates

Large language models (LLMs) contain substantial factual knowledge which is commonly elicited by multiple-choice question-answering prompts. Internally, such models process the prompt through multiple transformer layers, building varying representations of the problem within their hidden states. Ultimately, however, only the hidden state corresponding to the final layer and token position is used to predict the answer label. In this work, we propose instead to learn a small separate neural network predictor module on a collection of training questions, which takes the hidden states from all the layers at the last temporal position as input and outputs predictions. In effect, such a framework disentangles the representational abilities of LLMs from their predictive abilities. On a collection of hard benchmarks, our method achieves considerable improvements in performance, sometimes comparable to supervised fine-tuning procedures, but at a fraction of the computational cost.

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

proceedings.mlr.press

openreview.net

Input Space Mode Connectivity in Deep Neural Networks

Jakub Vrabel

Ori Shem-Ur

Yaron Oz

David Scott Krueger

We extend the concept of loss landscape mode connectivity to the input space of deep neural networks. Initially studied in parameter space, mode connectivity describes the existence of low-loss paths between solutions (loss minimizers) found via gradient descent. We present theoretical and empirical evidence of its presence in the input space of deep networks, thereby highlighting the broader nature of the phenomenon. We observe that different input images with similar predictions are generally connected, and for trained models, the path tends to be simple, with only a small deviation from being a linear path. We conjecture that input space mode connectivity in high-dimensional spaces is a geometric phenomenon, present even in untrained models, and can be explained by percolation theory. We exploit mode connectivity to obtain new insights about adversarial examples and show its potential for adversarial detection and interpretability.

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

Gaurav Sahu

Abhay Puri

Juan A. Rodriguez

Amirhossein Abaskohi

Mohammad Chegini

Alexandre Drouin

Perouz Taslakian

Valentina Zantedeschi

Alexandre Lacoste

David Vázquez

Nicolas Chapados

Chris Pal

Sai Rajeswar

Issam Hadj Laradji

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Thomas Bush

Stephen Chung

Usman Anwar

Adrià Garriga-Alonso

David Scott Krueger

We present the first mechanistic evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban -- a commonly used benchmark for studying planning. Specifically, we demonstrate that DRC, a generic model-free agent introduced by [Guez et al. (2019)](https://arxiv.org/abs/1901.03559), uses learned concept representations to internally formulate plans that both predict the long-term effects of actions on the environment and influence action selection. Our methodology involves: (1) probing for planning-relevant concepts, (2) investigating plan formation within the agent's representations, and (3) verifying that discovered plans (in the agent's representations) have a causal effect on the agent's behavior through interventions. We also show that the emergence of these plans coincides with the emergence of a planning-like property: the ability to benefit from additional test-time compute. Finally, we perform a qualitative analysis of the planning algorithm learned by the agent and discover a strong resemblance to parallelized bidirectional search. Our findings advance understanding of the internal mechanisms underlying planning behavior in agents, which is important given the recent trend of emergent planning and reasoning capabilities in LLMs through RL.

2025-01-22

ICLR.cc/2025/Conference (oral)

openreview.net

Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning

Haque Ishfaq

Guangyuan Wang

Mohammad Sami Nur Islam

Doina Precup

Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to the lack of a principled exploration mechanism. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Learning diverse attacks on large language models for robust red-teaming and safety tuning

Seanie Lee

Minsu Kim

Lynn Cherif

David Dobre

Juho Lee

Sung Ju Hwang

Moksh J. Jain

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Learning Versatile Optimizers on a Compute Diet

Learned optimization has emerged as a promising alternative to hand-crafted optimizers, with the potential to discover stronger learned update rules that enable faster, hyperparameter-free training of neural networks. A critical element for practically useful learned optimizers, that can be used off-the-shelf after meta-training, is strong meta-generalization: the ability to apply the optimizers to new tasks. Recent state-of-the-art work in learned optimizers, VeLO (Metz et al., 2022), requires a large number of highly diverse meta-training tasks along with massive computational resources, 4000 TPU months, to achieve meta-generalization. This makes further improvements to such learned optimizers impractical. In this work, we identify several key elements in learned optimizer architectures and meta-training procedures that can lead to strong meta-generalization. We also propose evaluation metrics to reliably assess quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers and also outperforms tuned state-of-the-art optimizers on a diverse set of out-of-distribution tasks, despite being meta-trained for just 24 GPU hours.

2025-01-22

ArXiv (preprint)

doi.org

arxiv.org
