Publications

Local Search GFlowNets
Minsu Kim
Taeyoung Yun
Emmanuel Bengio
Dinghuai Zhang
Sungsoo Ahn
Jinkyoo Park
Generative Flow Networks (GFlowNets) are amortized sampling methods that learn a distribution over discrete objects proportional to their rewards. GFlowNets exhibit a remarkable ability to generate diverse samples, yet occasionally struggle to consistently produce samples with high rewards due to over-exploration of the wide sample space. This paper proposes training GFlowNets with local search, which focuses on exploiting high-reward regions of the sample space to resolve this issue. Our main idea is to explore the local neighborhood via backtracking and reconstruction, guided by the backward and forward policies, respectively. This biases samples toward high-reward solutions, which is not possible with the typical GFlowNet generation scheme, which uses the forward policy to build solutions from scratch. Extensive experiments demonstrate a remarkable performance improvement on several biochemical tasks. Source code is available at https://github.com/dbsxodud-11/ls_gfn.
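A minimal, self-contained sketch of the destroy-and-rebuild move described in the abstract above: backtrack a few steps with a backward policy, reconstruct with a forward policy, and keep whichever sample scores higher. The toy bit-string environment, uniform policies, and greedy acceptance rule are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import random

LENGTH = 8

def reward(x):                      # toy reward: count of 1-bits
    return sum(x) + 1e-3

def forward_step(x):                # append a symbol (stand-in for the forward policy P_F)
    return x + [random.randint(0, 1)]

def backward_step(x):               # remove the last symbol (stand-in for the backward policy P_B)
    return x[:-1]

def local_search(x, k=3):
    """One local-search move around a complete sample x."""
    partial = x
    for _ in range(k):              # backtracking guided by the backward policy
        partial = backward_step(partial)
    candidate = partial
    while len(candidate) < LENGTH:  # reconstruction guided by the forward policy
        candidate = forward_step(candidate)
    # Accept only improvements, biasing the collected samples toward high-reward modes.
    return candidate if reward(candidate) >= reward(x) else x

x = [random.randint(0, 1) for _ in range(LENGTH)]
for _ in range(10):
    x = local_search(x)
print(x, reward(x))
```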
LOQA: Learning with Opponent Q-Learning Awareness
Milad Aghajohari
Juan Agustin Duque
Tim Cooijmans
In various real-world scenarios, interactions among agents often resemble the dynamics of general-sum games, where each agent strives to optimize its own utility. Despite the ubiquitous relevance of such settings, decentralized machine learning algorithms have struggled to find equilibria that maximize individual utility while preserving social welfare. In this paper we introduce Learning with Opponent Q-Learning Awareness (LOQA), a novel reinforcement learning algorithm tailored to optimizing an agent's individual utility while fostering cooperation among adversaries in partially competitive environments. LOQA assumes that each agent samples actions proportionally to their action-value function Q. Experimental results demonstrate the effectiveness of LOQA at achieving state-of-the-art performance in benchmark scenarios such as the Iterated Prisoner's Dilemma and the Coin Game. LOQA achieves these outcomes with a significantly reduced computational footprint compared to previous works, making it a promising approach for practical multi-agent applications.
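A minimal sketch of the opponent model stated in the abstract above: actions are sampled with probability proportional to Q-values. The Q-values, the non-negativity clipping, and the small epsilon are placeholder assumptions; LOQA's full algorithm additionally differentiates through this opponent model, which is not shown here.

```python
import numpy as np

def opponent_policy(q_values):
    """Probability of each action, proportional to (clipped, non-negative) Q-values."""
    q = np.maximum(q_values, 0.0) + 1e-8     # assumption: clip to keep a valid distribution
    return q / q.sum()

q = np.array([1.0, 3.0, 0.5])                # hypothetical Q(s, a) for three actions
probs = opponent_policy(q)
action = np.random.choice(len(q), p=probs)
print(probs, action)
```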
Mastering Memory Tasks with World Models
Mohammad Reza Samsami
Artem Zholus
Janarthanan Rajendran
Current model-based reinforcement learning (MBRL) agents struggle with long-term dependencies. This limits their ability to effectively solve tasks involving extended time gaps between actions and outcomes, or tasks that require recalling distant observations to inform current actions. To improve temporal coherence, we integrate a new family of state space models (SSMs) into the world models of MBRL agents, yielding a new method, Recall to Imagine (R2I). This integration aims to enhance both long-term memory and long-horizon credit assignment. Through a diverse set of illustrative tasks, we systematically demonstrate that R2I not only establishes a new state of the art on challenging memory and credit-assignment RL tasks, such as BSuite and POPGym, but also showcases superhuman performance in the complex memory domain of Memory Maze. At the same time, it upholds comparable performance on classic RL tasks, such as Atari and DMC, suggesting the generality of our method. We also show that R2I is faster than the state-of-the-art MBRL method, DreamerV3, resulting in faster wall-time convergence.
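A toy sketch of the kind of linear state-space recurrence (h_t = A h_{t-1} + B x_t, y_t = C h_t) that SSM-based world models lean on for long-range memory. The random matrices, dimensions, and sequential scan are illustrative assumptions; the paper's actual SSM layers and their efficient parallel training are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 16, 4, 4, 100
A = np.eye(d_state) * 0.99                   # slow decay -> long memory (assumption)
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1

def ssm_scan(xs):
    """Run the linear recurrence over a sequence of inputs."""
    h = np.zeros(d_state)
    ys = []
    for x in xs:                             # sequential scan over the episode
        h = A @ h + B @ x
        ys.append(C @ h)
    return np.stack(ys)

xs = rng.normal(size=(T, d_in))              # stand-in for encoded observations
print(ssm_scan(xs).shape)                    # (100, 4)
```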
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain
Robert Kirk
Ekdeep Singh Lubana
Robert P. Dick
Hidenori Tanaka
Edward Grefenstette
Tim Rocktäschel
Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including models that are safe to deploy. Despite its clear importance, there has been minimal work explaining how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities, or does it merely modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such 'wrapped capabilities' are relevant leads to sample-efficient revival of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a downstream task that is, e.g., superficially unrelated. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
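A minimal linear-probe sketch, illustrating the kind of probing tool mentioned in the abstract above: fit a simple classifier on frozen hidden activations before and after fine-tuning and compare accuracy. The random features and synthetic labels below stand in for real model activations; this is not the paper's experimental code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 64))            # stand-in for hidden activations of one layer
labels = (acts[:, 0] > 0).astype(int)        # stand-in for the capability being probed

# Train the probe on held-out activations and report accuracy on the rest.
probe = LogisticRegression(max_iter=1000).fit(acts[:400], labels[:400])
print("probe accuracy:", probe.score(acts[400:], labels[400:]))
```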
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
Martin Klissarov
Pierluca D'Oro
Shagun Sodhani
Roberta Raileanu
Amy Zhang
Mikael Henaff
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.
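A minimal sketch of turning pairwise preferences into a scalar intrinsic reward, the core step described in the abstract above. In the method, the preference labels would come from an LLM comparing two captions; here they are synthetic placeholders, and the tiny reward network and Bradley-Terry-style loss are assumptions for illustration rather than Motif's actual architecture.

```python
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(caption_a, caption_b, pref_b_over_a):
    """Bradley-Terry style loss: P(b preferred over a) = sigmoid(r(b) - r(a))."""
    r_a, r_b = reward_model(caption_a), reward_model(caption_b)
    logits = (r_b - r_a).squeeze(-1)
    return nn.functional.binary_cross_entropy_with_logits(logits, pref_b_over_a)

# Toy batch: 32 caption-embedding pairs with (placeholder) LLM preference labels.
a, b = torch.randn(32, 16), torch.randn(32, 16)
prefs = torch.randint(0, 2, (32,)).float()
opt.zero_grad()
loss = preference_loss(a, b, prefs)
loss.backward()
opt.step()
print(float(loss))
```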
Object centric architectures enable efficient causal representation learning
Amin Mansouri
Jason Hartford
Yan Zhang
Causal representation learning has shown a variety of settings in which we can disentangle latent variables with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are represented as …
PhyloGFN: Phylogenetic inference with generative flow networks
Ming Yang Zhou
Zichao Yan
Elliot Layne
Nikolay Malkin
Dinghuai Zhang
Moksh J. Jain
Piecewise Linear Parametrization of Policies: Towards Interpretable Deep Reinforcement Learning
Maxime Wabartha
Learning inherently interpretable policies is a central challenge in the path to developing autonomous agents that humans can trust. Linear policies can justify their decisions while interacting in a dynamic environment, but their reduced expressivity prevents them from solving hard tasks. Instead, we argue for the use of piecewise-linear policies. We carefully study to what extent they can retain the interpretable properties of linear policies while reaching competitive performance with neural baselines. In particular, we propose the HyperCombinator (HC), a piecewise-linear neural architecture expressing a policy with a controllably small number of sub-policies. Each sub-policy is linear with respect to interpretable features, shedding light on the decision process of the agent without requiring an additional explanation model. We evaluate HC policies in control and navigation experiments, visualize the improved interpretability of the agent and highlight its trade-off with performance. Moreover, we validate that the restricted model class that the HyperCombinator belongs to is compatible with the algorithmic constraints of various reinforcement learning algorithms.
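A rough sketch of a piecewise-linear policy in the spirit of the abstract above: a small gating network picks one of a few sub-policies, each of which is linear in interpretable input features, and the selected index exposes which linear rule produced the action. The dimensions, hard-argmax gating, and initialization are assumptions for illustration; the paper's architecture and training procedure differ.

```python
import torch
import torch.nn as nn

class PiecewiseLinearPolicy(nn.Module):
    def __init__(self, n_features=8, n_actions=4, n_subpolicies=3):
        super().__init__()
        self.gate = nn.Linear(n_features, n_subpolicies)           # selects a sub-policy
        self.subpolicies = nn.Parameter(0.1 * torch.randn(n_subpolicies, n_actions, n_features))

    def forward(self, features):
        idx = self.gate(features).argmax(-1)                       # piecewise (hard) selection
        W = self.subpolicies[idx]                                  # (batch, actions, features)
        logits = torch.einsum('baf,bf->ba', W, features)           # linear in the input features
        return logits, idx                                         # idx reveals which linear rule fired

policy = PiecewiseLinearPolicy()
logits, which = policy(torch.randn(5, 8))
print(logits.shape, which)
```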
Poly-View Contrastive Learning
Amitis Shidani
Jason Ramapuram
Russell Webb
Eeshan Gunesh Dhekane
Dan Busbridge
Pre-Training and Fine-Tuning Generative Flow Networks
Ling Pan
Moksh J. Jain
Kanika Madan
Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects from a given unnormalized reward distribution. They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, as they are typically trained from a given extrinsic reward function, it remains an important open challenge how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks. Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. By framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcome, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model allows for direct extraction of a policy capable of sampling from any new reward function in downstream tasks. Nonetheless, adapting OC-GFN to a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor, enabling efficient fine-tuning. Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN and its ability to swiftly adapt to downstream tasks and discover modes more efficiently. This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.
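A minimal sketch of the outcome-conditioning idea described above: the forward policy takes both the current partial object and a target outcome, so a single pre-trained network can be steered toward any goal. The bit-string environment, the unfilled-position encoding, and the network sizes are toy assumptions, not the paper's OC-GFN implementation.

```python
import torch
import torch.nn as nn

LENGTH = 8

# Forward policy conditioned on (partial state, target outcome); -1 marks unfilled positions.
policy = nn.Sequential(nn.Linear(2 * LENGTH, 64), nn.ReLU(), nn.Linear(64, 2))

def rollout(target_outcome):
    """Build a bit string one symbol at a time, conditioned on the target outcome."""
    state = -torch.ones(LENGTH)
    for t in range(LENGTH):
        logits = policy(torch.cat([state, target_outcome]))
        action = torch.distributions.Categorical(logits=logits).sample()
        state[t] = action.float()
    return state

goal = torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.])
print(rollout(goal))
```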
Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo
Haque Ishfaq
Qingfeng Lan
Pan Xu
A. Rupam Mahmood
Animashree Anandkumar
Kamyar Azizzadenesheli
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of …
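A minimal sketch of the noisy-gradient (Langevin) update described above: an ordinary gradient step on a TD-style loss plus Gaussian noise scaled by the step size, so the Q-network parameters behave like approximate posterior samples. The network, loss, temperature, and batch are illustrative assumptions rather than the paper's algorithm.

```python
import math
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
step_size, inverse_temp = 1e-3, 1e4

def langevin_step(states, actions, td_targets):
    """One Langevin Monte Carlo update of the Q-network parameters."""
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = ((q - td_targets) ** 2).mean()
    q_net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in q_net.parameters():
            noise = torch.randn_like(p) * math.sqrt(2 * step_size / inverse_temp)
            p.add_(-step_size * p.grad + noise)   # gradient step + Langevin noise
    return float(loss)

s = torch.randn(32, 4)                            # stand-in transitions
a = torch.randint(0, 2, (32,))
y = torch.randn(32)                               # stand-in TD targets
print(langevin_step(s, a, y))
```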