Pierluca D'Oro

Hierarchical Procedural Meta-Reasoning for Generalizable Multimodal Agents

Yao Fu

Shengyi Qian

Fanyi Xiao

Honglak Lee

Joseph Tighe

Manchen Wang

While multimodal agents can achieve strong performance through fine-tuning, their ability to generalize remains limited in complex real-world tasks such as mobile navigation, where diverse applications, frequent system changes, and customized workflows are common. We argue that a fundamental bottleneck lies in whether an agent possesses sufficient task-specific procedural knowledge to accomplish a given goal. In practice, due to the limited or outdated knowledge of the agent, the procedural steps it generates can be hallucinated and misaligned with the environment during execution. However, better procedural knowledge can be provided by the general capabilities of large language models, or obtained from additional external resources such as web search when necessary. Based on this view, we propose Procedure-Aware Multimodal Agent with Meta Reasoning, a framework that explicitly represents task knowledge as natural-language procedures and trains a procedure-aware grounded agent to condition its actions on this knowledge. By learning to leverage procedural knowledge from different sources, our approach enables robust and reliable generalization with reduced procedural hallucination across tasks, applications, interface versions, and multi-app workflows, achieving substantial improvements on challenging Android benchmarks.

2026-03-01

Reliable Autonomy @ International Conference on Learning Representations (poster)

openreview.net

Mol-MoE: Training Preference-Guided Routers for Molecule Generation

Diego Calanzone

Pierluca D'Oro

Pierre-Luc Bacon

Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.

2025-03-04

ICLR.cc/2025/Workshop/GEM (published)

doi.org

openreview.net

Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

When training neural networks, dying neurons -- units becoming inactive or saturated -- are traditionally seen as harmful. This paper sheds new light on this phenomenon. By exploring the impact of various hyperparameter configurations on dying neurons during training, we gather insights on how to improve upon sparse training approaches to pruning. We introduce Demon Pruning (DemP), a method that controls the proliferation of dead neurons through a combination of noise injection on active units and a one-cycle schedule regularization strategy, dynamically leading to network sparsity. Experiments on CIFAR-10 and ImageNet datasets demonstrate that DemP outperforms existing dense-to-sparse structured pruning methods, achieving better accuracy-sparsity tradeoffs and accelerating training by up to 3.56

2025-02-12

TMLR (accepted)

doi.org

openreview.net

MaestroMotif: Skill Design From Artificial Intelligence Feedback

Martin Klissarov

Mikael Henaff

Roberta Raileanu

Marlos C. Machado

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2025-01-21

ICLR.cc/2025/Conference (oral)

doi.org

openreview.net

Towards General-Purpose Model-Free Reinforcement Learning

Scott Fujimoto

Pierluca D'Oro

Amy Zhang

Yuandong Tian

Michael G. Rabbat

Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show a competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.

2025-01-21

ICLR.cc/2025/Conference (spotlight)

doi.org

openreview.net

Controlling Multimodal LLMs via Reward-guided Decoding

Adriana Romero

2024-10-09

NeurIPS.cc/2024/Workshop/AFM (poster)

doi.org

openreview.net

Do Transformer World Models Give Better Policy Gradients?

A natural approach for reinforcement learning is to predict future rewards by unrolling a neural network world model, and to backpropagate through the resulting computational graph to learn a policy. However, this method often becomes impractical for long horizons since typical world models induce hard-to-optimize loss landscapes. Transformers are known to efficiently propagate gradients over long horizons: could they be the solution to this problem? Surprisingly, we show that commonly-used transformer world models produce circuitous gradient paths, which can be detrimental to long-range policy gradients. To tackle this challenge, we propose a class of world models called Actions World Models (AWMs), designed to provide more direct routes for gradient propagation. We integrate such AWMs into a policy gradient framework that underscores the relationship between network architectures and the policy gradient updates they inherently represent. We demonstrate that AWMs can generate optimization landscapes that are easier to navigate even when compared to those from the simulator itself. This property allows transformer AWMs to produce better policies than competitive baselines in realistic long-horizon tasks.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Controlling Large Language Model Agents with Entropic Activation Steering

Nathan Rahn

Pierluca D'Oro

Marc G Bellemare

The generality of pretrained large language models (LLMs) has prompted increasing interest in their use as in-context learning agents. To be successful, such agents must form beliefs about how to achieve their goals based on limited interaction with their environment, resulting in uncertainty about the best action to take at each step. In this paper, we study how LLM agents form and act on these beliefs by conducting experiments in controlled sequential decision-making tasks. To begin, we find that LLM agents are overconfident: They draw strong conclusions about what to do based on insufficient evidence, resulting in inadequately explorative behavior. We dig deeper into this phenomenon and show how it emerges from a collapse in the entropy of the action distribution implied by sampling from the LLM. We then demonstrate that existing token-level sampling techniques are by themselves insufficient to make the agent explore more. Motivated by this fact, we introduce Entropic Activation Steering (EAST), an activation steering method for in-context LLM agents. EAST computes a steering vector as an entropy-weighted combination of representations, and uses it to manipulate an LLM agent's uncertainty over actions by intervening on its activations during the forward pass. We show that EAST can reliably increase the entropy in an LLM agent's actions, causing more explorative behavior to emerge. Finally, EAST modifies the subjective uncertainty an LLM agent expresses, paving the way to interpreting and controlling how LLM agents represent uncertainty about their decisions.

2024-06-23

MI @ International Conference on Machine Learning (poster)

doi.org

openreview.net

Motif: Intrinsic Motivation From Artificial Intelligence Feedback

Roberta Raileanu

Mikael Henaff

Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.

2024-01-15

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

The Curse of Diversity in Ensemble-Based Exploration

We uncover a surprising phenomenon in deep reinforcement learning: training a diverse ensemble of data-sharing agents -- a well-established exploration strategy -- can significantly impair the performance of the individual ensemble members when compared to standard single-agent training. Through careful analysis, we attribute the degradation in performance to the low proportion of self-generated data in the shared training data for each ensemble member, as well as the inefficiency of the individual ensemble members to learn from such highly off-policy data. We thus name this phenomenon the curse of diversity. We find that several intuitive solutions -- such as a larger replay buffer or a smaller ensemble size -- either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Finally, we demonstrate the potential of representation learning to counteract the curse of diversity with a novel method named Cross-Ensemble Representation Learning (CERL) in both discrete and continuous control domains. Our work offers valuable insights into an unexpected pitfall in ensemble-based exploration and raises important caveats for future applications of similar approaches.

2024-01-15

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control

Bellemare Marc-Emmanuel

Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier

Bellemare Marc-Emmanuel

Aaron Courville

2023-01-31