Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas
Pierluca D'Oro
Koustuv Sinha
Michal Drozdzal
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Aniket Rajiv Didolkar
Andrii Zadaianchuk
Rabiul Awal
Maximilian Seitzer
Efstratios Gavves
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art models have shown remarkable success in object discovery, particularly in complex real-world scenes, while also generalizing well to unseen domains. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects and parts, without allowing user input to guide or modify which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as enabling models to represent scenes at variable levels of granularity based on user specification. In this work, we propose a novel approach that conditions slot representations through guided decomposition, paired with a novel contrastive learning objective, to enable user-directed control over which objects are represented. Our method achieves such controllability without any mask supervision and successfully binds to user-specified objects in complex real-world scenes.
Deliberate Practice with Synthetic Data
Reyhane Askari-Hemmat
Mohammad Pezeshki
Pietro Astolfi
Melissa Hall
Florian Bordes
Jakob Verbeek
Michal Drozdzal
Doob's Lagrangian: A Sample-Efficient Variational Approach to Transition Path Sampling
Yuanqi Du
Michael Plainer
Rob Brekelmans
Chenru Duan
Frank Noé
Carla P. Gomes
Alan Aspuru-Guzik
Rare event sampling in dynamical systems is a fundamental problem arising in the natural sciences, which poses significant computational challenges due to an exponentially large space of trajectories. For settings where the dynamical system of interest follows a Brownian motion with known drift, the question of conditioning the process to reach a given endpoint or desired rare event is definitively answered by Doob's h-transform. However, the naive estimation of this transform is infeasible, as it requires simulating sufficiently many forward trajectories to estimate rare event probabilities. In this work, we propose a variational formulation of Doob's h-transform as an optimization problem over trajectories between a given initial point and the desired ending point. To solve this optimization, we propose a simulation-free training objective with a model parameterization that imposes the desired boundary conditions by design. Our approach significantly reduces the search space over trajectories and avoids expensive trajectory simulation and inefficient importance sampling estimators which are required in existing methods. We demonstrate the ability of our method to find feasible transition paths on real-world molecular simulation and protein folding tasks.
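For readers unfamiliar with the transform named in this abstract, the standard textbook statement can be written as follows. The notation (drift b, noise scale σ, target set B, horizon T) is assumed here for illustration and is not taken from the paper. Given dynamics dX_t = b(X_t, t) dt + σ dW_t, define h as the probability of reaching the target from the current state; conditioning on the rare event then adds a correction term to the drift:

$$
h(x, t) = \mathbb{P}\left(X_T \in B \mid X_t = x\right),
\qquad
dX_t = \left[\, b(X_t, t) + \sigma \sigma^{\top} \nabla_x \log h(X_t, t) \,\right] dt + \sigma \, dW_t .
$$

Estimating h directly requires many forward simulations that rarely hit B, which is the infeasibility the abstract refers to.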
Enhancing Multi-Agent Multi-Modal Collaboration with Fine-Grained Reward Modeling
Qian Yang
Weixiang Yan
Multi-Modal Large Language Models (MLLMs) have significantly advanced multi-modal reasoning but still struggle with compositional reasoning tasks. Multi-agent collaboration provides a promising solution by leveraging the distinct capabilities of different agents. Specifically, a decomposer agent handles task breakdown and an answerer agent generates responses. While there have been efforts to adaptively decompose tasks based on the answerer agent's capabilities, such as using in-context learning, these methods often prove insufficient for fully effective decomposition. We address this issue by enhancing collaboration through fine-grained reward modeling, where each generated sub-question is assigned a specialized reward without requiring extra annotation or tuning of a reward model. Our proposed method dynamically optimizes the decomposition process, enabling better alignment between agents. Experimental results on four vision-language tasks demonstrate consistent improvements, with a 5.5% absolute increase in mean performance over traditional approaches. These findings highlight the efficacy of fine-grained reward modeling for enhancing multi-agent, multi-modal collaboration.
Evaluating Interventional Reasoning Capabilities of Large Language Models
Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts. Our analysis on four LLMs highlights that while GPT-4 models show promising accuracy at predicting the intervention effects, they remain sensitive to distracting factors in the prompts.
Fast Convergence of Softmax Policy Mirror Ascent for Bandits & Tabular MDPs
Reza Asad
Reza Babanezhad Harikandeh
Issam Hadj Laradji
Sharan Vaswani
We analyze the convergence of a novel policy gradient algorithm (referred to as SPMA) for multi-armed bandits and tabular Markov decision processes (MDPs). SPMA is an instantiation of mirror ascent and uses the softmax parameterization with a log-sum-exp mirror map. Given access to the exact policy gradients, we prove that SPMA with a constant step-size requires
Faster, More Efficient RLHF through Off-Policy Asynchronous Learning
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human feedback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy (synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs), is performant but not efficient. Following prior work in the general deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronous generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.
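As a rough illustration of the contrastive loss mentioned in this abstract, the sketch below computes the standard DPO objective for a single preference pair, with the pair ordered by reward-model scores as in an online setting. It is a minimal sketch, not the authors' implementation: the function names `order_by_reward` and `dpo_loss`, the default `beta`, and the use of summed sequence log-probabilities as inputs are all assumptions made for illustration.

```python
import math


def order_by_reward(completions, rewards):
    """Return (chosen, rejected) for two sampled completions.

    Hypothetical helper: in an online setup, the pair is sampled from a
    (possibly stale) policy and ranked by a reward model's scores.
    """
    first, second = completions
    return (first, second) if rewards[0] >= rewards[1] else (second, first)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities."""
    # Log-ratio of policy to reference model for each completion.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)), written as a numerically stable softplus(-logit).
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

In the off-policy regime described above, the completions passed to `order_by_reward` would come from an earlier version of the policy, while the log-probabilities passed to `dpo_loss` would be evaluated under the current one.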
GraphText: Graph Reasoning in Text Space
Jianan Zhao
Le Zhuo
Yikang Shen
Meng Qu
Kai Liu
Michael M. Bronstein
Zhaocheng Zhu
High Dimensional First Order Mini-Batch Algorithms on Quadratic Problems
Andrew Nicholas Cheng
Kiwon Lee
We analyze the dynamics of general mini-batch first order algorithms on the …
How Learning Rates Shape Neural Network Focus: Insights from Example Ranking
Ekaterina Lobacheva
Keller Jordan
Aristide Baratin
The learning rate is a key hyperparameter that affects both the speed of training and the generalization performance of neural networks. Through a new loss-based example ranking analysis, we show that networks trained with different learning rates focus their capacity on different parts of the data distribution, leading to solutions with different generalization properties. These findings, which hold across architectures and datasets, provide new insights into how learning rates affect model performance and example-level dynamics in neural networks.
Input Space Mode Connectivity in Deep Neural Networks
Jakub Vrabel
Ori Shem-Ur
Yaron Oz
We extend the concept of loss landscape mode connectivity to the input space of deep neural networks. Initially studied in parameter space, mode connectivity describes the existence of low-loss paths between solutions (loss minimizers) found via gradient descent. We present theoretical and empirical evidence of its presence in the input space of deep networks, thereby highlighting the broader nature of the phenomenon. We observe that different input images with similar predictions are generally connected, and for trained models, the path tends to be simple, with only a small deviation from being a linear path. We conjecture that input space mode connectivity in high-dimensional spaces is a geometric phenomenon, present even in untrained models, and can be explained by percolation theory. We exploit mode connectivity to obtain new insights about adversarial examples and show its potential for adversarial detection and interpretability.
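One simple way to probe the near-linear paths described in this abstract is to evaluate a loss along the straight line between two inputs and measure how far it rises above the endpoints. The sketch below does this with NumPy. It is an illustrative assumption rather than the paper's procedure: `loss_along_linear_path`, `barrier`, and the toy logistic model `toy_loss` (a stand-in for a deep network) are hypothetical names introduced here.

```python
import numpy as np


def loss_along_linear_path(loss_fn, x_a, x_b, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment from x_a to x_b."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1.0 - t) * x_a + t * x_b) for t in ts])
    return ts, losses


def barrier(losses):
    """Highest loss on the path minus the larger endpoint loss (0 means no barrier)."""
    return float(losses.max() - max(losses[0], losses[-1]))


# Toy stand-in for a trained model: a fixed logistic classifier, with the
# loss defined as the cross-entropy toward class 1.
W = np.array([1.0, -1.0])


def toy_loss(x):
    logit = float(W @ x)
    return float(np.log1p(np.exp(-logit)))


x_a = np.array([2.0, 0.0])   # both inputs are classified as class 1
x_b = np.array([0.0, -3.0])
ts, losses = loss_along_linear_path(toy_loss, x_a, x_b)
path_barrier = barrier(losses)
```

For this toy model the loss is convex along any straight line, so the barrier is zero; for a deep network the same measurement would indicate how far the linear path deviates from a low-loss path.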