Bidirectional Information Flow (BIF) - A Sample Efficient Hierarchical Gaussian Process for Bayesian Optimization
Juan David Guerra
Thomas Garbay
Hierarchical Gaussian Process (H-GP) models divide problems into different subtasks, allowing for different models to address each part, making them well-suited for problems with inherent hierarchical structure. However, typical H-GP models do not fully take advantage of this structure, only sending information up or down the hierarchy. This one-way coupling limits sample efficiency and slows convergence. We propose Bidirectional Information Flow (BIF), an efficient H-GP framework that establishes bidirectional information exchange between parent and child models in H-GPs for online training. BIF retains the modular structure of hierarchical models - the parent combines subtask knowledge from children GPs - while introducing top-down feedback to continually refine children models during online learning. This mutual exchange improves sample efficiency, enables robust training, and allows modular reuse of learned subtask models. BIF outperforms conventional H-GP Bayesian Optimization methods, achieving up to 85% and 5x higher
Building spatial world models from sparse transitional episodic memories
Zizhan He
Maxime Daigle
Many animals possess a remarkable capacity to rapidly construct flexible mental models of their environments. These world models are crucial for ethologically relevant behaviors such as navigation, exploration, and planning. The ability to form episodic memories and make inferences based on these sparse experiences is believed to underpin the efficiency and adaptability of these models in the brain. Here, we ask: Can a neural network learn to construct a spatial model of its surroundings from sparse and disjoint episodic memories? We formulate the problem in a simulated world and propose a novel framework, the Episodic Spatial World Model (ESWM), as a potential answer. We show that ESWM is highly sample-efficient, requiring minimal observations to construct a robust representation of the environment. It is also inherently adaptive, allowing for rapid updates when the environment changes. In addition, we demonstrate that ESWM readily enables near-optimal strategies for exploring novel environments and navigating between arbitrary points, all without the need for additional training.
Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down
Yingzhi Wang
Anas Alhmoud
Saad Alsahly
Muhammad Alqurishi
Caption This, Reason That: VLMs Caught in the Middle
Zihan Weng
Lucas Gomez
Taylor Whittington Webb
Compositional Risk Minimization
Divyat Mahajan
Mohammad Pezeshki
Charles Arnal
Kartik Ahuja
Context is Key: A Benchmark for Forecasting with Essential Textual Information
Andrew Robert Williams
Arjun Ashok
Étienne Marcotte
Valentina Zantedeschi
Jithendaraa Subramanian
Roland Riachi
James Requeima
Alexandre Lacoste
Dimension-adapted Momentum Outscales SGD
Damien Ferbach
Katie Everett
Elliot Paquette
We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.
Discovering Symbolic Cognitive Models from Human and Animal Behavior
Nenad Tomasev
Navodita Sharma
Rishika Mohanta
Aparna Dev
Kuba Perlin
Siddhant Jain
Kyle Levin
Noemi Elteto
Will Dabney
Alexander Novikov
Glenn C Turner
Maria K Eckstein
Nathaniel D. Daw
Kevin J Miller
Kim Stachenfeld
Symbolic models play a key role in cognitive science, expressing computationally precise hypotheses about how the brain implements a cognitive process. Identifying an appropriate model typically requires a great deal of effort and ingenuity on the part of a human scientist. Here, we adapt FunSearch (Romera-Paredes et al. 2024), a recently developed tool that uses Large Language Models (LLMs) in an evolutionary algorithm, to automatically discover symbolic cognitive models that accurately capture human and animal behavior. We consider datasets from three species performing a classic reward-learning task that has been the focus of substantial modeling effort, and find that the discovered programs outperform state-of-the-art cognitive models for each. The discovered programs can readily be interpreted as hypotheses about human and animal cognition, instantiating interpretable symbolic learning and decision-making algorithms. Broadly, these results demonstrate the viability of using LLM-powered program synthesis to propose novel scientific hypotheses regarding mechanisms of human and animal cognition.
Does learning the right latent variables necessarily improve in-context learning?
Sarthak Mittal
Eric Elmoznino
Leo Gagnon
Sangnie Bhardwaj
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting avenues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or if they instead exploit heuristics and statistical shortcuts enabled by attention layers. Both scenarios have inspired active ongoing work. In this paper, we systematically investigate the effect of explicitly inferring task latents. We minimally modify the Transformer architecture with a bottleneck designed to prevent shortcuts in favor of more structured solutions, and then compare performance against standard Transformers across various ICL tasks. Contrary to intuition and some recent works, we find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance, in general. Curiously, we find that while the bottleneck effectively learns to extract latent task variables from context, downstream processing struggles to utilize them for robust prediction. Our study highlights the intrinsic limitations of Transformers in achieving structured ICL solutions that generalize, and shows that while inferring the right latents aids interpretability, it is not sufficient to alleviate this problem.
FLAM: Frame-Wise Language-Audio Modeling
Yusong Wu
Christos Tsirigotis
Ke Chen
Oriol Nieto
Prem Seetharaman
Justin Salamon
A flexible machine learning Mendelian randomization estimator applied to predict the safety and efficacy of sclerostin inhibition
Jason Hartford
Benoît J. Arsenault
AI for Global Climate Cooperation: Modeling Global Climate Negotiations, Agreements, and Long-Term Cooperation in RICE-N
Tianyu Zhang
Andrew Robert Williams
Phillip Wozny
Kai-Hendrik Cohrs
Koen Ponse
Marco Jiralerspong
Soham Rajesh Phade
Sunil Srinivasa
Lu Liu
Yang Zhang
Prateek Gupta
Erman Acar
Stephan Zheng