Publications

In-Context Parametric Inference: Point or Distribution Estimators?

Sarthak Mittal

Yoshua Bengio

Nikolay Malkin

Guillaume Lajoie

Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.

2025-02-17

ArXiv (preprint)

doi.org

arxiv.org

Integrating Present and Past in Unsupervised Continual Learning

Yipeng Zhang

Laurent Charlin

Richard Zemel

Mengye Ren

We formulate a unifying framework for *unsupervised continual learning (UCL)*, which disentangles learning objectives that are specific to the present and the past data, encompassing *stability*, *plasticity*, and *cross-task consolidation*. The framework reveals that many existing UCL approaches overlook cross-task consolidation and try to balance plasticity and stability in a shared embedding space. This results in worse performance due to a lack of within-task data diversity and reduced effectiveness in learning the current task. Our method, *Osiris*, which explicitly optimizes all three objectives on separate embedding spaces, achieves state-of-the-art performance on all benchmarks, including two novel ones proposed in this paper featuring semantically structured task sequences. Finally, we show some preliminary evidence that continual models can benefit from such more realistic learning scenarios.

2025-02-17

Proceedings of The 3rd Conference on Lifelong Learning Agents (published)

proceedings.mlr.press

Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Quentin Garrido

Nicolas Ballas

Mahmoud Assran

Adrien Bardes

Laurent Najman

Michael Rabbat

Emmanuel Dupoux

Yann LeCun

We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.

2025-02-17

ArXiv (preprint)

doi.org

arxiv.org

Meta-Analysis with Untrusted Data

Shiva Kaul

Geoff Gordon

Meta-analyses are usually conducted on small amounts of “trusted” data, ideally from randomized, controlled trials. Excluding untrusted (observational) data, such as medical records and related scientific literature, avoids potential confounding and ensures unbiased conclusions. Unfortunately, this exclusion can reduce predictive accuracy to the point of clinical irrelevance, especially when trials are heterogeneous. This paper shows how untrusted data can be safely incorporated into meta-analysis, improving predictions without sacrificing rigor or introducing unproven assumptions. Our approach, called conformal meta-analysis, consists of (1) learning a (potentially flawed) prior distribution from the untrusted data, (2) using the prior and trusted data to derive a simple, fully-conformal prediction interval for the observed trial effect, and (3) analytically extracting an interval for the true (unobserved) effect. In multiple experiments on healthcare datasets, our algorithms deliver tighter, sounder intervals than traditional ones. This paper conceptually realigns meta-analysis as a foundation for evidence-based medicine, embracing heterogeneity and untrusted data for more nuanced, precise predictions.

2025-02-17

Proceedings of the 4th Machine Learning for Health Symposium (published)

proceedings.mlr.press

Partial Models for Building Adaptive Model-Based Reinforcement Learning Agents

Safa Alver

Ali Rahimi-Kalahroudi

Doina Precup

In neuroscience, one of the key behavioral tests for determining whether a subject of study exhibits model-based behavior is to study its adaptiveness to local changes in the environment. In reinforcement learning, however, recent studies have shown that modern model-based agents display poor adaptivity to such changes. The main reason for this is that modern agents are typically designed to improve sample efficiency in single task settings and thus do not take into account the challenges that can arise in other settings. In local adaptation settings, one particularly important challenge is in quickly building and maintaining a sufficiently accurate model after a local change. This is challenging for deep model-based agents as their models and replay buffers are monolithic structures lacking distribution shift handling capabilities. In this study, we show that the conceptually simple idea of partial models can allow deep model-based agents to overcome this challenge and thus allow for building locally adaptive model-based agents. By modeling the different parts of the state space through different models, the agent can not only maintain a model that is accurate across the state space, but it can also quickly adapt it in the presence of a local change in the environment. We demonstrate this by showing that the use of partial models in agents such as deep Dyna-Q, PlaNet and Dreamer can allow for them to effectively adapt to the local changes in their environments.

2025-02-17

Proceedings of The 3rd Conference on Lifelong Learning Agents (published)

doi.org

arxiv.org

Sub-goal Distillation: A Method to Improve Small Language Agents

Maryam Hashemzadeh

Elias Stengel-Eskin

Sarath Chandar

Marc-Alexandre Côté

While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in long-horizon interactive tasks such as decision-making or in scenarios involving continuous ongoing tasks. To address these constraints, we propose a method for transferring the performance of an LLM with billions of parameters to a much smaller language model (770M parameters). Our approach involves constructing a hierarchical agent comprising a planning module, which learns through Knowledge Distillation from an LLM to generate sub-goals, and an execution module, which learns to accomplish these sub-goals using elementary actions. In detail, we leverage an LLM to annotate an oracle path with a sequence of sub-goals towards completing a goal. Subsequently, we utilize this annotated data to fine-tune both the planning and execution modules. Importantly, neither module relies on real-time access to an LLM during inference, significantly reducing the overall cost associated with LLM interactions to a fixed cost. In ScienceWorld, a challenging and multi-task interactive text environment, our method surpasses standard imitation learning based solely on elementary actions by 16.7% (absolute). Our analysis highlights the efficiency of our approach compared to other LLM-based methods. Our code and annotated data for distillation can be found on GitHub.

2025-02-17

Proceedings of The 3rd Conference on Lifelong Learning Agents (published)

doi.org

arxiv.org

Warmup Generations: A Task-Agnostic Approach for Guiding Sequence-to-Sequence Learning with Unsupervised Initial State Generation

Senyu Li

Zipeng Sun

Jiayi Wang

Xue (Steve) Liu

Pontus Stenetorp

Siva Reddy

David Ifeoluwa Adelani

2025-02-17

ArXiv (preprint)

doi.org

arxiv.org

A Strong Baseline for Molecular Few-Shot Learning

Philippe Formont

Hugo Jeannin

Pablo Piantanida

Ismail Ben Ayed

Few-shot learning has recently attracted significant interest in drug discovery, with a recent, fast-growing literature mostly involving convoluted meta-learning strategies. We revisit the more straightforward fine-tuning approach for molecular data, and propose a regularized quadratic-probe loss based on the Mahalanobis distance. We design a dedicated block-coordinate descent optimizer, which avoids the degenerate solutions of our loss. Interestingly, our simple fine-tuning approach achieves highly competitive performances in comparison to state-of-the-art methods, while being applicable to black-box settings and removing the need for specific episodic pre-training strategies. Furthermore, we introduce a new benchmark to assess the robustness of the competing methods to domain shifts. In this setting, our fine-tuning baseline obtains consistently better results than meta-learning methods.

2025-02-15

TMLR (accepted)

openreview.net

From Markov to Laplace: How Mamba In-Context Learns Markov Chains

Marco Bondaschi

Nived Rajaraman

Xiuying Wei

Kannan Ramchandran

Razvan Pascanu

Caglar Gulcehre

Michael C. Gastpar

Ashok Vardhan Makkuva

While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering a surprising phenomenon: unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal, for all Markovian orders. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.

2025-02-14

ArXiv (preprint)

doi.org

arxiv.org
