Publications

Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning

Samuel Garcin

Trevor McInroe

Pablo Samuel Castro

Christopher G. Lucas

David Abel

Prakash Panangaden

Stefano V Albrecht

Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents. Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we here explore the principles that underlie effective representations for an actor and for a critic. We focus our study on understanding whether an actor and a critic will benefit from a decoupled, rather than shared, representation. Our primary finding is that when decoupled, the representations for the actor and critic systematically specialise in extracting different types of information from the environment: the actor's representation tends to focus on action-relevant information, while the critic's representation specialises in encoding value and dynamics information. Finally, we demonstrate how these insights help select representation learning objectives that play into the actor's and critic's respective knowledge specialisations, and improve performance in terms of agent returns.

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models

Daniel Levy

Siba Smarak Panigrahi

Sékou-Oumar Kaba

Qiang Zhu

Kin Long Kelvin Lee

Mikhail Galkin

Santiago Miret

Siamak Ravanbakhsh

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

João Loula

Benjamin LeBrun

Li Du

Ben Lipkin

Clemente Pasti

Gabriel Grand

Tianyu Liu

Yahya Emara

Marjorie Freedman

Jason Eisner

Ryan Cotterell

Vikash Mansinghka

Alexander K. Lew

Tim Vieira

Timothy O'Donnell

A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution, which can differ substantially from the LM’s base distribution, is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains (Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis), we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/genparse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.

2025-01-22

ICLR.cc/2025/Conference (oral presentation)

openreview.net
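
The abstract above frames constrained generation as probabilistic conditioning approximated with sequential Monte Carlo. As a rough illustration of that general idea only, the sketch below runs a toy SMC loop in which particles are extended by sampling from a language model, reweighted by a constraint, and resampled when the weights degenerate. Everything here is an assumption made for illustration: `toy_lm`, `potential`, the three-token vocabulary, and the effective-sample-size threshold are hypothetical stand-ins, and none of this reflects the authors' system or its API.

```python
import random


def toy_lm(prefix):
    # Hypothetical stand-in for a language model: a fixed next-token
    # distribution. A real LM would condition on the prefix.
    return {"a": 0.4, "b": 0.4, "<eos>": 0.2}


def potential(tokens):
    # Hypothetical 0/1 constraint on a prefix: reject "b" twice in a row.
    violated = any(x == "b" and y == "b" for x, y in zip(tokens, tokens[1:]))
    return 0.0 if violated else 1.0


def effective_sample_size(weights):
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return total * total / sum(w * w for w in weights)


def smc_generate(n_particles=200, max_len=8, seed=0):
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]
    weights = [1.0] * n_particles
    for _ in range(max_len):
        for i, tokens in enumerate(particles):
            finished = bool(tokens) and tokens[-1] == "<eos>"
            if weights[i] == 0.0 or finished:
                continue  # dead or completed particles are not extended
            probs = toy_lm(tokens)
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
            tokens.append(token)
            # The proposal is the LM itself, so the incremental weight is
            # just the constraint evaluated on the extended prefix.
            weights[i] *= potential(tokens)
        if sum(weights) == 0.0:
            raise RuntimeError("all particles violated the constraint")
        # Resample when the weights degenerate, moving effort to live particles.
        if effective_sample_size(weights) < n_particles / 2:
            idx = rng.choices(range(n_particles), weights=weights, k=n_particles)
            particles = [list(particles[j]) for j in idx]
            mean_weight = sum(weights) / n_particles
            weights = [mean_weight] * n_particles
    return particles, weights


if __name__ == "__main__":
    particles, weights = smc_generate()
    alive = [p for p, w in zip(particles, weights) if w > 0.0]
    print(len(alive), "particles satisfy the constraint")
```

In this toy setting the surviving weighted particles approximate the base distribution restricted to sequences that satisfy the constraint. Resampling is what shifts computation away from prefixes that have already failed.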

TeD-Loc: Text Distillation for Weakly Supervised Object Localization

Shakeeb Murtaza

Soufiane Belharbi

Marco Pedersoli

Eric Granger

Weakly supervised object localization (WSOL) using classification models trained with only image-class labels remains an important challenge in computer vision. Given their reliance on classification objectives, traditional WSOL methods like class activation mapping focus on the most discriminative object parts, often missing the full spatial extent. In contrast, recent WSOL methods based on vision-language models like CLIP require ground truth classes or external classifiers to produce a localization map, limiting their deployment in downstream tasks. Moreover, methods like GenPromp attempt to address these issues but introduce considerable complexity due to their reliance on conditional denoising processes and intricate prompt learning. This paper introduces Text Distillation for Localization (TeD-Loc), an approach that directly distills knowledge from CLIP text embeddings into the model backbone and produces patch-level localization. Multiple instance learning of these image patches allows for accurate localization and classification using one model without requiring external classifiers. Such integration of textual and visual modalities addresses the longstanding challenge of achieving accurate localization and classification concurrently, as WSOL methods in the literature typically converge at different epochs. Extensive experiments show that leveraging text embeddings and localization cues provides a cost-effective WSOL model. TeD-Loc improves Top-1 LOC accuracy over state-of-the-art models by about 5% on both CUB and ILSVRC datasets, while significantly reducing computational complexity compared to GenPromp.

2025-01-22

ArXiv (preprint)

doi.org

arxiv.org

On the Identifiability of Causal Abstractions

Xiusi Li

Sékou-Oumar Kaba

Siamak Ravanbakhsh

Causal representation learning (CRL) enhances machine learning models' robustness and generalizability by learning structural causal models associated with data-generating processes. We focus on a family of CRL methods that uses contrastive data pairs in the observable space, generated before and after a random, unknown intervention, to identify the latent causal model. Brehmer et al. (2022) showed that this is indeed possible, given that all latent variables can be intervened on individually. However, this is a highly restrictive assumption in many systems. In this work, we instead assume interventions on arbitrary subsets of latent variables, which is more realistic. We introduce a theoretical framework that calculates the degree to which we can identify a causal model, given a set of possible interventions, up to an abstraction that describes the system at a higher level of granularity.

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

doi.org

openreview.net

On the Identifiability of Causal Abstractions

Xiusi Li

Sékou-Oumar Kaba

Siamak Ravanbakhsh

Causal representation learning methods seek to enhance machine learning models' robustness and generalization capabilities by learning latent representations and causal graphs aligned with the data generating process. In many systems, fully recovering the true causal structure is challenging because we cannot intervene on all latent variables individually. We introduce a theoretical framework that calculates the degree to which we can identify a causal structure in the more realistic setting of interventions on arbitrary subsets of latent variables. We find that in that case, we can only identify a causal model up to a causal abstraction. These causal abstractions are still meaningful in that they describe the system at a higher level of granularity. Conversely, given a causal abstraction, our framework provides sufficient conditions for its identifiability. Our findings extend existing identifiability results in two areas: those that address abstractions of latent variables without considering graphical structures and those that focus on graphical structures without incorporating their abstractions.

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

openreview.net

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

Tian Jin

Ahmed Imtiaz Humayun

Utku Evci

Suvinay Subramanian

Amir Yazdanbakhsh

Dan Alistarh

Gintare Karolina Dziugaite

Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net
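
To make the idea of an "average parameter count over pre-training" concrete, the sketch below averages a model's parameter count over a pruning schedule and plugs the result into a Chinchilla-style loss of the form E + A/N^alpha + B/D^beta. It is a sketch under stated assumptions, not the paper's fitted law: the linear pruning ramp, the use of training-step fraction in place of compute fraction, and the function names are hypothetical, and the constants are the values reported by Hoffmann et al. (2022) for the original Chinchilla fit, used here purely for illustration.

```python
def param_count(step_frac, dense_params, final_sparsity, start=0.25, end=0.75):
    """Parameter count at a given fraction of training.

    Assumes (for illustration) a linear pruning ramp between `start` and `end`.
    """
    if step_frac <= start:
        kept = 1.0
    elif step_frac >= end:
        kept = 1.0 - final_sparsity
    else:
        progress = (step_frac - start) / (end - start)
        kept = 1.0 - final_sparsity * progress
    return dense_params * kept


def average_param_count(dense_params, final_sparsity, n_steps=10_000):
    """Mean parameter count over training, using a midpoint rule."""
    total = 0.0
    for i in range(n_steps):
        total += param_count((i + 0.5) / n_steps, dense_params, final_sparsity)
    return total / n_steps


def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style loss; constants are illustrative, not refitted."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


if __name__ == "__main__":
    dense, sparsity, tokens = 1e9, 0.8, 2e10
    n_avg = average_param_count(dense, sparsity)
    print(f"average parameter count: {n_avg:.3e}")
    print(f"predicted loss with N_avg: {chinchilla_loss(n_avg, tokens):.4f}")
    print(f"predicted loss, dense N:   {chinchilla_loss(dense, tokens):.4f}")
```

Under these assumptions, a model pruned to 80% sparsity spends a quarter of training dense, half of it ramping down, and a quarter at its final size, so its average parameter count is 60% of the dense count.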

On the Modeling Capabilities of Large Language Models for Sequential Decision Making

Martin Klissarov

(Rex) Devon Hjelm

Alexander T Toshev

Bogdan Mazoure

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

The Pitfalls of Memorization: When Memorization Hurts Generalization

David Lopez-Paz

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations lead to particularly poor generalization when they are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.

2025-01-22

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

The Size of Teachers as a Measure of Data Complexity: PAC-Bayes Excess Risk Bounds and Scaling Laws

Gintare Karolina Dziugaite

Daniel M. Roy

We study the generalization properties of randomly initialized neural networks, under the assumption that the network is larger than some unknown "teacher" network that achieves low risk. We extend the analysis of Buzaglo et al. (2024) to allow for student networks of arbitrary width and depth, and to the setting where no (small) teacher network perfectly interpolates the data. We obtain an oracle inequality, relating the risk of Gibbs posterior sampling to that of narrow teacher networks. As a result, the sample complexity is once again bounded in terms of the size of narrow teacher networks that themselves achieve small risk. We then introduce a new notion of data complexity, based on the minimal size of a teacher network required to achieve a certain level of excess risk. By comparing the scaling laws resulting from our bounds to those observed in empirical studies, we are able to estimate the data complexity of standard benchmarks according to our measure.

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

openreview.net
