Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
Md Rifat Arefin
Gopeshh Subbaraj
Nicolas Gontier
Yann LeCun
Ravid Shwartz-Ziv
Solving Hidden Monotone Variational Inequalities with Surrogate Losses
Ryan D'Orazio
Danilo Vucetic
Zichu Liu
Junhyung Lyle Kim
Deep learning has proven to be effective in a wide variety of loss minimization problems. However, many applications of interest, like minim… (see more)izing projected Bellman error and min-max optimization, cannot be modelled as minimizing a scalar loss function but instead correspond to solving a variational inequality (VI) problem. This difference in setting has caused many practical challenges as naive gradient-based approaches from supervised learning tend to diverge and cycle in the VI case. In this work, we propose a principled surrogate-based approach compatible with deep learning to solve VIs. We show that our surrogate-based approach has three main benefits: (1) under assumptions that are realistic in practice (when hidden monotone structure is present, interpolation, and sufficient optimization of the surrogates), it guarantees convergence, (2) it provides a unifying perspective of existing methods, and (3) is amenable to existing deep learning optimizers like ADAM. Experimentally, we demonstrate our surrogate-based approach is effective in min-max optimization and minimizing projected Bellman error. Furthermore, in the deep reinforcement learning case, we propose a novel variant of TD(0) which is more compute and sample efficient.
Structure Language Models for Protein Conformation Generation
Jiarui Lu
Xiaoyin Chen
Stephen Zhewen Lu
Chence Shi
Hongyu Guo
Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning
Samuel Garcin
Trevor McInroe
Christopher G. Lucas
David Abel
Stefano V Albrecht
Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning
Samuel Garcin
Trevor McInroe
Christopher G. Lucas
David Abel
Stefano V Albrecht
Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents… (see more). Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we here explore the principles that underlie effective representations for an actor and for a critic. We focus our study on understanding whether an actor and a critic will benefit from a decoupled, rather than shared, representation. Our primary finding is that when decoupled, the representations for the actor and critic systematically specialise in extracting different types of information from the environment---the actor's representation tends to focus on action-relevant information, while the critic's representation specialises in encoding value and dynamics information. Finally, we demonstrate how these insights help select representation learning objectives that play into the actor's and critic's respective knowledge specialisations, and improve performance in terms of agent returns.
SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models
Daniel Levy
Siba Smarak Panigrahi
Sékou-Oumar Kaba
Qiang Zhu
Kin Long Kelvin Lee
Mikhail Galkin
Santiago Miret
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
João Loula
Benjamin LeBrun
Li Du
Ben Lipkin
Clemente Pasti
Gabriel Grand
Tianyu Liu
Yahya Emara
Marjorie Freedman
Jason Eisner
Ryan Cotterell
Vikash Mansinghka
Alexander K. Lew
Tim Vieira
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be… (see more) naturally framed as probabilistic conditioning, but exact generation from the resulting distribution—which can differ substantially from the LM’s base distribution—is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains—Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis—we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/genparse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
João Loula
Benjamin LeBrun
Li Du
Ben Lipkin
Clemente Pasti
Gabriel Grand
Tianyu Liu
Yahya Emara
Marjorie Freedman
Jason Eisner
Ryan Cotterell
Vikash Mansinghka
Alexander K. Lew
Tim Vieira
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be… (see more) naturally framed as probabilistic conditioning, but exact generation from the resulting distribution—which can differ substantially from the LM’s base distribution—is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains—Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis—we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/gen-parse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
TeD-Loc: Text Distillation for Weakly Supervised Object Localization
Shakeeb Murtaza
Soufiane Belharbi
Eric Granger
Weakly supervised object localization (WSOL) using classification models trained with only image-class labels remains an important challenge… (see more) in computer vision. Given their reliance on classification objectives, traditional WSOL methods like class activation mapping focus on the most discriminative object parts, often missing the full spatial extent. In contrast, recent WSOL methods based on vision-language models like CLIP require ground truth classes or external classifiers to produce a localization map, limiting their deployment in downstream tasks. Moreover, methods like GenPromp attempt to address these issues but introduce considerable complexity due to their reliance on conditional denoising processes and intricate prompt learning. This paper introduces Text Distillation for Localization (TeD-Loc), an approach that directly distills knowledge from CLIP text embeddings into the model backbone and produces patch-level localization. Multiple instance learning of these image patches allows for accurate localization and classification using one model without requiring external classifiers. Such integration of textual and visual modalities addresses the longstanding challenge of achieving accurate localization and classification concurrently, as WSOL methods in the literature typically converge at different epochs. Extensive experiments show that leveraging text embeddings and localization cues provides a cost-effective WSOL model. TeD-Loc improves Top-1 LOC accuracy over state-of-the-art models by about 5% on both CUB and ILSVRC datasets, while significantly reducing computational complexity compared to GenPromp.
On the Identifiability of Causal Abstractions
Xiusi Li
Sékou-Oumar Kaba
Causal representation learning (CRL) enhances machine learning models' robustness and generalizability by learning structural causal models … (see more)associated with data-generating processes. We focus on a family of CRL methods that uses contrastive data pairs in the observable space, generated before and after a random, unknown intervention, to identify the latent causal model. (Brehmer et al., 2022) showed that this is indeed possible, given that all latent variables can be intervened on individually. However, this is a highly restrictive assumption in many systems. In this work, we instead assume interventions on arbitrary subsets of latent variables, which is more realistic. We introduce a theoretical framework that calculates the degree to which we can identify a causal model, given a set of possible interventions, up to an abstraction that describes the system at a higher level of granularity.
On the Identifiability of Causal Abstractions
Xiusi Li
Sékou-Oumar Kaba
Causal representation learning methods seek to enhance machine learning models' robustness and generalization capabilities by learning laten… (see more)t representations and causal graphs aligned with the data generating process. In many systems, fully recovering the true causal structure is challenging because we cannot intervene on all latent variables individually. We introduce a theoretical framework that calculates the degree to which we can identify a causal structure in the more realistic setting of interventions on arbitrary subsets of latent variables. We find that in that case, we can only identify a causal model up to a \emph{causal abstraction}. These causal abstractions are still meaningful in that they describe the system at a higher level of granularity. Conversely, given a causal abstraction, our framework provides sufficient conditions for its identifiability. Our findings extend existing identifiability results in two areas: those that address abstractions of latent variables without considering graphical structures and those that focus on graphical structures without incorporating their abstractions.
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
Tian Jin
Ahmed Imtiaz Humayun
Utku Evci
Suvinay Subramanian
Amir Yazdanbakhsh
Dan Alistarh
Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large l… (see more)anguage models (LLMs). While many focus on post-training pruning, sparse pre-training--which combines pruning and pre-training into a single phase--provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.