On the Identifiability of Causal Abstractions
Sékou-Oumar Kaba
Causal representation learning (CRL) enhances machine learning models' robustness and generalizability by learning structural causal models associated with data-generating processes. We focus on a family of CRL methods that uses contrastive data pairs in the observable space, generated before and after a random, unknown intervention, to identify the latent causal model. Brehmer et al. (2022) showed that this is indeed possible, given that all latent variables can be intervened on individually. However, this is a highly restrictive assumption in many systems. In this work, we instead assume interventions on arbitrary subsets of latent variables, which is more realistic. We introduce a theoretical framework that calculates the degree to which we can identify a causal model, given a set of possible interventions, up to an abstraction that describes the system at a higher level of granularity.
On the Identifiability of Causal Abstractions
Sékou-Oumar Kaba
Causal representation learning methods seek to enhance machine learning models' robustness and generalization capabilities by learning latent representations and causal graphs aligned with the data-generating process. In many systems, fully recovering the true causal structure is challenging because we cannot intervene on all latent variables individually. We introduce a theoretical framework that calculates the degree to which we can identify a causal structure in the more realistic setting of interventions on arbitrary subsets of latent variables. We find that in that case, we can only identify a causal model up to a causal abstraction. These causal abstractions are still meaningful in that they describe the system at a higher level of granularity. Conversely, given a causal abstraction, our framework provides sufficient conditions for its identifiability. Our findings extend existing identifiability results in two areas: those that address abstractions of latent variables without considering graphical structures and those that focus on graphical structures without incorporating their abstractions.
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
Tian Jin
Ahmed Imtiaz Humayun
Utku Evci
Suvinay Subramanian
Amir Yazdanbakhsh
Dan Alistarh
Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.
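As a rough illustration of the idea in this abstract, the sketch below computes an average parameter count for a pruning schedule and plugs it into a Chinchilla-style loss of the form E + A/N^alpha + B/D^beta. Several things here are assumptions rather than details from the abstract: the ramp between the start and end of pruning is taken to be linear, training progress is measured as a simple fraction of steps, and the function names and the constants E, A, B, alpha and beta are arbitrary placeholders, not fitted values.

```python
def average_parameter_count(dense_params, final_sparsity,
                            start_frac=0.25, end_frac=0.75, steps=1000):
    """Average parameter count over training.

    Assumes (hypothetically) that sparsity ramps linearly from 0 to
    final_sparsity between start_frac and end_frac of training.
    """
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps  # midpoint of this step, as a fraction of training
        if t <= start_frac:
            sparsity = 0.0  # still dense
        elif t >= end_frac:
            sparsity = final_sparsity  # pruning finished
        else:
            sparsity = final_sparsity * (t - start_frac) / (end_frac - start_frac)
        total += dense_params * (1.0 - sparsity)
    return total / steps


def chinchilla_style_loss(n_avg, tokens, E, A, B, alpha, beta):
    """Chinchilla-style loss with the average parameter count in place of N."""
    return E + A / n_avg ** alpha + B / tokens ** beta


# Placeholder constants, chosen only to make the example run.
n_avg = average_parameter_count(dense_params=1e9, final_sparsity=0.5)
loss = chinchilla_style_loss(n_avg, tokens=2e10, E=1.7, A=400.0, B=410.0,
                             alpha=0.34, beta=0.28)
print(f"average parameters: {n_avg:.3e}, predicted loss: {loss:.3f}")
```

Under these assumptions, a dense run is the special case `final_sparsity=0.0`, for which the average parameter count equals the dense parameter count and the expression reduces to the usual Chinchilla form.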
On the Modeling Capabilities of Large Language Models for Sequential Decision Making
The Pitfalls of Memorization: When Memorization Hurts Generalization
Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations lead to particularly poor generalization when they are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.
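The abstract does not spell out how the logits are shifted, so the following is only a generic logit-adjustment sketch and not the MAT loss itself. It assumes, hypothetically, that held-out predictions are available as a per-class probability vector and that the shift is the log of those probabilities; the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shift_logits(logits, held_out_probs, eps=1e-8):
    """Add log held-out probabilities to the logits (assumed form of the shift)."""
    return [z + math.log(p + eps) for z, p in zip(logits, held_out_probs)]


def cross_entropy(logits, label):
    """Cross-entropy of the softmax of `logits` against an integer label."""
    return -math.log(softmax(logits)[label])


# A training example whose held-out prediction disagrees with its label (class 1).
logits = [0.0, 0.0]
held_out_probs = [0.9, 0.1]
plain = cross_entropy(logits, label=1)
shifted = cross_entropy(shift_logits(logits, held_out_probs), label=1)
print(f"plain loss: {plain:.3f}, shifted loss: {shifted:.3f}")
```

In this toy setup the shifted loss is larger for the example whose held-out prediction disagrees with its label, so such an example would keep contributing to the training signal even when the unshifted loss is already small.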
The Size of Teachers as a Measure of Data Complexity: PAC-Bayes Excess Risk Bounds and Scaling Laws
We study the generalization properties of randomly initialized neural networks, under the assumption that the network is larger than some unknown "teacher" network that achieves low risk. We extend the analysis of Buzaglo et al. (2024) to allow for student networks of arbitrary width and depth, and to the setting where no (small) teacher network perfectly interpolates the data. We obtain an oracle inequality, relating the risk of Gibbs posterior sampling to that of narrow teacher networks. As a result, the sample complexity is once again bounded in terms of the size of narrow teacher networks that themselves achieve small risk. We then introduce a new notion of data complexity, based on the minimal size of a teacher network required to achieve a certain level of excess risk. By comparing the scaling laws resulting from our bounds to those observed in empirical studies, we are able to estimate the data complexity of standard benchmarks according to our measure.
On the Transfer of Object-Centric Representation Learning
Aniket Rajiv Didolkar
Andrii Zadaianchuk
Anirudh Goyal
Michael Curtis Mozer
Georg Martius
Maximilian Seitzer
The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities into individual vectors. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing features from pre-trained foundation models like DINO. However, so far, these object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the underlying foundation models, which have been shown to be applicable to a wide range of data and tasks. Thus, in this work, we answer the question of whether current real-world capable object-centric methods exhibit similar levels of transferability by introducing a benchmark comprising seven different synthetic and real-world datasets. We analyze the factors influencing performance under transfer and find that training on diverse real-world images improves generalization to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.
Towards General-Purpose Model-Free Reinforcement Learning
Scott Fujimoto
Amy Zhang
Yuandong Tian
Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.
Towards Improving Exploration through Sibling Augmented GFlowNets
Exploration is a key factor for the success of an active learning agent, especially when dealing with sparse extrinsic terminal rewards and long trajectories. We introduce Sibling Augmented Generative Flow Networks (SA-GFN), a novel framework designed to enhance exploration and training efficiency of Generative Flow Networks (GFlowNets). SA-GFN uses a decoupled dual network architecture, comprising a main Behavior Network and an exploratory Sibling Network, to enable a diverse exploration of the underlying distribution using intrinsic rewards. Inspired by ideas on exploration from reinforcement learning, SA-GFN provides a general-purpose exploration and learning paradigm that integrates with multiple GFlowNet training objectives and is especially helpful for exploration over a wide range of sparse or low-reward distributions and task structures. An extensive set of experiments across a diverse range of tasks, reward structures and trajectory lengths, along with a thorough set of ablations, demonstrates the superior performance of SA-GFN in terms of exploration efficacy and convergence speed compared to existing methods. In addition, SA-GFN's versatility and compatibility with different GFlowNet training objectives and intrinsic reward methods underscore its broad applicability in various problem domains.