Publications

Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning

Samuel Garcin

Trevor McInroe

Pablo Samuel Castro

Prakash Panangaden

Christopher G. Lucas

David Abel

Stefano V Albrecht

2025-01-21

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Surprising Effectiveness of pretraining Ternary Language Model at Scale

Ayush Kaushal

Tejas Vaidhya

Arnab Mondal

Tejas Pandey

Aaryan Bhagat

Irina Rish

Rapid advancements in GPU computational power have outpaced memory capacity and bandwidth growth, creating bottlenecks in Large Language Model (LLM) inference. Post-training quantization is the leading method for addressing memory-related bottlenecks in LLM inference, but it suffers from significant performance degradation below 4-bit precision. This paper addresses these challenges by investigating the pretraining of low-bitwidth models, specifically Ternary Language Models (TriLMs), as an alternative to traditional floating-point models (FloatLMs) and their post-training quantized versions (QuantLMs). We present the Spectra LLM suite, the first open suite of LLMs spanning multiple bit-widths, including FloatLMs, QuantLMs, and TriLMs, ranging from 99M to 3.9B parameters trained on 300B tokens. Our comprehensive evaluation demonstrates that TriLMs offer superior scaling behavior in terms of model size (in bits). Surprisingly, at scales exceeding one billion parameters, TriLMs consistently outperform their QuantLM and FloatLM counterparts for a given bit size across various benchmarks. Notably, the 3.9B parameter TriLM matches the performance of the FloatLM 3.9B across all benchmarks, despite having fewer bits than FloatLM 830M. Overall, this research provides valuable insights into the feasibility and scalability of low-bitwidth language models, paving the way for the development of more efficient LLMs.
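The claim that a 3.9B-parameter TriLM has fewer bits than an 830M-parameter FloatLM can be sanity-checked with rough arithmetic. The sketch below is only a back-of-the-envelope estimate under assumptions not stated in the abstract: every ternary weight is counted at its information-theoretic cost of log2(3) bits, every floating-point weight at 16 bits, and any layers kept at higher precision or any per-tensor scale factors are ignored. The function and constant names are illustrative.

```python
import math

# Assumption: an ideal ternary weight {-1, 0, +1} costs log2(3) bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)
# Assumption: floating-point weights are stored at 16-bit precision.
BITS_PER_FLOAT_WEIGHT = 16


def model_size_bits(num_params: float, bits_per_weight: float) -> float:
    """Rough model size in bits, treating every parameter identically."""
    return num_params * bits_per_weight


trilm_3_9b_bits = model_size_bits(3.9e9, BITS_PER_TERNARY_WEIGHT)
floatlm_830m_bits = model_size_bits(830e6, BITS_PER_FLOAT_WEIGHT)

# Convert bits to gigabytes (8e9 bits per GB) for readability.
print(f"TriLM 3.9B:   {trilm_3_9b_bits / 8e9:.2f} GB")
print(f"FloatLM 830M: {floatlm_830m_bits / 8e9:.2f} GB")
print("TriLM smaller:", trilm_3_9b_bits < floatlm_830m_bits)
```

Under these simplifications the larger ternary model comes out at roughly half the size of the smaller floating-point one; the exact figures for the released models may differ.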

2025-01-21

International Conference on Learning Representations (spotlight)

openreview.net

SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models

Daniel Levy

Siba Smarak Panigrahi

Sékou-Oumar Kaba

Qiang Zhu

Kin Long Kelvin Lee

Mikhail Galkin

Santiago Miret

Siamak Ravanbakhsh

Generating novel crystalline materials has the potential to lead to advancements in fields such as electronics, energy storage, and catalysis. The defining characteristic of crystals is their symmetry, which plays a central role in determining their physical properties. However, existing crystal generation methods either fail to generate materials that display the symmetries of real-world crystals, or simply replicate the symmetry information from examples in a database. To address this limitation, we propose SymmCD, a novel diffusion-based generative model that explicitly incorporates crystallographic symmetry into the generative process. We decompose crystals into two components and learn their joint distribution through diffusion: 1) the asymmetric unit, the smallest subset of the crystal which can generate the whole crystal through symmetry transformations, and 2) the symmetry transformations needed to be applied to each atom in the asymmetric unit. We also use a novel and interpretable representation for these transformations, enabling generalization across different crystallographic symmetry groups. We showcase the competitive performance of SymmCD on a subset of the Materials Project, obtaining diverse and valid crystals with realistic symmetries and predicted properties.

2025-01-21

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

João Loula

Benjamin LeBrun

Li Du

Ben Lipkin

Clemente Pasti

Gabriel Grand

Tianyu Liu

Yahya Emara

Marjorie Freedman

Jason Eisner

Ryan Cotterell

Vikash Mansinghka

Alexander K. Lew

Tim Vieira

Timothy J. O'Donnell

A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution (which can differ substantially from the LM’s base distribution) is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains (Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis), we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/gen-parse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.

2025-01-21

ICLR.cc/2025/Conference (oral)

openreview.net

TeD-Loc: Text Distillation for Weakly Supervised Object Localization

Shakeeb Murtaza

Soufiane Belharbi

Marco Pedersoli

Eric Granger

Weakly supervised object localization (WSOL) using classification models trained with only image-class labels remains an important challenge in computer vision. Given their reliance on classification objectives, traditional WSOL methods like class activation mapping focus on the most discriminative object parts, often missing the full spatial extent. In contrast, recent WSOL methods based on vision-language models like CLIP require ground truth classes or external classifiers to produce a localization map, limiting their deployment in downstream tasks. Moreover, methods like GenPromp attempt to address these issues but introduce considerable complexity due to their reliance on conditional denoising processes and intricate prompt learning. This paper introduces Text Distillation for Localization (TeD-Loc), an approach that directly distills knowledge from CLIP text embeddings into the model backbone and produces patch-level localization. Multiple instance learning of these image patches allows for accurate localization and classification using one model without requiring external classifiers. Such integration of textual and visual modalities addresses the longstanding challenge of achieving accurate localization and classification concurrently, as WSOL methods in the literature typically converge at different epochs. Extensive experiments show that leveraging text embeddings and localization cues provides a cost-effective WSOL model. TeD-Loc improves Top-1 LOC accuracy over state-of-the-art models by about 5% on both CUB and ILSVRC datasets, while significantly reducing computational complexity compared to GenPromp.

2025-01-21

ArXiv (preprint)

doi.org

arxiv.org

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

Tian Jin

Ahmed Imtiaz Humayun

Utku Evci

Suvinay Subramanian

Amir Yazdanbakhsh

Dan Alistarh

Gintare Karolina Dziugaite

Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.
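One way to picture "average parameter count over pre-training" is to average the number of active parameters across a pruning schedule. The sketch below is an illustration only, not the paper's method: it assumes a hypothetical linear sparsity ramp between 25% and 75% of training, treats fraction of training progress as interchangeable with fraction of compute, and uses made-up values for the dense size and final sparsity. All function names are illustrative.

```python
def param_count_at(progress: float, dense_params: float, final_sparsity: float,
                   start: float = 0.25, end: float = 0.75) -> float:
    """Active parameters at a training progress in [0, 1].

    Assumption: sparsity ramps linearly from 0 to final_sparsity
    between `start` and `end`; the abstract does not specify the shape.
    """
    if progress <= start:
        sparsity = 0.0
    elif progress >= end:
        sparsity = final_sparsity
    else:
        sparsity = final_sparsity * (progress - start) / (end - start)
    return dense_params * (1.0 - sparsity)


def average_param_count(dense_params: float, final_sparsity: float,
                        num_steps: int = 10_000) -> float:
    """Mean active-parameter count over evenly spaced midpoints of training."""
    total = sum(
        param_count_at((i + 0.5) / num_steps, dense_params, final_sparsity)
        for i in range(num_steps)
    )
    return total / num_steps


# Hypothetical example: a 1B-parameter model pruned to 80% sparsity.
avg = average_param_count(dense_params=1e9, final_sparsity=0.8)
print(f"Final parameter count:   {param_count_at(1.0, 1e9, 0.8):.3e}")
print(f"Average parameter count: {avg:.3e}")
```

With this schedule the average sits well above the final sparse size, which is the kind of quantity the proposed scaling law substitutes for a fixed parameter count.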

2025-01-21

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

On the Modeling Capabilities of Large Language Models for Sequential Decision Making

Martin Klissarov

R Devon Hjelm

Alexander T Toshev

Bogdan Mazoure

2025-01-21

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

The Pitfalls of Memorization: When Memorization Hurts Generalization

Reza Bayat

Mohammad Pezeshki

Elvis Dohmatob

David Lopez-Paz

Pascal Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations lead to particularly poor generalization when combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.

2025-01-21

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

On the Transfer of Object-Centric Representation Learning

Aniket Rajiv Didolkar

Andrii Zadaianchuk

Anirudh Goyal

Michael Curtis Mozer

Yoshua Bengio

Georg Martius

Maximilian Seitzer

2025-01-21

ICLR.cc/2025/Conference (poster)

openreview.net

Towards General-Purpose Model-Free Reinforcement Learning

Scott Fujimoto

Pierluca D'Oro

Amy Zhang

Yuandong Tian

Michael G. Rabbat

Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.

2025-01-21

ICLR.cc/2025/Conference (spotlight)

doi.org

openreview.net

Towards Improving Exploration Through Sibling Augmented GFlowNets

2025-01-21

ICLR.cc/2025/Conference (poster)