Pascal Vincent

2025-02-28

ArXiv (preprint)

Steering Large Language Model Activations in Sparse Spaces

Reza Bayat

Ali Rahimi-Kalahroudi

Mohammad Pezeshki

Sarath Chandar

A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations provide an untapped opportunity for more interpretable behavior modulation. In this work, we introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer LLM behavior in sparse spaces. By isolating behavior-specific features through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable nuanced behavioral modulation and finer-grained control. Furthermore, scaling SAEs improves monosemanticity of SAS vectors, suggesting more reliable and interpretable interventions.

2025-02-28

ArXiv (preprint)

MaestroMotif: Skill Design from Artificial Intelligence Feedback

Martin Klissarov

Mikael Henaff

Roberta Raileanu

Shagun Sodhani

Amy Zhang

Pierre-Luc Bacon

Doina Precup

Marlos C. Machado

Pierluca D'Oro

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2025-01-22

ICLR.cc/2025/Conference (oral presentation)

The Pitfalls of Memorization: When Memorization Hurts Generalization

Reza Bayat

Mohammad Pezeshki

Elvis Dohmatob

David Lopez-Paz

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations lead to particularly poor generalization when they are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.

2025-01-22

ICLR.cc/2025/Conference (poster)

MaestroMotif: Skill Design from Artificial Intelligence Feedback

Martin Klissarov

Mikael Henaff

Roberta Raileanu

Shagun Sodhani

Amy Zhang

Pierre-Luc Bacon

Doina Precup

Marlos C. Machado

Pierluca D'Oro

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2024-12-11

ArXiv (preprint)

MaestroMotif: Skill Design from Artificial Intelligence Feedback

Martin Klissarov

Mikael Henaff

Roberta Raileanu

Shagun Sodhani

Amy Zhang

Pierre-Luc Bacon

Doina Precup

Marlos C. Machado

Pierluca D'Oro

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2024-12-11

ArXiv (preprint)

Compositional Risk Minimization

Divyat Mahajan

Mohammad Pezeshki

Ioannis Mitliagkas

Kartik Ahuja

2024-10-10

NeurIPS.cc/2024/Workshop/Compositional_Learning (poster)

The Pitfalls of Memorization: When Memorization Hinders Generalization

Reza Bayat

Mohammad Pezeshki

Elvis Dohmatob

David Lopez-Paz

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This leads to poor generalization when the learned explanations are spurious. In this work, we formalize

2024-10-10

NeurIPS.cc/2024/Workshop/SciForDL (poster)

The Pitfalls of Memorization: When Memorization Hinders Generalization

Reza Bayat

Mohammad Pezeshki

Elvis Dohmatob

David Lopez-Paz

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This leads to poor generalization when the learned explanations are spurious. In this work, we formalize

2024-10-10

NeurIPS.cc/2024/Workshop/SciForDL (poster)

Stochastic positional embeddings improve masked image modeling

Amir Bar

Florian Bordes

Assaf Shocher

Mahmoud Assran

Nicolas Ballas

Trevor Darrell

Amir Globerson

Yann LeCun

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

Stochastic positional embeddings improve masked image modeling

Amir Bar

Florian Bordes

Assaf Shocher

Mahmoud Assran

Nicolas Ballas

Trevor Darrell

Amir Globerson

Yann LeCun

Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, StoP improves downstream MIM performance on a variety of downstream tasks, including

2024-05-01

ICML.cc/2024/Conference (poster)

On the Identifiability of Quantized Factors

Vitória Barin Pacela

Kartik Ahuja

Simon Lacoste-Julien

Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution alone, and is formalized through the theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.

2024-03-15

Proceedings of the Third Conference on Causal Learning and Reasoning (published)

proceedings.mlr.press