Reza Bayat

Sungnyun Kim

Jiyoun Ha

Tal Schuster

Adam Fisch

Hrayr Harutyunyan

Ziwei Ji

Aaron Courville

Se-Young Yun

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to further decrease memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

Steering Large Language Model Activations in Sparse Spaces

Ali Rahimi-Kalahroudi

Mohammad Pezeshki

A. Chandar

P Vincent

2025-07-07

Conference on Language Modeling (accepted)

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Thinking

Sangmin Bae

Yujin Kim

Sungnyun Kim

Jiyoun Ha

Tal Schuster

Adam Fisch

Hrayr Harutyunyan

Ziwei Ji

Aaron Courville

Se-Young Yun

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning recursion depths to tokens, thereby focusing quadratic attention computation only where it is most useful. Further enhancing its efficiency, MoR incorporates a recursion-wise key-value caching mechanism that eliminates redundant memory access across recursion steps by selectively storing only the key-value caches for designated tokens. Across pretraining runs at model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.

2025-06-10

ICML.cc/2025/Workshop/ES-FoMo-III (published)

Learning adversarially robust kernel ensembles with kernel average pooling.

Pouya Bashivan

Adam Ibrahim

Amirozhan Dehghani

Yifei Ren

2025-02-28

Expert Systems with Applications (published)

The Pitfalls of Memorization: When Memorization Hurts Generalization

Mohammad Pezeshki

Elvis Dohmatob

David Lopez-Paz

Pascal Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations lead to particularly poor generalization when they are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.

2025-01-21

ICLR.cc/2025/Conference (poster)

The Pitfalls of Memorization: When Memorization Hinders Generalization

Mohammad Pezeshki

Elvis Dopgima Dohmatob

David Lopez-Paz

P Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This leads to poor generalization when the learned explanations are spurious. In this work, we formalize

2024-10-09

NeurIPS.cc/2024/Workshop/SciForDL (poster)

Adversarial Training with Synthesized Data: A Path to Robust and Generalizable Neural Networks

Irina Rish

Adversarial Training (AT) is a well-known framework designed to mitigate adversarial vulnerabilities in neural networks. Recent research indicates that incorporating adversarial examples (AEs) in training can enhance models' generalization capabilities. To understand the impact of AEs on learning dynamics, we study AT through the lens of sample difficulty methodologies. Our findings show that AT leads to more stable learning dynamics compared to Natural Training (NT), resulting in gradual performance improvements and less overconfident predictions. This suggests that AT steers training away from learning easy, perturbable spurious features toward more resilient and generalizable ones. However, a trade-off exists between adversarial robustness and generalization gains, due to robust overfitting, limiting practical deployment. To address this, we propose using synthesized data to bridge this gap. Our results demonstrate that AT benefits significantly from synthesized data, whereas NT does not, enhancing generalization without compromising robustness and offering new avenues for developing robust and generalizable models.

2024-06-27

ICML.cc/2024/Workshop/NextGenAISafety (poster)

Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

Daniel Z Kaplan

Vision-Language Models (VLMs) have witnessed a surge in both research and real-world applications. However, as they become increasingly prevalent, ensuring their robustness against adversarial attacks is paramount. This work systematically investigates the impact of model design choices on the adversarial robustness of VLMs against image-based attacks. Additionally, we introduce novel, cost-effective approaches to enhance robustness through prompt formatting. By rephrasing questions and suggesting potential adversarial perturbations, we demonstrate substantial improvements in model robustness against strong image-based attacks such as Auto-PGD. Our findings provide important guidelines for developing more robust VLMs, particularly for deployment in safety-critical environments.

2024-06-27

ICML.cc/2024/Workshop/NextGenAISafety (poster)

Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones

Mehrnaz Mofakhami

Ioannis Mitliagkas

Joao Monteiro

Valentina Zantedeschi

2024-06-19

ICML.cc/2024/Workshop/ES-FoMo-II (poster)

Adversarial Feature Desensitization

Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/BashivanLab/afd.

2020-12-31

Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (published)