Adam Ibrahim

Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.

2025-06-10

ICML.cc/2025/Workshop/ES-FoMo-III (published)

Learning adversarially robust kernel ensembles with kernel average pooling.

Pouya Bashivan

Reza Bayat

Adam Ibrahim

Amirozhan Dehghani

Yifei Ren

2025-02-28

Expert Systems with Applications (published)

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Quentin Gregory Anthony

Timothee LESORT

Eugene Belilovsky

Irina Rish

2024-07-07

TMLR (accepted)

Continual Pre-Training of Large Language Models: How to (re)warm your model?

Quentin Gregory Anthony

Eugene Belilovsky

Irina Rish

Timothee LESORT

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch
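As a rough illustration of the "linear warmup and cosine decay schedule" named in the abstract above, the following is a minimal sketch of one common formulation. It is not taken from the paper: the function name `warmup_cosine_lr`, its parameters, and the numeric values in the usage note are all hypothetical and chosen only for illustration.

```python
import math


def warmup_cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Hypothetical helper: parameter names and defaults are assumptions,
    not the configuration used in the paper.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup phase.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to at most 1.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine interpolation from max_lr (progress 0) down to min_lr (progress 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With illustrative values `max_lr=3e-4`, `min_lr=3e-5`, `warmup_steps=100`, and `total_steps=1000`, the rate starts at 0, reaches `max_lr` at step 100, and decays to `min_lr` at step 1000. Re-warming for a new dataset, in this sketch, would amount to restarting `step` from 0 on the new data.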

2023-06-19

ICML.cc/2023/Workshop/ES-FoMO (poster)

Towards Out-of-Distribution Adversarial Robustness

Adam Ibrahim

Charles Guille-escuret

Ioannis Mitliagkas

Irina Rish

David Krueger

Pouya Bashivan

Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different

2023-06-19

ICML.cc/2023/Workshop/AdvML-Frontiers (published)

Learning Robust Kernel Ensembles with Kernel Average Pooling

Pouya Bashivan

Adam Ibrahim

Amirozhan Dehghani

Yifei Ren

2022-09-29

ArXiv (preprint)

Gradient Descent Is Optimal Under Lower Restricted Secant Inequality And Upper Error Bound

Charles Guille-escuret

Baptiste Goujaud

Adam Ibrahim

Ioannis Mitliagkas

The study of first-order optimization is sensitive to the assumptions made on the objective functions. These assumptions induce complexity classes which play a key role in worst-case analysis, including the fundamental concept of algorithm optimality. Recent work argues that strong convexity and smoothness, popular assumptions in literature, lead to a pathological definition of the condition number (Guille-Escuret et al., 2021). Motivated by this result, we focus on the class of functions satisfying a lower restricted secant inequality and an upper error bound. On top of being robust to the aforementioned pathological behavior and including some non-convex functions, this pair of conditions displays interesting geometrical properties. In particular, the necessary and sufficient conditions to interpolate a set of points and their gradients within the class can be separated into simple conditions on each sampled gradient. This allows the performance estimation problem (PEP, Drori and Teboulle (2012)) to be solved analytically, leading to a lower bound on the convergence rate that proves gradient descent to be exactly optimal on this class of functions among all first-order algorithms.

2021-12-31

NeurIPS (published)

Adversarial Feature Desensitization

Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/BashivanLab/afd.

2020-12-31

Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (published)

Linear Lower Bounds and Conditioning of Differentiable Games

Recent successes of game-theoretic formulations in ML have caused a resurgence of research interest in differentiable games. Overwhelmingly, that research focuses on methods and upper bounds on their speed of convergence. In this work, we approach the question of fundamental iteration complexity by providing lower bounds to complement the linear (i.e. geometric) upper bounds observed in the literature on a wide class of problems. We cast saddle-point and min-max problems as 2-player games. We leverage tools from single-objective convex optimisation to propose new linear lower bounds for convex-concave games. Notably, we give a linear lower bound for

2020-11-20

Proceedings of the 37th International Conference on Machine Learning (published)
