Publications

On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy

Saber Malekmohammadi

A significant approach in natural language processing involves large-scale pre-training on general domain data followed by adaptation to specific tasks or domains. As models grow in size, full fine-tuning of all parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g. LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices, called adapters, into some layers of the transformer architecture. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to full fine-tuning of all parameters. In this work, we look at low-rank adaptation through the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t. the adapter parameters coming from their full fine-tuning, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the noise distribution and a Gaussian distribution with the same variance, we show that the dynamics of LoRA and FLoRA are very close to differentially private full fine-tuning of the adapters, which suggests that low-rank adaptation implicitly provides privacy w.r.t. the fine-tuning data. Finally, using the Johnson-Lindenstrauss lemma, we show that when augmented with gradient clipping, low-rank adaptation is almost equivalent to differentially private full fine-tuning of the adapters with a fixed noise scale.

2024-10-10

NeurIPS.cc/2024/Workshop/M3L (poster)

doi.org

openreview.net
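
As a rough illustration of the adapter structure the abstract above describes (frozen pre-trained weights plus trainable low-rank matrices), here is a minimal NumPy sketch. The names `lora_forward` and `trainable_param_counts`, the layer sizes, and the zero initialisation of `B` are assumptions chosen for illustration; the sketch shows only the parameter saving, not the paper's privacy analysis.

```python
import numpy as np

def lora_forward(x, W, A, B):
    """Apply a frozen weight W plus a trainable low-rank update B @ A."""
    return x @ (W + B @ A).T

def trainable_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning count, low-rank adapter count) for one layer."""
    return d_out * d_in, rank * (d_out + d_in)

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 128, 4           # hypothetical layer sizes
W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weights
A = rng.normal(size=(rank, d_in))        # trainable down-projection
B = np.zeros((d_out, rank))              # trainable up-projection (assumed zero init)
x = rng.normal(size=(8, d_in))           # a batch of 8 inputs

y = lora_forward(x, W, A, B)
full, low_rank = trainable_param_counts(d_out, d_in, rank)
```

With these sizes the adapter trains 768 parameters instead of 8192, and because `B` starts at zero the adapted layer initially reproduces the frozen layer exactly.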

The Pitfalls of Memorization: When Memorization Hinders Generalization

Reza Bayat

Mohammad Pezeshki

Elvis Dohmatob

David Lopez-Paz

Pascal Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This leads to poor generalization when the learned explanations are spurious. In this work, we formalize

2024-10-10

NeurIPS.cc/2024/Workshop/SciForDL (poster)

openreview.net

Tight Lower Bounds and Improved Convergence in Performative Prediction

Pedram Khorsandi

Rushil Gupta

Mehrnaz Mofakhami

Simon Lacoste-Julien

Gauthier Gidel

Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previously existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.

2024-10-10

NeurIPS.cc/2024/Workshop/OPT (published)

doi.org

openreview.net
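
To make the notion of a performatively stable point in the abstract above concrete, the following toy sketch runs plain last-iterate RRM on a one-dimensional problem. The setting is an assumption chosen for illustration and is not taken from the paper: deploying `theta` shifts the data mean to `mu + eps * theta`, and the loss is squared error, so each retraining step returns the mean of the induced distribution. The Affine Risk Minimizers proposed in the paper are not implemented here.

```python
def repeated_risk_minimization(mu, eps, theta0, steps):
    """Last-iterate RRM when deploying theta shifts the data mean to mu + eps * theta."""
    thetas = [theta0]
    for _ in range(steps):
        # Under squared loss, the risk minimiser on the induced
        # distribution is its mean, mu + eps * theta_t.
        thetas.append(mu + eps * thetas[-1])
    return thetas

mu, eps = 1.0, 0.5                      # hypothetical base mean and shift strength
thetas = repeated_risk_minimization(mu, eps, theta0=0.0, steps=50)
stable_point = mu / (1 - eps)           # fixed point of theta = mu + eps * theta
```

In this toy case the distance to the stable point shrinks by a factor of `eps` at every retraining round, so the iterates converge whenever `|eps| < 1`.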

TrajGPT: Healthcare Time-Series Representation Learning for Trajectory Prediction

Ziyang Song

Qincheng Lu

Mike He Zhu

David Buckeridge

Yue Li

In many domains, such as healthcare, time-series data is irregularly sampled with varying intervals between observations. This creates challenges for classical time-series models that require equally spaced data. To address this, we propose a novel time-series Transformer called **Trajectory Generative Pre-trained Transformer (TrajGPT)**. It introduces a data-dependent decay mechanism that adaptively forgets irrelevant information based on clinical context. By interpreting TrajGPT as ordinary differential equations (ODEs), our approach captures continuous dynamics from sparse and irregular time-series data. Experimental results show that TrajGPT, with its time-specific inference approach, accurately predicts trajectories without requiring task-specific fine-tuning.

2024-10-10

NeurIPS.cc/2024/Workshop/TSALM (published)

openreview.net

Understanding Adam Requires Better Rotation Dependent Assumptions

Lucas Maes

Tianyue H. Zhang

Alexia Jolicoeur-Martineau

Ioannis Mitliagkas

Damien Scieur

Simon Lacoste-Julien

Charles Guille-Escuret

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We demonstrate that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam's behavior across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning tasks.

2024-10-10

NeurIPS.cc/2024/Workshop/OPT (published)

doi.org

openreview.net
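
The basis dependence discussed in the abstract above can be seen in a single update step. The sketch below is an illustrative toy, not the paper's transformer experiments: it compares one gradient-descent step and one Adam step on a small quadratic, before and after a random rotation `Q` of the parameter space. The quadratic, the dimension, and the helper names are assumptions.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Plain gradient-descent update."""
    return theta - lr * grad

def adam_first_step(theta, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update from zero-initialised moments, with bias correction."""
    m_hat = (1 - beta1) * grad / (1 - beta1)
    v_hat = (1 - beta2) * grad**2 / (1 - beta2)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
dim = 5
H = np.diag(np.arange(1.0, dim + 1))    # loss f(theta) = 0.5 * theta^T H theta
theta = rng.normal(size=dim)
grad = H @ theta

# Random orthogonal matrix from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

# In rotated coordinates theta_rot = Q @ theta, the gradient becomes Q @ grad.
theta_rot, grad_rot = Q @ theta, Q @ grad

# Gap between "step in the rotated basis" and "rotate the step taken in the original basis".
sgd_gap = np.linalg.norm(sgd_step(theta_rot, grad_rot) - Q @ sgd_step(theta, grad))
adam_gap = np.linalg.norm(adam_first_step(theta_rot, grad_rot) - Q @ adam_first_step(theta, grad))
```

The gradient-descent gap is zero up to rounding, while the Adam gap is not, because Adam's per-coordinate normalisation depends on the basis in which the gradient is expressed.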

Understanding Permutation Based Model Merging with Feature Visualizations

Congshu Zou

Geraldin Nanfack

Stefan Horoi

Eugene Belilovsky

Linear mode connectivity (LMC) has become a topic of great interest in recent years. It has been empirically demonstrated that popular deep learning models trained from different initializations exhibit linear mode connectivity up to permutation. Based on this, several approaches for finding a permutation of the model's features or weights have been proposed, leading to several popular methods for model merging. These methods enable the simple averaging of two models to create a new high-performance model. However, besides accuracy, the properties of these models and their relationships to the representations of the models they derive from are poorly understood. In this work, we study the inner mechanisms behind LMC in model merging through the lens of classic feature visualization methods. Focusing on convolutional neural networks (CNNs), we make several observations that shed light on the underlying mechanisms of model merging by permute and average.

2024-10-10

NeurIPS.cc/2024/Workshop/UniReps (accepted)

openreview.net
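
The "permute and average" idea in the abstract above rests on the fact that reordering a network's hidden units does not change the function it computes. The sketch below illustrates this on a small two-layer ReLU network rather than the CNNs studied in the paper. It is a best-case toy in which the second model is exactly a scrambled copy of the first, so the aligning permutation is known; practical methods have to estimate it. All names and sizes are assumptions.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU network."""
    return np.maximum(x @ W1.T + b1, 0.0) @ W2.T + b2

def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units; the network function is unchanged."""
    return W1[perm], b1[perm], W2[:, perm]

def average(params_a, params_b):
    """Element-wise average of two parameter tuples."""
    return tuple(0.5 * (a + b) for a, b in zip(params_a, params_b))

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 3, 6, 2
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)
x = rng.normal(size=(10, d_in))

# Model B: model A with its hidden units scrambled (hypothetical best case).
perm = rng.permutation(d_hidden)
W1_b, b1_b, W2_b = permute_hidden(W1, b1, W2, perm)

# "Permute": undo the scrambling. "Average": mean of the aligned weights.
inverse = np.argsort(perm)
aligned = permute_hidden(W1_b, b1_b, W2_b, inverse) + (b2,)
merged = average((W1, b1, W2, b2), aligned)

# Averaging without alignment mixes unrelated hidden units.
naive = average((W1, b1, W2, b2), (W1_b, b1_b, W2_b, b2))
```

Here `merged` computes the same function as the original model, whereas `naive` generally does not, which is why alignment precedes averaging.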

Visual Language Alignment Tuning

Le Zhang

Qian Yang

Aishwarya Agrawal

2024-10-10

NeurIPS.cc/2024/Workshop/AFM (poster)

openreview.net

WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

Louis Fournier

Adel Nabli

Masih Aminbeidokhti

Marco Pedersoli

Eugene Belilovsky

Edouard Oyallon

The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.

2024-10-10

NeurIPS.cc/2024/Workshop/OPT (published)

doi.org

openreview.net
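
One way to picture the weight shuffling described in the abstract above is sketched below. This is a simplified guess at the mechanism, not the authors' implementation: each model is a flat parameter vector, a coordinate is selected with probability `prob`, and the values at each selected coordinate are permuted across the models. The function name, the selection rule, and the value of `prob` are all assumptions.

```python
import numpy as np

def shuffle_weights(params, prob, rng):
    """Permute values across models at a random subset of coordinates.

    params has shape (n_models, n_weights); a copy is returned.
    """
    params = params.copy()
    n_models, n_weights = params.shape
    selected = np.flatnonzero(rng.random(n_weights) < prob)
    for j in selected:
        # Each selected coordinate gets its own permutation of the models.
        params[:, j] = params[rng.permutation(n_models), j]
    return params, selected

rng = np.random.default_rng(0)
ensemble = rng.normal(size=(4, 1000))    # 4 hypothetical models, 1000 weights each
shuffled, selected = shuffle_weights(ensemble, prob=0.05, rng=rng)
```

In this sketch the shuffle leaves the per-coordinate average of the ensemble unchanged, and only the selected coordinates would need to be exchanged between models.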

AI-Assisted Generation of Difficult Math Questions

Vedant Shah

Dingli Yu

Kaifeng Lyu

Simon Park

Nan Rosemary Ke

Jiatong Yu

Yinghui He

Michael Curtis Mozer

James Lloyd McClelland

Yoshua Bengio

Sanjeev Arora

Anirudh Goyal

Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage LLM metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline on skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH

2024-10-09

NeurIPS.cc/2024/Workshop/MATH-AI (accepted)

doi.org

openreview.net
