Eugene Belilovsky

nicolas.bernier@mila.quebec

nader.asadi@mila.quebec

Maîtrise recherche - Concordia

mohammadreza.davari@mila.quebec

Reza Davari

Doctorat - Concordia

Stagiaire de recherche - Concordia

louis.fournier@mila.quebec

Maîtrise recherche - Concordia

alexander.fulleringer@mila.quebec

Paul Janson

Maîtrise recherche - Concordia

paul.janson@mila.quebec

Charles-Etienne Joseph

Maîtrise recherche - UdeM

Co-superviseur⋅e :

charles-etienne.joseph@mila.quebec

Zafir Khalid

Maîtrise recherche - Concordia

zafir.khaled@mila.quebec

Gwen Legate

Doctorat - Concordia

Co-superviseur⋅e :

gwendolyne.legate@mila.quebec

abhinav.moudgil@mila.quebec

Abhinav Moudgil

Doctorat - Concordia

Adel Nabli

Doctorat - Concordia

adel.nabli@mila.quebec

Geraldin Nanfack

Postdoctorat - Concordia

Co-superviseur⋅e :

geraldin.nanfack@mila.quebec

Albert Orozco Camacho

Doctorat - Concordia

Co-superviseur⋅e :

Maîtrise recherche - Concordia

Co-superviseur⋅e :

paria.mehrbod@mila.quebec

Donald Shenaj

Collaborateur·rice de recherche - Concordia

Co-superviseur⋅e :

donald.shenaj@mila.quebec

Doctorat - Concordia

Co-superviseur⋅e :

vaibhav.singh@mila.quebec

Benjamin Therien

Doctorat - UdeM

Superviseur⋅e principal⋅e :

benjamin.therien@mila.quebec

Collaborateur·rice de recherche - UdeM

Superviseur⋅e principal⋅e :

pedro.vianna@mila.quebec

Humza Wajid Hameed

Maîtrise recherche - Concordia

humza.wajid@mila.quebec

amirhossein.zamani@mila.quebec

AmirHossein Zamani

Doctorat - Concordia

Congshu Zou

Maîtrise recherche - Concordia

congshu.zou@mila.quebec

Publications

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Adam Ibrahim

Benjamin Thérien

Kshitij Gupta

Mats Leon Richter

Quentin Gregory Anthony

Timothee LESORT

2024-07-08

TMLR (accepté)

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi

Albert Manuel Orozco Camacho

Ensembling multiple models enhances predictive performance by utilizing the varied learned features of the different models but incurs signi… (voir plus)ficant computational and storage costs. Model fusion, which combines parameters from multiple models into one, aims to mitigate these costs but faces practical challenges due to the complex, non-convex nature of neural network loss landscapes, where learned minima are often separated by high loss barriers. Recent works have explored using permutations to align network features, reducing the loss barrier in parameter space. However, permutations are restrictive since they assume a one-to-one mapping between the different models' neurons exists. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the model features. We show that our method of aligning models leads to better performances than past methods when averaging models trained on the same, or differing data splits. We also extend this analysis into the harder many models setting where more than 2 models are merged, and we find that CCA Merge works significantly better in this setting than past methods.

2024-05-01

ICML.cc/2024/Conference (poster)

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi

Albert Manuel Orozco Camacho

2024-05-01

ICML.cc/2024/Conference (poster)

Adversarial Attacks on the Interpretation of Neuron Activation Maximization

G'eraldin Nanfack

Alexander Fulleringer

Jonathan Marty

Michael Eickenberg

Feature visualization is one of the most popular techniques used to interpret the internal behavior of individual units of trained deep neur… (voir plus)al networks. Based on activation maximization, they consist of finding synthetic or natural inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. It consists of finetuning a pre-trained model with a specifically introduced loss that aims to maintain model performance, while also significantly changing feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the classification task with ImageNet.

2024-03-24

Proceedings of the AAAI Conference on Artificial Intelligence (publié)

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Adam Ibrahim

Benjamin Thérien

Kshitij Gupta

Mats Leon Richter

Quentin Anthony

Timothee LESORT

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes ava… (voir plus)ilable. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English

2024-03-13

ArXiv (prépublication)

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi

Albert Manuel Orozco Camacho

2024-03-02

ICLR.cc/2024/Workshop/Re-Align (poster)

Channel-Selective Normalization for Label-Shift Robust Test-Time Adaptation

Pedro Vianna

Muawiz Chaudhary

Paria Mehrbod

An Tang

Guy Cloutier

Michael Eickenberg

Deep neural networks have useful applications in many different tasks, however their performance can be severely affected by changes in the … (voir plus)data distribution. For example, in the biomedical field, their performance can be affected by changes in the data (different machines, populations) between training and test datasets. To ensure robustness and generalization to real-world scenarios, test-time adaptation has been recently studied as an approach to adjust models to a new data distribution during inference. Test-time batch normalization is a simple and popular method that achieved compelling performance on domain shift benchmarks. It is implemented by recalculating batch normalization statistics on test batches. Prior work has focused on analysis with test data that has the same label distribution as the training data. However, in many practical applications this technique is vulnerable to label distribution shifts, sometimes producing catastrophic failure. This presents a risk in applying test time adaptation methods in deployment. We propose to tackle this challenge by only selectively adapting channels in a deep network, minimizing drastic adaptation that is sensitive to label shifts. Our selection scheme is based on two principles that we empirically motivate: (1) later layers of networks are more sensitive to label shift (2) individual features can be sensitive to specific classes. We apply the proposed technique to three classification tasks, including CIFAR10-C, Imagenet-C, and diagnosis of fatty liver, where we explore both covariate and label distribution shifts. We find that our method allows to bring the benefits of TTA while significantly reducing the risk of failure common in other methods, while being robust to choice in hyperparameters.

2024-02-07

ArXiv (prépublication)

Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks

MohammadReza Davari

2023-12-11

ArXiv (prépublication)

Can We Learn Communication-Efficient Optimizers?

Charles-Étienne Joseph

Benjamin Thérien

Abhinav Moudgil

Boris Knyazev

2023-12-02

ArXiv (prépublication)

Channel Selection for Test-Time Adaptation Under Distribution Shift

Pedro Vianna

Muawiz Sajjad Chaudhary

An Tang

Guy Cloutier

Michael Eickenberg

To ensure robustness and generalization to real-world scenarios, test-time adaptation has been recently studied as an approach to adjust mod… (voir plus)els to a new data distribution during inference. Test-time batch normalization is a simple and popular method that achieved compelling performance on domain shift benchmarks by recalculating batch normalization statistics on test batches. However, in many practical applications this technique is vulnerable to label distribution shifts. We propose to tackle this challenge by only selectively adapting channels in a deep network, minimizing drastic adaptation that is sensitive to label shifts. We find that adapted models significantly improve the performance compared to the baseline models and counteract unknown label shifts.

2023-10-27

NeurIPS.cc/2023/Workshop/DistShift (poster)

Learning Optimizers for Local SGD

Charles-Étienne Joseph

Benjamin Thérien

Abhinav Moudgil

Boris Knyazev

2023-10-27

NeurIPS.cc/2023/Workshop/Federated_Learning (poster)

Gradient Masked Averaging for Federated Learning

Irene Tenison

Sai Aravind Sreeramadas

Vaikkunth Mugunthan

Edouard Oyallon

Federated learning (FL) is an emerging paradigm that permits a large number of clients with heterogeneous data to coordinate learning of a u… (voir plus)nified global model without the need to share data amongst each other. A major challenge in federated learning is the heterogeneity of data across client, which can degrade the performance of standard FL algorithms. Standard FL algorithms involve averaging of model parameters or gradient updates to approximate the global model at the server. However, we argue that in heterogeneous settings, averaging can result in information loss and lead to poor generalization due to the bias induced by dominant client gradients. We hypothesize that to generalize better across non-i.i.d datasets, the algorithms should focus on learning the invariant mechanism that is constant while ignoring spurious mechanisms that differ across clients. Inspired from recent works in Out-of-Distribution generalization, we propose a gradient masked averaging approach for FL as an alternative to the standard averaging of client updates. This aggregation technique for client updates can be adapted as a drop-in replacement in most existing federated algorithms. We perform extensive experiments on multiple FL algorithms with in-distribution, real-world, feature-skewed out-of-distribution, and quantity imbalanced datasets and show that it provides consistent improvements, particularly in the case of heterogeneous clients.

2023-10-23

TMLR (accepté)