Eugene Belilovsky

Reza Davari

PhD - Concordia

Research Intern - Concordia

Research Master's - Concordia

Paul Janson

Research Master's - Concordia

Co-supervisor:

Charles-Etienne Joseph

Research Master's - UdeM

Co-supervisor:

Zafir Khalid

Research Master's - Concordia

Website

Gwen Legate

PhD - Concordia

Co-supervisor:

Google Scholar

Melika Minaei Bidgoli

Research Intern - Concordia University

PhD - Concordia

Adel Nabli

PhD - Concordia

Google Scholar

Geraldin Nanfack

Postdoctorate - Concordia

Co-supervisor:

Albert Orozco Camacho

PhD - Concordia

Co-supervisor:

Paria Paria

Research Master's - Concordia

Co-supervisor:

Donald Shenaj

Research Collaborator - Concordia

Co-supervisor:

PhD - Concordia

Co-supervisor:

Benjamin Therien

PhD - UdeM

Principal supervisor:

Research Collaborator - UdeM

Principal supervisor:

Research Master's - Concordia

PhD - Concordia

Congshu Zou

Research Master's - Concordia

Publications

Accelerating Training with Neuron Interaction and Nowcasting Networks

Boris Knyazev

Abhinav Moudgil

Guillaume Lajoie

Simon Lacoste-Julien

Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. A simpler recently proposed approach to accelerate training is to use Adam for most of the optimization steps and periodically, only every few steps, nowcast (predict future) parameters. We improve this approach with Neuron interaction and Nowcasting (NiNo) networks. NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters by learning in a supervised way from a set of training trajectories over multiple tasks. We show that in some networks, such as Transformers, neuron connectivity is non-trivial. By accurately modeling neuron connectivity, we allow NiNo to accelerate Adam training by up to 50% in vision and language tasks.

2024-09-06

ArXiv (preprint)

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi

Albert Manuel Orozco Camacho

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Adam Ibrahim

Benjamin Thérien

Kshitij Gupta

Mats Leon Richter

Quentin Gregory Anthony

Timothée Lesort

2024-07-08

TMLR (accepted)

Model Breadcrumbs: Scalable Upcycling of Finetuned Foundation Models via Sparse Task Vectors Merging

MohammadReza Davari

2024-07-03

ICML.cc/2024/Workshop/FM-Wild (poster)

Simulating federated learning for steatosis detection using ultrasound images

Yue Qi

Pedro Vianna

Alexandre Cadrin-Chênevert

Katleen Blanchet

Emmanuel Montagnon

Louis-Antoine Mullie

Guy Cloutier

Michael Chassé

An Tang

2024-06-10

Scientific Reports (publié)

PETRA: Parallel End-to-end Training with Reversible Architectures

Stephane Rivaud

Louis Fournier

Thomas Pumir

Michael Eickenberg

Edouard Oyallon

Reversible architectures have been shown to be capable of performing on par with their non-reversible counterparts, being applied in deep learning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., a set of layers) to compute independently on different devices, while only needing to communicate activations and gradients between each other. By decoupling the forward and backward passes and keeping a single updated version of the parameters, the need for weight stashing is also removed. We develop a custom autograd-like training framework for PETRA, and we demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving competitive accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models.

2024-06-04

ArXiv (preprint)

ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Adel Nabli

Louis Fournier

Pierre Erbacher

Louis Serrano

Edouard Oyallon

2024-06-03

ArXiv (preprint)

From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation

Géraldin Nanfack

Michael Eickenberg

Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emergent field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are usually interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework. This paper starts by addressing limitations in existing works by proposing a novel attack called ProxPulse that simultaneously manipulates the two types of feature visualizations. Surprisingly, when analyzing these attacks under the umbrella of visual circuits, we find that visual circuits show some robustness to ProxPulse. We, therefore, introduce a new attack based on ProxPulse that unveils the manipulability of visual circuits, shedding light on their lack of robustness. The effectiveness of these attacks is validated using pre-trained AlexNet and ResNet-50 models on ImageNet.

2024-06-03

ArXiv (preprint)

μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Thérien

Charles-Étienne Joseph

Boris Knyazev

Edouard Oyallon

2024-05-31

ArXiv (preprint)

WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

Louis Fournier

Adel Nabli

Masih Aminbeidokhti

Marco Pedersoli

Edouard Oyallon

The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.

2024-05-27

ArXiv (preprint)
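
To make the shuffling idea in the WASH abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes each ensemble member is a flat list of floats and that "shuffling" means permuting, with probability `p`, the values held at one parameter position across the models. The names `shuffle_weights`, `average_weights`, and `p` are hypothetical, and the training steps that would run between shuffles are omitted.

```python
import random


def shuffle_weights(models, p, rng):
    """With probability p per position, permute that position's values across models.

    `models` is a list of equal-length lists of floats (one flat parameter
    vector per ensemble member). The shuffle is done in place.
    """
    n_params = len(models[0])
    for i in range(n_params):
        if rng.random() < p:
            # Collect the i-th parameter from every model and permute it.
            column = [m[i] for m in models]
            rng.shuffle(column)
            for m, value in zip(models, column):
                m[i] = value


def average_weights(models):
    """Return the element-wise mean of the models' parameter vectors."""
    n = len(models)
    return [sum(values) / n for values in zip(*models)]


# Toy ensemble: 4 models with 8 parameters each, model k filled with the value k.
rng = random.Random(0)
models = [[float(k)] * 8 for k in range(4)]

before = average_weights(models)
shuffle_weights(models, p=0.25, rng=rng)
after = average_weights(models)
```

Under this assumption, a shuffle only moves values between models at the same position, so the averaged parameters are identical before and after the shuffle; only the individual models change.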

AdaFisher: Adaptive Second Order Optimization via Fisher Information

Damien Martins Gomes

Yanlei Zhang

Mahdi S. Hosseini

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic gradient during the training. Despite the widespread use of first-order methods, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts, e.g., Adam and SGD. However, their practicality in training DNNs is still limited due to increased per-iteration computations and suboptimal accuracy compared to the first-order methods. We present AdaFisher, an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in a second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification and language modelling, and that it stands out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code available from https://github.com/AtlasAnalyticsLab/AdaFisher

2024-05-26

ArXiv (preprint)

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi

Albert Manuel Orozco Camacho

Ensembling multiple models enhances predictive performance by utilizing the varied learned features of the different models but incurs significant computational and storage costs. Model fusion, which combines parameters from multiple models into one, aims to mitigate these costs but faces practical challenges due to the complex, non-convex nature of neural network loss landscapes, where learned minima are often separated by high loss barriers. Recent works have explored using permutations to align network features, reducing the loss barrier in parameter space. However, permutations are restrictive since they assume a one-to-one mapping between the different models' neurons exists. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the model features. We show that our method of aligning models leads to better performance than past methods when averaging models trained on the same or differing data splits. We also extend this analysis to the harder many-model setting, where more than 2 models are merged, and we find that CCA Merge works significantly better in this setting than past methods.

2024-05-01

ICML.cc/2024/Conference (poster)