Eugene Belilovsky

Paul Janson

PhD - Concordia

Co-supervisor:

Charles-Etienne Joseph

Research Master's - UdeM

Co-supervisor:

Zafir Khalid

Research Master's - Concordia

Co-supervisor:

Website

Gwen Legate

PhD - Concordia

Co-supervisor:

Research Master's - Concordia

Co-supervisor:

PhD - Concordia

Adel Nabli

PhD - Concordia

Google Scholar

Geraldin Nanfack

Postdoc - Concordia

Co-supervisor:

geraldin.nanfack@mila.quebec

Website

Google Scholar

Albert Orozco Camacho

PhD - Concordia

Co-supervisor:

PhD - Concordia

Co-supervisor:

Benjamin Therien

PhD - UdeM

Principal supervisor:

Research collaborator - UdeM

Principal supervisor:

PhD - Concordia

Co-supervisor:

Congshu Zou

Research Master's - Concordia

Publications

Guiding The Last Layer in Federated Learning with Pre-Trained Models

Gwen Legate

Nicolas Bernier

Lucas Caccia

Edouard Oyallon

$\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning

Adel Nabli

Edouard Oyallon

Automated liver segmentation and steatosis grading using deep learning on B-mode ultrasound images

Pedro Vianna

Merve Kulbay

Pamela Boustros

Sara-Ivana Calce

Cassandra Larocque-Rigney

Laurent Patry-Beaudoin

Yi Hui Luo

Muawiz Chaudhary

Samuel Kadoury

Bich Nguyen

Emmanuel Montagnon

Michael Chassé

An Tang

Guy Cloutier

Early detection of nonalcoholic fatty liver disease (NAFLD) is crucial to avoid further complications. Ultrasound is often used for screening and monitoring of hepatic steatosis; however, it is limited by the subjective interpretation of images. Computer-assisted diagnosis could help radiologists achieve objective grading, and artificial intelligence approaches have been tested across various medical applications. In this study, we evaluated the performance of a two-stage hepatic steatosis detection deep learning framework, with a first step of liver segmentation and a subsequent step of hepatic steatosis classification. We evaluated the models on internal and external datasets, aiming to understand the generalizability of the framework. In the external dataset, our segmentation model achieved a Dice score of 0.92 (95% CI: 0.78, 1.00), and our classification model achieved an area under the receiver operating characteristic curve of 0.84 (95% CI: 0.79, 0.89). Our findings highlight the potential benefits of applying artificial intelligence models in NAFLD assessment.

2023-09-03

IUS (published)
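The Dice score reported in the abstract above measures the overlap between a predicted and a reference segmentation mask. The following is a minimal sketch of that metric, not the authors' evaluation code: the function name `dice_score`, the flat-list mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum


# One shared foreground pixel out of three foreground pixels in total.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground pixels.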

Continual Pre-Training of Large Language Models: How to (re)warm your model?

Kshitij Gupta

Benjamin Therien

Adam Ibrahim

Mats Leon Richter

Quentin Gregory Anthony

Timothee LESORT

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch

2023-06-20

ICML.cc/2023/Workshop/ES-FoMO (poster)
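The abstract above refers to a linear warmup and cosine decay learning-rate schedule, with the maximum learning rate and warmup length as the quantities being varied. A minimal sketch of such a schedule is shown below. The function name `lr_at_step`, its parameters, and the example values are hypothetical and are not taken from the paper.

```python
import math


def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup phase.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to at most 1.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine decay from max_lr down to min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: 100 warmup steps out of 1000, peak rate 3e-4.
schedule = [lr_at_step(s, 3e-4, 100, 1000) for s in range(0, 1001, 100)]
```

In this sketch, "rewarming" a checkpoint corresponds to restarting the schedule at step 0 when training continues on the new dataset, so the learning rate rises again before decaying.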

Learning to Optimize with Recurrent Hierarchical Transformers

2023-06-19

ICML.cc/2023/Workshop/Frontiers4LCD (published)

Simulated Annealing in Early Layers Leads to Better Generalization

Amir M. Sarfi

Zahra Karimpour

Muawiz Chaudhary

Nasir M. Khalid

Mirco Ravanelli

Sudhir Mudur

Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance. The code to reproduce our results is publicly available at: https://github.com/amiiir-sarfi/SEAL

2023-06-17

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (published)

arxiv.org

Preventing Dimensional Collapse in Contrastive Local Learning with Subsampling

Louis Fournier

Adeetya Patel

Michael Eickenberg

Edouard Oyallon

2023-06-16

ICML.cc/2023/Workshop/LLW (published)

A2CiD2: Accelerating Asynchronous Communication in Decentralized Deep Learning

Adel Nabli

Edouard Oyallon

2023-06-14

ArXiv (preprint)

arxiv.org

Reliability of CKA as a Similarity Measure in Deep Learning

MohammadReza Davari

Comparing learned neural representations in neural networks is a challenging but important problem, which has been approached in different ways. The Centered Kernel Alignment (CKA) similarity metric, particularly its linear variant, has recently become a popular approach and has been widely used to compare representations of a network's different layers, of architecturally similar networks trained differently, or of models with different architectures trained on the same data. A wide variety of claims about similarity and dissimilarity of these various representations have been made using CKA results. In this work we present analysis that formally characterizes CKA sensitivity to a large class of simple transformations, which can naturally occur in the context of modern machine learning. This provides a concrete explanation to CKA sensitivity to outliers, which has been observed in past works, and to transformations that preserve the linear separability of the data, an important generalization attribute. We empirically investigate several weaknesses of the CKA similarity metric, demonstrating situations in which it gives unexpected or counterintuitive results. Finally we study approaches for modifying representations to maintain functional behaviour while changing the CKA value. Our results illustrate that, in many cases, the CKA value can be easily manipulated without substantial changes to the functional behaviour of the models, and call for caution when leveraging activation alignment metrics.

2023-02-01

ICLR.cc/2023/Conference (poster)
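For readers unfamiliar with the linear variant of CKA discussed in the abstract above, here is a minimal pure-Python sketch of the standard formula, CKA(X, Y) = ||Yᵀ X||_F² / (||Xᵀ X||_F · ||Yᵀ Y||_F), computed on column-centered representation matrices. The helper names and the toy data are assumptions for illustration, the code is not from the paper, and it does not handle constant (zero-variance) representations.

```python
import math


def center_columns(m):
    """Subtract each column's mean from a row-major matrix (list of rows)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices sharing the same rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two sets of features for the same n examples."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(x, y)
    return numerator / math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))


# Toy features: 4 examples, 2 dimensions (hypothetical values).
feats = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# The same features rotated by 90 degrees and scaled by 3.
rotated_scaled = [[-3.0 * b, 3.0 * a] for a, b in feats]
same = linear_cka(feats, rotated_scaled)
```

In this toy case the value stays at 1 (up to floating-point error), since linear CKA is invariant to orthogonal transformations and isotropic scaling. The transformations the abstract says CKA is sensitive to fall outside that invariance class.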

Can Forward Gradient Match Backpropagation?

Louis Fournier

Stephane Rivaud

Michael Eickenberg

Edouard Oyallon

Forward Gradients - the idea of using directional derivatives in forward differentiation mode - have recently been shown to be utilizable for neural network training while avoiding problems generally associated with backpropagation gradient computation, such as locking and memorization requirements. The cost is the requirement to guess the step direction, which is hard in high dimensions. While current solutions rely on weighted averages over isotropic guess vector distributions, we propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks. For a standard computer vision neural network, we conduct a rigorous study systematically covering a variety of combinations of gradient targets and gradient guesses, including those previously presented in the literature. We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.

2023-01-01

ICML (published)

Gradient Masked Averaging for Federated Learning

Irene Tenison

Sai Aravind Sreeramadas

Vaikkunth Mugunthan

Edouard Oyallon

Federated learning (FL) is an emerging paradigm that permits a large number of clients with heterogeneous data to coordinate learning of a unified global model without the need to share data amongst each other. A major challenge in federated learning is the heterogeneity of data across clients, which can degrade the performance of standard FL algorithms. Standard FL algorithms involve averaging of model parameters or gradient updates to approximate the global model at the server. However, we argue that in heterogeneous settings, averaging can result in information loss and lead to poor generalization due to the bias induced by dominant client gradients. We hypothesize that to generalize better across non-i.i.d. datasets, the algorithms should focus on learning the invariant mechanism that is constant while ignoring spurious mechanisms that differ across clients. Inspired by recent works in Out-of-Distribution generalization, we propose a gradient masked averaging approach for FL as an alternative to the standard averaging of client updates. This aggregation technique for client updates can be adapted as a drop-in replacement in most existing federated algorithms. We perform extensive experiments on multiple FL algorithms with in-distribution, real-world, feature-skewed out-of-distribution, and quantity imbalanced datasets and show that it provides consistent improvements, particularly in the case of heterogeneous clients.

2023-01-01

Trans. Mach. Learn. Res. (published)

Prototype-Sample Relation Distillation: Towards Replay-Free Continual Learning

Nader Asadi

MohammadReza Davari

Sudhir Mudur

Rahaf Aljundi

In continual learning (CL), balancing effective adaptation while combating catastrophic forgetting is a central challenge. Many of the recent best-performing methods utilize various forms of prior task data, e.g. a replay buffer, to tackle the catastrophic forgetting problem. Having access to previous task data can be restrictive in many real-world scenarios, for example when task data is sensitive or proprietary. To overcome the necessity of using previous tasks' data, in this work, we start with strong representation learning methods that have been shown to be less prone to forgetting. We propose a holistic approach to jointly learn the representation and class prototypes while maintaining the relevance of old class prototypes and their embedded similarities. Specifically, samples are mapped to an embedding space where the representations are learned using a supervised contrastive loss. Class prototypes are evolved continually in the same latent space, enabling learning and prediction at any point. To continually adapt the prototypes without keeping any prior task data, we propose a novel distillation loss that constrains class prototypes to maintain relative similarities as compared to new task data. This method yields state-of-the-art performance in the task-incremental setting, outperforming methods relying on large amounts of data, and provides strong performance in the class-incremental setting without using any stored data points.

2023-01-01

ICML (published)