Eugene Belilovsky

Adel Nabli

PhD - Concordia University

adel.nabli@mila.quebec

alexander.fulleringer@mila.quebec

Albert Orozco Camacho

PhD - Concordia University

Co-supervisor:

Research Master's - Concordia University

amirhossein.zamani@mila.quebec

AmirHossein Zamani

PhD - Concordia University

PhD - Université de Montréal

Principal supervisor:

benjamin.therien@mila.quebec

Charles-Etienne Joseph

Research Master's - Université de Montréal

Co-supervisor:

charles-etienne.joseph@mila.quebec

Congshu Zou

Research Master's - Concordia University

congshu.zou@mila.quebec

Donald Shenaj

Research collaborator - Concordia University

Co-supervisor:

donald.shenaj@mila.quebec

Postdoctoral fellow - Concordia University

Co-supervisor:

geraldin.nanfack@mila.quebec

Gwen Legate

PhD - Concordia University

Co-supervisor:

gwendolyne.legate@mila.quebec

Humza Wajid Hameed

Research Master's - Concordia University

humza.wajid@mila.quebec

louis.fournier@mila.quebec

Louis Fournier

Research intern - Concordia University

Research Master's - Concordia University

Co-supervisor:

paria.mehrbod@mila.quebec

Reza Davari

PhD - Concordia University

mohammadreza.davari@mila.quebec

Nader Asadi

Alumni collaborator

Co-supervisor:

nicolas.bernier@mila.quebec

nader.asadi@mila.quebec

Research Master's - Concordia University

Paul Janson

Research Master's - Concordia University

paul.janson@mila.quebec

Research collaborator - Université de Montréal

Principal supervisor:

pedro.vianna@mila.quebec

Vaibhav Singh

PhD - Concordia University

Co-supervisor:

vaibhav.singh@mila.quebec

Zafir Khalid

Research Master's - Concordia University

zafir.khaled@mila.quebec

Publications

Guiding The Last Layer in Federated Learning with Pre-Trained Models

Gwen Legate

Nicolas Bernier

Lucas Caccia

Edouard Oyallon

$\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning

Adel Nabli

Edouard Oyallon

Automated liver segmentation and steatosis grading using deep learning on B-mode ultrasound images

Pedro Vianna

Merve Kulbay

Pamela Boustros

Sara-Ivana Calce

Cassandra Larocque-Rigney

Laurent Patry-Beaudoin

Yi Hui Luo

Muawiz Chaudhary

Samuel Kadoury

Bich Nguyen

Emmanuel Montagnon

Michaël Chassé

An Tang

Guy Cloutier

Early detection of nonalcoholic fatty liver disease (NAFLD) is crucial to avoid further complications. Ultrasound is often used for screening and monitoring of hepatic steatosis; however, it is limited by the subjective interpretation of images. Computer-assisted diagnosis could help radiologists achieve objective grading, and artificial intelligence approaches have been tested across various medical applications. In this study, we evaluated the performance of a two-stage deep learning framework for hepatic steatosis detection, with a first step of liver segmentation and a subsequent step of hepatic steatosis classification. We evaluated the models on internal and external datasets, aiming to understand the generalizability of the framework. In the external dataset, our segmentation model achieved a Dice score of 0.92 (95% CI: 0.78, 1.00), and our classification model achieved an area under the receiver operating characteristic curve of 0.84 (95% CI: 0.79, 0.89). Our findings highlight the potential benefits of applying artificial intelligence models in NAFLD assessment.

2023-09-03

IUS (published)
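
The Dice score reported in the abstract above measures the overlap between a predicted segmentation mask and a reference mask. The following is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flat binary masks. The function name `dice_score`, the example masks, and the empty-mask convention are illustrative assumptions and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 2x3 masks, flattened row by row.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 0, 1, 0]
score = dice_score(predicted, reference)
```

A score of 1 means the two masks coincide exactly, and a score of 0 means they share no foreground pixels.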

Can Forward Gradient Match Backpropagation?

Louis Fournier

Stephane Rivaud

Michael Eickenberg

Edouard Oyallon

Forward Gradients - the idea of using directional derivatives in forward differentiation mode - have recently been shown to be usable for neural network training while avoiding problems generally associated with backpropagation gradient computation, such as locking and memorization requirements. The cost is the requirement to guess the step direction, which is hard in high dimensions. While current solutions rely on weighted averages over isotropic guess vector distributions, we propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks. For a standard computer vision neural network, we conduct a rigorous study systematically covering a variety of combinations of gradient targets and gradient guesses, including those previously presented in the literature. We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

Continual Pre-Training of Large Language Models: How to (re)warm your model?

Kshitij Gupta

Benjamin Thérien

Adam Ibrahim

Mats Leon Richter

Quentin Gregory Anthony

Timothée Lesort

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch.

2023-06-20

ICML.cc/2023/Workshop/ES-FoMO (poster)
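
The abstract above refers to a linear warmup and cosine decay learning rate schedule. The following is a minimal sketch of one common form of such a schedule. The function name `lr_at_step` and all numeric values (maximum and minimum learning rate, warmup length, total steps) are illustrative assumptions and are not the settings used in the paper.

```python
import math


def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup phase.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine curve from max_lr (progress 0) down to min_lr (progress 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to illustrate the shape of the schedule.
MAX_LR, MIN_LR, WARMUP, TOTAL = 3e-4, 3e-5, 100, 1000
schedule = [lr_at_step(s, MAX_LR, MIN_LR, WARMUP, TOTAL) for s in range(TOTAL + 1)]
```

In this sketch, "rewarming" a checkpoint corresponds to restarting the schedule at step 0 when training on the new dataset begins, so that the learning rate climbs back to `max_lr` before decaying again.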

Learning to Optimize with Recurrent Hierarchical Transformers

Abhinav Moudgil

Boris Knyazev

Guillaume Lajoie

2023-06-19

ICML.cc/2023/Workshop/Frontiers4LCD (published)

Simulated Annealing in Early Layers Leads to Better Generalization

Amir M. Sarfi

Zahra Karimpour

Muawiz Chaudhary

Nasir M. Khalid

Mirco Ravanelli

Sudhir Mudur

Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance. The code to reproduce our results is publicly available at: https://github.com/amiiir-sarfi/SEAL

2023-06-17

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (published)

Preventing Dimensional Collapse in Contrastive Local Learning with Subsampling

Louis Fournier

Adeetya Patel

Michael Eickenberg

Edouard Oyallon

2023-06-16

ICML.cc/2023/Workshop/LLW (published)

$\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning

Adel Nabli

Edouard Oyallon

2023-06-14

arXiv (preprint)

Reliability of CKA as a Similarity Measure in Deep Learning

MohammadReza Davari

Stefan Horoi

Amine Natik

Guillaume Lajoie

Comparing learned neural representations in neural networks is a challenging but important problem, which has been approached in different ways. The Centered Kernel Alignment (CKA) similarity metric, particularly its linear variant, has recently become a popular approach and has been widely used to compare representations of a network's different layers, of architecturally similar networks trained differently, or of models with different architectures trained on the same data. A wide variety of claims about similarity and dissimilarity of these various representations have been made using CKA results. In this work we present analysis that formally characterizes CKA sensitivity to a large class of simple transformations, which can naturally occur in the context of modern machine learning. This provides a concrete explanation for CKA sensitivity to outliers, which has been observed in past works, and to transformations that preserve the linear separability of the data, an important generalization attribute. We empirically investigate several weaknesses of the CKA similarity metric, demonstrating situations in which it gives unexpected or counterintuitive results. Finally, we study approaches for modifying representations to maintain functional behaviour while changing the CKA value. Our results illustrate that, in many cases, the CKA value can be easily manipulated without substantial changes to the functional behaviour of the models, and call for caution when leveraging activation alignment metrics.

2023-02-01

ICLR.cc/2023/Conference (poster)
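
The abstract above discusses the linear variant of Centered Kernel Alignment (CKA). The following is a minimal sketch of the commonly used feature-space formula, ||YᵀX||²_F / (||XᵀX||_F ||YᵀY||_F) on column-centered activations, written with NumPy. The function name `linear_cka` and the random example data are illustrative assumptions; this is not the authors' code and it does not reproduce their analysis.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    # Center each feature (column) across the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Hypothetical activations: 50 samples with 8 features each.
rng = np.random.default_rng(0)
acts = rng.normal(size=(50, 8))
# A random orthogonal matrix, obtained from a QR decomposition.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))

same = linear_cka(acts, acts)
rotated_and_scaled = linear_cka(acts, 3.0 * acts @ rotation)
```

Both values equal 1 up to floating-point error, because linear CKA is invariant to orthogonal transformations and isotropic scaling of the features.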

Prototype-Sample Relation Distillation: Towards Replay-Free Continual Learning

Nader Asadi

MohammadReza Davari

Sudhir Mudur

Rahaf Aljundi

In continual learning (CL), balancing effective adaptation while combating catastrophic forgetting is a central challenge. Many of the recent best-performing methods utilize various forms of prior task data, e.g. a replay buffer, to tackle the catastrophic forgetting problem. Having access to previous task data can be restrictive in many real-world scenarios, for example when task data is sensitive or proprietary. To overcome the necessity of using previous tasks' data, in this work, we start with strong representation learning methods that have been shown to be less prone to forgetting. We propose a holistic approach to jointly learn the representation and class prototypes while maintaining the relevance of old class prototypes and their embedded similarities. Specifically, samples are mapped to an embedding space where the representations are learned using a supervised contrastive loss. Class prototypes are evolved continually in the same latent space, enabling learning and prediction at any point. To continually adapt the prototypes without keeping any prior task data, we propose a novel distillation loss that constrains class prototypes to maintain relative similarities as compared to new task data. This method yields state-of-the-art performance in the task-incremental setting, outperforming methods relying on large amounts of data, and provides strong performance in the class-incremental setting without using any stored data points.

2023-01-01

ICML (published)

Re-Weighted Softmax Cross-Entropy to Control Forgetting in Federated Learning

Gwen Legate

Lucas Caccia

In Federated Learning, a global model is learned by aggregating model updates computed at a set of independent client nodes. To reduce communication costs, multiple gradient steps are performed at each node prior to aggregation. A key challenge in this setting is data heterogeneity across clients, resulting in differing local objectives. This can lead clients to overly minimize their own local objective, consequently diverging from the global solution. We demonstrate that individual client models experience catastrophic forgetting with respect to data from other clients and propose an efficient approach that modifies the cross-entropy objective on a per-client basis by re-weighting the softmax logits prior to computing the loss. This approach shields classes outside a client’s label set from abrupt representation change, and we empirically demonstrate it can alleviate client forgetting and provide consistent improvements to standard federated learning algorithms. Our method is particularly beneficial under the most challenging federated learning settings where data heterogeneity is high and client participation in each round is low.

2023-01-01

CoLLAs (published)