Mila's AI for Climate Studio aims to bridge the gap between technology and impact in order to unlock the potential of AI to tackle the climate crisis rapidly and at scale.
The program recently published its first policy brief, titled "Policy Considerations at the Intersection of Quantum Technologies and Artificial Intelligence," authored by Padmapriya Mohan.
Hugo Larochelle appointed Scientific Director of Mila
An adjunct professor at Université de Montréal and former head of Google's AI research lab in Montréal, Hugo Larochelle is a pioneer of deep learning and one of Canada's most respected researchers.
Recent trends toward larger models and larger datasets require huge amounts of computational resources, making distributed deep learning essential. Data parallelism is a common approach to speed up training, but it often involves frequent communication between workers, which can be a bottleneck. In this work, we propose a method called Pseudo-Asynchronous Local SGD (PALSGD) to improve the efficiency of data-parallel training. PALSGD is a novel extension of LocalSGD (SU Stich, 2018), designed to further reduce communication frequency by introducing a pseudo-synchronization mechanism. PALSGD allows the use of longer synchronization intervals compared to standard LocalSGD. Despite the reduced communication frequency, the pseudo-synchronization approach ensures that model consistency is maintained, leading to performance results comparable to those achieved with more frequent synchronization. Furthermore, we provide a theoretical analysis of PALSGD, establishing its convergence and deriving its convergence rate. This analysis offers insights into the algorithm's behavior and performance guarantees. We evaluated PALSGD on CIFAR-10 with a CNN and on TinyStories with GPT-Neo. Our results show that PALSGD achieves better performance in less time compared to existing methods like distributed data parallel (DDP), LocalSGD and DiLoCo (Douillard et al. 2023).
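For readers unfamiliar with the baseline that PALSGD extends, the following is a minimal single-process sketch of plain Local SGD: each simulated worker takes several local gradient steps on its own data shard, and the models are averaged only at the end of each round. It does not implement the pseudo-synchronization mechanism, whose details are not given in the abstract. The function names, the least-squares problem, and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def make_shards(num_workers=4, n=64, dim=5, seed=1):
    """Build a toy noiseless least-squares problem split across workers (hypothetical data)."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)
    X_shards = [rng.normal(size=(n, dim)) for _ in range(num_workers)]
    y_shards = [X @ w_true for X in X_shards]
    return X_shards, y_shards, w_true

def local_sgd(X_shards, y_shards, lr=0.05, local_steps=8, rounds=20, seed=0):
    """Plain Local SGD: local_steps SGD updates per worker, then one parameter average."""
    rng = np.random.default_rng(seed)
    dim = X_shards[0].shape[1]
    w_global = np.zeros(dim)
    for _ in range(rounds):
        local_models = []
        for X, y in zip(X_shards, y_shards):
            w = w_global.copy()  # every worker starts the round from the shared model
            for _ in range(local_steps):
                i = rng.integers(len(y))          # sample one example from the local shard
                grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x.w - y)^2
                w -= lr * grad
            local_models.append(w)
        # The only communication step: average the worker models once per round.
        w_global = np.mean(local_models, axis=0)
    return w_global

X_shards, y_shards, w_true = make_shards()
w_hat = local_sgd(X_shards, y_shards)
print("distance to true weights:", np.linalg.norm(w_hat - w_true))
```

Increasing `local_steps` reduces how often workers communicate, which is the trade-off the abstract describes: fewer synchronizations per unit of computation, at the risk of worker models drifting apart between averages.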
Understanding the optimization dynamics of neural networks is necessary for closing the gap between theory and practice. Stochastic first-order optimization algorithms are known to efficiently locate favorable minima in deep neural networks. This efficiency, however, contrasts with the non-convex and seemingly complex structure of neural loss landscapes. In this study, we delve into the fundamental geometric properties of sampled gradients along optimization paths. We focus on two key quantities, which appear in the restricted secant inequality and error bound. Both hold high significance for first-order optimization. Our analysis reveals that these quantities exhibit predictable, consistent behavior throughout training, despite the stochasticity induced by sampling minibatches. Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training. These observed properties are sufficiently expressive to theoretically guarantee linear convergence and prescribe learning rate schedules mirroring empirical practices. We conduct our experiments on image classification, semantic segmentation and language modeling across different batch sizes, network architectures, datasets, optimizers, and initialization seeds. We discuss the impact of each factor. Our work provides novel insights into the properties of neural network loss functions, and opens the door to theoretical frameworks more relevant to prevalent practice.
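As a rough illustration of the two quantities mentioned above, the sketch below computes the ratios that appear in the standard textbook forms of the restricted secant inequality, ⟨g, w − w*⟩ ≥ μ‖w − w*‖², and the error bound, ‖g‖ ≥ μ‖w − w*‖. The exact definitions and the choice of reference point used in the paper are not stated in the abstract, so the function name `rsi_and_eb_ratios` and the reference point `w_ref` are assumptions.

```python
import numpy as np

def rsi_and_eb_ratios(grad, w, w_ref):
    """Return (RSI ratio, EB ratio) for a gradient sampled at w.

    w_ref is a stand-in for a minimizer (an assumption; in practice it could be
    the final iterate of a training run). w must differ from w_ref.
    """
    diff = w - w_ref
    dist = np.linalg.norm(diff)
    rsi_ratio = float(grad @ diff) / dist**2          # <g, w - w_ref> / ||w - w_ref||^2
    eb_ratio = float(np.linalg.norm(grad)) / dist     # ||g|| / ||w - w_ref||
    return rsi_ratio, eb_ratio

# Toy check on f(w) = 0.5 * a * ||w||^2, whose gradient is a * w and minimizer is 0.
a = 3.0
w = np.array([1.0, -2.0, 0.5])
rsi, eb = rsi_and_eb_ratios(a * w, w, np.zeros_like(w))
print(rsi, eb)
```

On this isotropic quadratic both ratios equal the curvature `a`; tracking how such ratios evolve along a real training trajectory is the kind of measurement the abstract refers to.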
2024-07-08
Proceedings of the 41st International Conference on Machine Learning (published)
The sharpness-aware minimization (SAM) procedure recently gained increasing attention due to its favorable generalization ability to unseen data. SAM aims to find flatter (local) minima, utilizing a minimax objective. An immediate challenge in the application of SAM is the adjustment of two pivotal step sizes, which significantly influence its effectiveness. We introduce a novel, straightforward approach for adjusting step sizes that adapts to the smoothness of the objective function, thereby reducing the necessity for manual tuning. This method, termed Smoothness-Adaptive SAM (SA-SAM), not only simplifies the optimization process but also promotes the method's inherent tendency to converge towards flatter minima, enhancing performance in specific models.
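To make the "two pivotal step sizes" concrete, here is a minimal sketch of one standard SAM update with both step sizes fixed by hand: an ascent radius `rho` for the perturbation and a descent learning rate `lr`. It does not implement the smoothness-adaptive rule of SA-SAM, which the abstract does not specify. The function name `sam_step`, the toy objective, and the default values are assumptions.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05, eps=1e-12):
    """One plain SAM update with fixed step sizes (not the adaptive SA-SAM rule)."""
    g = grad_fn(w)
    # Ascent step: move a distance rho along the normalized gradient direction.
    perturbation = rho * g / (np.linalg.norm(g) + eps)
    # Descent step: apply the gradient evaluated at the perturbed point to the original weights.
    g_perturbed = grad_fn(w + perturbation)
    return w - lr * g_perturbed

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w0 = np.array([3.0, 4.0])
w1 = sam_step(w0, lambda w: w, lr=0.1, rho=0.5)
print(w1)
```

Both `rho` and `lr` have to be tuned in this baseline form, which is the manual effort the abstract says SA-SAM aims to reduce.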
Modern deep learning systems do not generalize well when the test data distribution is slightly different from the training data distribution. While much promising work has been accomplished to address this fragility, a systematic study of the role of optimizers and their out-of-distribution generalization performance has not been undertaken. In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We address this question for image and text classification using DomainBed, WILDS, and Backgrounds Challenge as testbeds for studying different types of shifts, namely correlation and diversity shift. We search over a wide range of hyperparameters and examine classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. We arrive at the following findings, which we expect to be helpful for practitioners: i) adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum SGD) on out-of-distribution performance. In particular, even though there is no significant difference in in-distribution performance, we show a measurable difference in out-of-distribution performance. ii) in-distribution performance and out-of-distribution performance exhibit three types of behavior depending on the dataset: linear returns, increasing returns, and diminishing returns. For example, when training on natural language data using Adam, improving in-distribution performance does not significantly contribute to out-of-distribution generalization performance.