Publications

Unmasking the Lottery Ticket Hypothesis: Efficient Adaptive Pruning for Finding Winning Tickets

Mansheej Paul

Feng Chen

Brett W. Larsen

Jonathan Frankle

Surya Ganguli

Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that require less compute and memory but can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets, that can be retrained from initialization or an early training stage. IMP operates by iterative cycles of training, masking a fraction of the smallest-magnitude weights, rewinding unmasked weights back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? We find that, at higher sparsities, pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training encode information about the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. We leverage this observation to design a simple adaptive pruning heuristic for speeding up the discovery of winning tickets and achieve a 30% reduction in computation time on CIFAR-100. These results make progress toward demystifying the existence of winning tickets with an eye towards enabling the development of more efficient pruning algorithms.

2022-10-19

NeurIPS.cc/2022/Workshop/HITY (accepted)

openreview.net
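
The IMP cycle described in the abstract of "Unmasking the Lottery Ticket Hypothesis" above (train, mask a fraction of the smallest-magnitude weights, rewind, repeat) can be sketched on a toy problem. This is not the authors' implementation: the linear regression model, the plain gradient descent trainer, and every name and parameter here (`train`, `prune_smallest`, `imp`, `rewind_steps`, the 50% pruning fraction) are assumptions chosen for illustration.

```python
import numpy as np

def train(w, mask, X, y, steps, lr=0.1):
    """Gradient descent on mean squared error; masked weights are held at zero."""
    w = w * mask
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def prune_smallest(w, mask, fraction):
    """Mask out `fraction` of the surviving weights with the smallest magnitude."""
    alive = np.flatnonzero(mask)
    n_prune = int(np.floor(fraction * alive.size))
    if n_prune == 0:
        return mask
    order = np.argsort(np.abs(w[alive]))
    new_mask = mask.copy()
    new_mask[alive[order[:n_prune]]] = 0.0
    return new_mask

def imp(X, y, w_init, rounds=2, fraction=0.5, rewind_steps=5, train_steps=200):
    """Toy iterative magnitude pruning with rewinding to an early training point."""
    mask = np.ones_like(w_init)
    # Rewind point: the dense weights after a few early training steps.
    w_rewind = train(w_init, mask, X, y, rewind_steps)
    for _ in range(rounds):
        # Train the current sparse network to completion from the rewind point.
        w_final = train(w_rewind, mask, X, y, train_steps)
        # Mask a fraction of the smallest-magnitude surviving weights.
        mask = prune_smallest(w_final, mask, fraction)
        # Rewinding is implicit: the next round restarts from w_rewind * mask.
    return mask, w_rewind * mask

# Hypothetical data: only the first two of eight features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true

mask, w_ticket = imp(X, y, w_init=0.1 * rng.normal(size=8))
w_retrained = train(w_ticket, mask, X, y, steps=200)
print("surviving weights:", np.flatnonzero(mask))
```

In this toy setting the sparse network returned by `imp` can be retrained from the rewind point to the same solution as the dense model, which loosely mirrors the notion of a matching subnetwork.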

Causal inference from text: A commentary

Dhanya Sridhar

David Blei

2022-10-18

Science Advances (published)

doi.org

Aligning MAGMA by Few-Shot Learning and Finetuning

Jean-Charles Layoun

Alexis Roger

Irina Rish

2022-10-17

ArXiv (preprint)

doi.org

arxiv.org

Generalizing in the Real World with Representation Learning

Tegan Maharaj

2022-10-17

ArXiv (preprint)

doi.org

arxiv.org

Adapting Triplet Importance of Implicit Feedback for Personalized Recommendation

Haolun Wu

Chen Ma

Yingxue Zhang

Xue Liu

Ruiming Tang

Mark J. Coates

2022-10-16

Proceedings of the 31st ACM International Conference on Information & Knowledge Management (published)

doi.org

arxiv.org

Lifelong Online Learning from Accumulated Knowledge

Changjian Shui

William Wang

Ihsen Hedhli

Chi Man Wong

Feng Wan

Boyu Wang

Christian Gagné

In this article, we formulate lifelong learning as an online transfer learning procedure over consecutive tasks, where learning a given task depends on the accumulated knowledge. We propose a novel, theoretically principled framework, lifelong online learning, where the learning process for each task proceeds in an incremental manner. Specifically, our framework is composed of two-level predictions: the prediction information that is solely from the current task; and the prediction from the knowledge base built from previous tasks. Moreover, this article tackles several fundamental challenges: an arbitrary or even non-stationary task generation process, an unknown number of instances in each task, and the construction of an efficient accumulated knowledge base. Notably, we provide a provable bound for the proposed algorithm, which offers insights into how the accumulated knowledge improves the predictions. Finally, empirical evaluations on both synthetic and real datasets validate the effectiveness of the proposed algorithm.

2022-10-16

ACM Transactions on Knowledge Discovery from Data (published)

doi.org

OptEmbed: Learning Optimal Embedding Table for Click-through Rate Prediction

Fuyuan Lyu

Xing Tang

Hao Zhu

Huifeng Guo

Yingxue Zhang

Ruiming Tang

Xue Liu

A click-through rate (CTR) prediction model usually consists of three components: an embedding table, a feature interaction layer, and a classifier. Learning the embedding table plays a fundamental role in CTR prediction from the view of model performance and memory usage. The embedding table is a two-dimensional tensor, with its axes indicating the number of feature values and the embedding dimension, respectively. To learn an efficient and effective embedding table, recent works either assign various embedding dimensions for feature fields and reduce the number of embeddings respectively or mask the embedding table parameters. However, none of these existing works obtains an optimal embedding table. On the one hand, various embedding dimensions still require a large amount of memory due to the vast number of features in the dataset. On the other hand, decreasing the number of embeddings usually suffers from performance degradation, which is intolerable in CTR prediction. Finally, pruning embedding parameters leads to a sparse embedding table, which is hard to deploy. To this end, we propose an optimal embedding table learning framework, OptEmbed, which provides a practical and general method to find an optimal embedding table for various base CTR models. Specifically, we propose pruning the redundant embeddings according to the corresponding features' importance using learnable pruning thresholds. Furthermore, we consider assigning various embedding dimensions as one single candidate architecture. To efficiently search the optimal embedding dimensions, we design a uniform embedding dimension sampling scheme to equally train all candidate architectures, meaning architecture-related parameters and learnable thresholds are trained simultaneously in one supernet. We then propose an evolution search method based on the supernet to find the optimal embedding dimensions for each field. Experiments on public datasets show that OptEmbed can learn a compact embedding table which can further improve the model performance.

2022-10-16

Proceedings of the 31st ACM International Conference on Information & Knowledge Management (published)

doi.org

arxiv.org

Using Graph Algorithms to Pretrain Graph Completion Transformers

Jonathan Pilault

Mikhail Galkin

Bahare Fatemi

Perouz Taslakian

David Vasquez

Christopher Pal

Recent work on Graph Neural Networks has demonstrated that self-supervised pretraining can further enhance performance on downstream graph, link, and node classification tasks. However, the efficacy of pretraining tasks has not been fully investigated for downstream large knowledge graph completion tasks. Using a contextualized knowledge graph embedding approach, we investigate five different pretraining signals, constructed using several graph algorithms and no external data, as well as their combination. We leverage the versatility of our Transformer-based model to explore graph structure generation pretraining tasks (i.e. path and k-hop neighborhood generation), typically inapplicable to most graph embedding methods. We further propose a new path-finding algorithm guided by information gain and find that it is the best-performing pretraining task across three downstream knowledge graph completion datasets. While using our new path-finding algorithm as a pretraining signal provides 2-3% MRR improvements, we show that pretraining on all signals together gives the best knowledge graph completion results. In a multitask setting that combines all pretraining tasks, our method surpasses the latest strong-performing knowledge graph embedding methods on all metrics for FB15K-237, on MRR and Hit@1 for WN18RR, and on MRR and Hit@10 for JF17K (a knowledge hypergraph dataset).

2022-10-13

ArXiv (preprint)

doi.org

arxiv.org

Inductive biases for deep learning of higher-level cognition

Anirudh Goyal

Yoshua Bengio

2022-10-11

Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences (published)

doi.org

arxiv.org

Lookback for Learning to Branch

Prateek Gupta

Elias B. Khalil

Didier Chételat

Maxime Gasse

Yoshua Bengio

Andrea Lodi

M. Pawan Kumar

Expressive and computationally inexpensive bipartite Graph Neural Networks (GNNs) have been shown to be an important component of deep learning-based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) heuristic in branch-and-bound (B&B) solvers. These GNNs are trained, offline and on a collection of MILPs, to imitate a very good but computationally expensive branching heuristic, strong branching. Given that B&B results in a tree of sub-MILPs, we ask (a) whether there are strong dependencies exhibited by the target heuristic among the neighboring nodes of the B&B tree, and (b) if so, whether we can incorporate them in our training procedure. Specifically, we find that with the strong branching heuristic, a child node's best choice was often the parent's second-best choice. We call this the "lookback" phenomenon. Surprisingly, the typical branching GNN of Gasse et al. (2019) often misses this simple "answer". To imitate the target behavior more closely by incorporating the lookback phenomenon in GNNs, we propose two methods: (a) target smoothing for the standard cross-entropy loss function, and (b) adding a Parent-as-Target (PAT) Lookback regularizer term. Finally, we propose a model selection framework to incorporate harder-to-formulate objectives such as solving time in the final models. Through extensive experimentation on standard benchmark instances, we show that our proposal results in up to a 22% decrease in the size of the B&B tree and up to a 15% improvement in solving times.

2022-10-09

TMLR (accepted)

doi.org

openreview.net

Dissecting adaptive methods in GANs

Samy Jelassi

David Dobre

Arthur Mensch

Yuanzhi Li

Gauthier Gidel

Adaptive methods are a crucial component widely used for training generative adversarial networks (GANs). While there has been some work to pinpoint the “marginal value of adaptive methods” in standard tasks, it remains unclear why they are still critical for GAN training. In this paper, we formally study how adaptive methods help train GANs; inspired by the grafting method proposed in Agarwal et al. (2020), we separate the magnitude and direction components of the Adam updates, and graft them to the direction and magnitude of SGDA updates respectively. By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training. This motivates us to have a closer look at the class of normalized stochastic gradient descent ascent (nSGDA) methods in the context of GAN training. We propose a synthetic theoretical framework to compare the performance of nSGDA and SGDA for GAN training with neural networks. We prove that in that setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse. The critical insight in our analysis is that normalizing the gradients forces the discriminator and generator to be updated at the same pace. We also experimentally show that for several datasets, Adam’s performance can be recovered with nSGDA methods.

2022-10-08

ArXiv (preprint)

doi.org

openreview.net
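
The claim in the "Dissecting adaptive methods in GANs" abstract above, that normalizing the gradients forces both players to be updated at the same pace, can be illustrated with a minimal sketch. It assumes each player's gradient is divided by its own Euclidean norm, and it uses hand-picked gradient vectors rather than a real GAN. The function names, the `eps` term, and the numbers are hypothetical, and the paper's exact nSGDA variant may differ.

```python
import numpy as np

def sgda_step(x, y, grad_x, grad_y, lr):
    """Plain descent-ascent: x descends, y ascends, step length scales with the gradient."""
    return x - lr * grad_x, y + lr * grad_y

def nsgda_step(x, y, grad_x, grad_y, lr, eps=1e-12):
    """Normalized descent-ascent: each player moves a distance of about lr."""
    x_new = x - lr * grad_x / (np.linalg.norm(grad_x) + eps)
    y_new = y + lr * grad_y / (np.linalg.norm(grad_y) + eps)
    return x_new, y_new

# Hypothetical gradients with badly mismatched scales between the two players.
x = np.zeros(2)
y = np.zeros(2)
grad_x = np.array([300.0, -400.0])   # norm 500
grad_y = np.array([0.003, 0.004])    # norm 0.005
lr = 0.1

plain_x, plain_y = sgda_step(x, y, grad_x, grad_y, lr)
norm_x, norm_y = nsgda_step(x, y, grad_x, grad_y, lr)

print("SGDA step lengths: ", np.linalg.norm(plain_x - x), np.linalg.norm(plain_y - y))
print("nSGDA step lengths:", np.linalg.norm(norm_x - x), np.linalg.norm(norm_y - y))
```

With the plain update, one player moves five orders of magnitude further than the other. With the normalized update, both step lengths equal the learning rate, while the update directions are unchanged.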

PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors

Hung-Yang Chang

Seyyed Hasan Mozafari

Cheng Chen

James J. Clark

Brett H. Meyer

Warren J. Gross

2022-10-07

Journal of Signal Processing Systems (published)

doi.org
