Publications

The Paradox of Choice: On the Role of Attention in Hierarchical Reinforcement Learning
Andrei Cristian Nica
Decision-making AI agents are often faced with two important challenges: the depth of the planning horizon, and the branching factor due to having many choices. Hierarchical reinforcement learning methods aim to solve the first problem by providing shortcuts that skip over multiple time steps. To cope with the breadth, it is desirable to restrict the agent's attention at each step to a reasonable number of possible choices. The concept of affordances (Gibson, 1977) suggests that only certain actions are feasible in certain states. In this work, we first characterize "affordances" as a "hard" attention mechanism that strictly limits the available choices of temporally extended options. We then investigate the role of hard versus soft attention in training data collection, abstract value learning in long-horizon tasks, and handling a growing number of choices. To this end, we present an online, model-free algorithm to learn affordances that can be used to further learn subgoal options. Finally, we identify and empirically demonstrate the settings in which the "paradox of choice" arises, i.e. when having fewer but more meaningful choices improves the learning speed and performance of a reinforcement learning agent.
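As a concrete illustration of the "hard attention" view of affordances, the sketch below masks out non-afforded options before an epsilon-greedy choice over option values. It is a minimal, tabular sketch that assumes learned affordance scores and option values are already available; the arrays, threshold, and fallback rule are illustrative placeholders, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_options = 10, 6

# Hypothetical learned affordance scores: how feasible each option is in each state.
# In the paper these are learned online; here they are random placeholders.
affordance_scores = rng.random((n_states, n_options))

# Hypothetical option values, e.g. from an option-value function Q(s, o).
option_values = rng.normal(size=(n_states, n_options))

def afforded_options(state, threshold=0.5):
    """Hard attention: keep only options whose affordance score clears the threshold."""
    mask = affordance_scores[state] >= threshold
    if not mask.any():                      # fallback: if nothing is afforded, allow all options
        mask = np.ones(n_options, dtype=bool)
    return np.flatnonzero(mask)

def select_option(state, epsilon=0.1, threshold=0.5):
    """Epsilon-greedy selection restricted to the afforded subset (fewer, more meaningful choices)."""
    candidates = afforded_options(state, threshold)
    if rng.random() < epsilon:
        return int(rng.choice(candidates))
    return int(candidates[np.argmax(option_values[state, candidates])])

print(select_option(state=3))
```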
Timeliness of reporting of SARS-CoV-2 seroprevalence results and their utility for infectious disease surveillance
Claire Donnici
Natasha Ilincic
Christian Cao
Caseng Zhang
Gabriel Deveaux
David A. Clifton
Niklas Bobrovitz
Rahul K. Arora
Author Correction: Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion
Maxwell J. Farrell
Stefano Recanatesi
Timothy Moore
Eric Todd Shea-Brown
Causal inference from text: A commentary
David Blei
Aligning MAGMA by Few-Shot Learning and Finetuning
Jean-Charles Layoun
Alexis Roger
Adapting Triplet Importance of Implicit Feedback for Personalized Recommendation
Haolun Wu
Chen Ma
Yingxue Zhang
Ruiming Tang
Lifelong Online Learning from Accumulated Knowledge
Changjian Shui
William Wang
Ihsen Hedhli
Chi Man Wong
Feng Wan
Boyu Wang
In this article, we formulate lifelong learning as an online transfer learning procedure over consecutive tasks, where learning a given task depends on the accumulated knowledge. We propose a novel, theoretically principled framework, lifelong online learning, in which each task is learned incrementally. Specifically, our framework is composed of two-level predictions: the prediction based solely on the current task, and the prediction drawn from the knowledge base built by previous tasks. Moreover, this article tackles several fundamental challenges: an arbitrary or even non-stationary task generation process, an unknown number of instances in each task, and the construction of an efficient accumulated knowledge base. Notably, we provide a provable bound for the proposed algorithm, which offers insight into how the accumulated knowledge improves the predictions. Finally, empirical evaluations on both synthetic and real datasets validate the effectiveness of the proposed algorithm.
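A rough sketch of the two-level prediction idea, assuming linear predictors, squared loss, and a fixed convex mixture between the current-task model and the accumulated knowledge base; the mixing weight, update rule, and knowledge-base aggregation below are simplifications for illustration, not the algorithm analyzed in the article.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 5

knowledge_base = np.zeros(dim)   # predictor accumulated from previous tasks (level 2)
alpha = 0.5                      # mixing weight between the two prediction levels (assumed fixed)

def run_task(n_instances, true_w, lr=0.1):
    """Process one task online, mixing current-task and knowledge-base predictions."""
    global knowledge_base
    w_task = np.zeros(dim)                       # level 1: predictor learned on this task only
    for _ in range(n_instances):
        x = rng.normal(size=dim)
        y = true_w @ x + 0.1 * rng.normal()      # synthetic instance from the current task
        y_hat = alpha * (w_task @ x) + (1 - alpha) * (knowledge_base @ x)   # two-level prediction
        grad = (y_hat - y) * x                   # squared-loss gradient w.r.t. the task predictor
        w_task -= lr * alpha * grad
    # Fold the finished task into the accumulated knowledge base (simple averaging here).
    knowledge_base = 0.5 * (knowledge_base + w_task)

for _ in range(3):                               # tasks arrive sequentially, sizes unknown in advance
    run_task(n_instances=int(rng.integers(20, 100)), true_w=rng.normal(size=dim))
print(knowledge_base)
```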
OptEmbed: Learning Optimal Embedding Table for Click-through Rate Prediction
Fuyuan Lyu
Xing Tang
Hong Zhu
Huifeng Guo
Yingxue Zhang
Ruiming Tang
A click-through rate (CTR) prediction model usually consists of three components: an embedding table, a feature interaction layer, and a classifier. Learning the embedding table plays a fundamental role in CTR prediction in terms of both model performance and memory usage. The embedding table is a two-dimensional tensor whose axes correspond to the number of feature values and the embedding dimension, respectively. To learn an efficient and effective embedding table, recent works either assign varying embedding dimensions to feature fields, reduce the number of embeddings, or mask the embedding table parameters. However, none of these existing works yields an optimal embedding table. On the one hand, varying embedding dimensions still require a large amount of memory due to the vast number of features in the dataset. On the other hand, decreasing the number of embeddings usually causes performance degradation, which is intolerable in CTR prediction. Finally, pruning embedding parameters leads to a sparse embedding table, which is hard to deploy. To this end, we propose OptEmbed, an optimal embedding table learning framework that provides a practical and general method to find an optimal embedding table for various base CTR models. Specifically, we prune redundant embeddings according to the corresponding features' importance using learnable pruning thresholds. Furthermore, we treat each assignment of embedding dimensions as a single candidate architecture. To efficiently search for the optimal embedding dimensions, we design a uniform embedding dimension sampling scheme that trains all candidate architectures equally, meaning architecture-related parameters and learnable thresholds are trained simultaneously in one supernet. We then propose a supernet-based evolutionary search method to find the optimal embedding dimension for each field. Experiments on public datasets show that OptEmbed can learn a compact embedding table that further improves model performance.
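One ingredient described above, pruning redundant embeddings with learnable thresholds, can be sketched roughly as follows in PyTorch. The norm-based importance proxy, the single shared threshold, and the straight-through masking are assumptions made for a self-contained illustration; they are not the OptEmbed implementation.

```python
import torch
import torch.nn as nn

class ThresholdPrunedEmbedding(nn.Module):
    """Embedding table whose rows are masked out when their importance falls below a learnable threshold."""

    def __init__(self, num_features, embed_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_features, embed_dim) * 0.01)
        self.threshold = nn.Parameter(torch.zeros(1))   # learnable pruning threshold (shared here for simplicity)

    def forward(self, feature_ids):
        rows = self.weight[feature_ids]                       # (batch, embed_dim)
        importance = rows.norm(dim=-1, keepdim=True)          # norm as a stand-in importance measure
        soft_mask = torch.sigmoid(importance - self.threshold)
        hard_mask = (importance >= self.threshold).float()
        # Straight-through estimator: hard mask in the forward pass, soft gradient in the backward pass.
        mask = hard_mask + soft_mask - soft_mask.detach()
        return rows * mask

emb = ThresholdPrunedEmbedding(num_features=1000, embed_dim=16)
out = emb(torch.randint(0, 1000, (8,)))
print(out.shape)   # torch.Size([8, 16])
```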
Inductive biases for deep learning of higher-level cognition
Anirudh Goyal
Lookback for Learning to Branch
Prateek Gupta
Elias Boutros Khalil
Didier Chételat
M. Pawan Kumar
Dissecting adaptive methods in GANs
Samy Jelassi
David Dobre
Arthur Mensch
Yuanzhi Li
Adaptive methods are a crucial component widely used for training generative adversarial networks (GANs). While there has been some work to pinpoint the “marginal value of adaptive methods” in standard tasks, it remains unclear why they are still critical for GAN training. In this paper, we formally study how adaptive methods help train GANs; inspired by the grafting method proposed in Agarwal et al. (2020), we separate the magnitude and direction components of the Adam updates, and graft them to the direction and magnitude of SGDA updates respectively. By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training. This motivates us to have a closer look at the class of normalized stochastic gradient descent ascent (nSGDA) methods in the context of GAN training. We propose a synthetic theoretical framework to compare the performance of nSGDA and SGDA for GAN training with neural networks. We prove that in that setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse. The critical insight in our analysis is that normalizing the gradients forces the discriminator and generator to be updated at the same pace. We also experimentally show that for several datasets, Adam’s performance can be recovered with nSGDA methods.
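The grafting construction described above, taking the magnitude of the Adam update and the normalized direction of the SGD update, can be sketched for a single parameter tensor as follows. The per-tensor state handling and the surrounding toy usage are assumptions for illustration, not the authors' training setup.

```python
import torch

def grafted_step(param, grad, adam_state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Graft the magnitude of the Adam update onto the normalized direction of the SGD update."""
    m, v, t = adam_state
    t += 1
    m = betas[0] * m + (1 - betas[0]) * grad
    v = betas[1] * v + (1 - betas[1]) * grad ** 2
    m_hat = m / (1 - betas[0] ** t)                       # bias-corrected first moment
    v_hat = v / (1 - betas[1] ** t)                       # bias-corrected second moment
    adam_update = lr * m_hat / (v_hat.sqrt() + eps)       # supplies the magnitude
    sgd_direction = grad / (grad.norm() + eps)            # supplies the normalized direction
    new_param = param - adam_update.norm() * sgd_direction
    return new_param, (m, v, t)

# Toy usage on a single parameter tensor with a stand-in gradient.
p = torch.randn(4)
state = (torch.zeros(4), torch.zeros(4), 0)
g = torch.randn(4)
p, state = grafted_step(p, g, state)
print(p)
```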
PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors
Hung-Yang Chang
Seyyed Hasan Mozafari
Cheng Chen
James J. Clark
Brett Meyer