Publications

Multi-scale Feature Learning Dynamics: Insights for Double Descent

A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, resulting fro… (see more)m the high-dimensional interactions between the large number of network parameters. Such non-trivial dynamics lead to intriguing behaviors such as the phenomenon of "double descent" of the generalization error. The more commonly studied aspect of this phenomenon corresponds to model-wise double descent where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonous transitions, or descents as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error. We validate our findings through numerical experiments where our theory accurately predicts empirical findings and remains consistent with observations in deep neural networks.

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Neural-Symbolic Models for Logical Queries on Knowledge Graphs

Answering complex first-order logic (FOL) queries on knowledge graphs is a fundamental task for multi-hop reasoning. Traditional symbolic me… (see more)thods traverse a complete knowledge graph to extract the answers, which provides good interpretation for each step. Recent neural methods learn geometric embeddings for complex queries. These methods can generalize to incomplete knowledge graphs, but their reasoning process is hard to interpret. In this paper, we propose Graph Neural Network Query Executor (GNN-QE), a neural-symbolic model that enjoys the advantages of both worlds. GNN-QE decomposes a complex FOL query into relation projections and logical operations over fuzzy sets, which provides interpretability for intermediate variables. To reason about the missing links, GNN-QE adapts a graph neural network from knowledge graph completion to execute the relation projections, and models the logical operations with product fuzzy logic. Experiments on 3 datasets show that GNN-QE significantly improves over previous state-of-the-art models in answering FOL queries. Meanwhile, GNN-QE can predict the number of answers without explicit supervision, and provide visualizations for intermediate variables.

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime

Fabian Pedregosa

The recently developed average-case analysis of optimization methods allows a more fine-grained and representative convergence analysis than… (see more) usual worst-case results. In exchange, this analysis requires a more precise hypothesis over the data generating process, namely assuming knowledge of the expected spectral distribution (ESD) of the random matrix associated with the problem. This work shows that the concentration of eigenvalues near the edges of the ESD determines a problem's asymptotic average complexity. This a priori information on this concentration is a more grounded assumption than complete knowledge of the ESD. This approximate concentration is effectively a middle ground between the coarseness of the worst-case scenario convergence and the restrictive previous average-case analysis. We also introduce the Generalized Chebyshev method, asymptotically optimal under a hypothesis on this concentration and globally optimal when the ESD follows a Beta distribution. We compare its performance to classical optimization algorithms, such as gradient descent or Nesterov's scheme, and we show that, in the average-case context, Nesterov's method is universally nearly optimal asymptotically.

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

PatchUp: A Feature-Space Block-Level Regularization Technique for Convolutional Neural Networks

Mojtaba Faramarzi

M. Amini

Akilesh Badrinaaraayanan

Vikas Verma

A. Chandar

Large capacity deep learning models are often prone to a high generalization gap when trained with a limited amount of labeled training data… (see more). A recent class of methods to address this problem uses various ways to construct a new training sample by mixing a pair (or more) of training samples. We propose PatchUp, a hidden state block-level regularization technique for Convolutional Neural Networks (CNNs), that is applied on selected contiguous blocks of feature maps from a random pair of samples. Our approach improves the robustness of CNN models against the manifold intrusion problem that may occur in other state-of-the-art mixing approaches. Moreover, since we are mixing the contiguous block of features in the hidden space, which has more dimensions than the input space, we obtain more diverse samples for training towards different dimensions. Our experiments on CIFAR10/100, SVHN, Tiny-ImageNet, and ImageNet using ResNet architectures including PreActResnet18/34, WRN-28-10, ResNet101/152 models show that PatchUp improves upon, or equals, the performance of current state-of-the-art regularizers for CNNs. We also show that PatchUp can provide a better generalization to deformed samples and is more robust against adversarial attacks.

2022-06-27

Proceedings of the AAAI Conference on Artificial Intelligence (published)

doi.org

arxiv.org

SpeqNets: Sparsity-aware Permutation-equivariant Graph Networks

Christopher Morris

Gaurav Rattan

Sandra Kiefer

Siamak Ravanbkash

While (message-passing) graph neural networks have clear limitations in approximating permutation-equivariant functions over graphs or gener… (see more)al relational data, more expressive, higher-order graph neural networks do not scale to large graphs. They either operate on

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

The Primacy Bias in Deep Reinforcement Learning

This work identifies a common flaw of deep reinforcement learning (RL) algorithms: a tendency to rely on early interactions and ignore usefu… (see more)l evidence encountered later. Because of training on progressively growing datasets, deep RL agents incur a risk of overfitting to earlier experiences, negatively affecting the rest of the learning process. Inspired by cognitive science, we refer to this effect as the primacy bias. Through a series of experiments, we dissect the algorithmic aspects of deep RL that exacerbate this bias. We then propose a simple yet generally-applicable mechanism that tackles the primacy bias by periodically resetting a part of the agent. We apply this mechanism to algorithms in both discrete (Atari 100k) and continuous action (DeepMind Control Suite) domains, consistently improving their performance.

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Towards efficient representation identification in supervised learning

Kartik Ahuja

Divyat Mahajan

Vasilis Syrgkanis

Ioannis Mitliagkas

Humans have a remarkable ability to disentangle complex sensory inputs (e.g., image, text) into simple factors of variation (e.g., shape, co… (see more)lor) without much supervision. This ability has inspired many works that attempt to solve the following question: how do we invert the data generation process to extract those factors with minimal or no supervision? Several works in the literature on non-linear independent component analysis have established this negative result; without some knowledge of the data generation process or appropriate inductive biases, it is impossible to perform this inversion. In recent years, a lot of progress has been made on disentanglement under structural assumptions, e.g., when we have access to auxiliary information that makes the factors of variation conditionally independent. However, existing work requires a lot of auxiliary information, e.g., in supervised classification, it prescribes that the number of label classes should be at least equal to the total dimension of all factors of variation. In this work, we depart from these assumptions and ask: a) How can we get disentanglement when the auxiliary information does not provide conditional independence over the factors of variation? b) Can we reduce the amount of auxiliary information required for disentanglement? For a class of models where auxiliary information does not ensure conditional independence, we show theoretically and experimentally that disentanglement (to a large extent) is possible even when the auxiliary information dimension is much less than the dimension of the true latent representation.

2022-06-27

Proceedings of the First Conference on Causal Learning and Reasoning (published)

doi.org

proceedings.mlr.press

Towards Scaling Difference Target Propagation by Learning Backprop Targets

The development of biologically-plausible learning algorithms is important for understanding learning in the brain, but most of them fail to… (see more) scale-up to real-world tasks, limiting their potential as explanations for learning by real brains. As such, it is important to explore learning algorithms that come with strong theoretical guarantees and can match the performance of backpropagation (BP) on complex tasks. One such algorithm is Difference Target Propagation (DTP), a biologically-plausible learning algorithm whose close relation with Gauss-Newton (GN) optimization has been recently established. However, the conditions under which this connection rigorously holds preclude layer-wise training of the feedback pathway synaptic weights (which is more biologically plausible). Moreover, good alignment between DTP weight updates and loss gradients is only loosely guaranteed and under very specific conditions for the architecture being trained. In this paper, we propose a novel feedback weight training scheme that ensures both that DTP approximates BP and that layer-wise feedback weight training can be restored without sacrificing any theoretical guarantees. Our theory is corroborated by experimental results and we report the best performance ever achieved by DTP on CIFAR-10 and ImageNet 32

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Utility Theory for Sequential Decision Making

Mehran Shakerinava

Siamak Ravanbakhsh

The von Neumann-Morgenstern (VNM) utility theorem shows that under certain axioms of rationality, decision-making is reduced to maximizing t… (see more)he expectation of some utility function. We extend these axioms to increasingly structured sequential decision making settings and identify the structure of the corresponding utility functions. In particular, we show that memoryless preferences lead to a utility in the form of a per transition reward and multiplicative factor on the future return. This result motivates a generalization of Markov Decision Processes (MDPs) with this structure on the agent's returns, which we call Affine-Reward MDPs. A stronger constraint on preferences is needed to recover the commonly used cumulative sum of scalar rewards in MDPs. A yet stronger constraint simplifies the utility function for goal-seeking agents in the form of a difference in some function of states that we call potential functions. Our necessary and sufficient conditions demystify the reward hypothesis that underlies the design of rational agents in reinforcement learning by adding an axiom to the VNM rationality axioms and motivates new directions for AI research involving sequential decision making.

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

VIM: Variational Independent Modules for Video Prediction

Lluis Castrejon

2022-06-27

Proceedings of the First Conference on Causal Learning and Reasoning (published)

proceedings.mlr.press

Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error

Scott Fujimoto

David Meger

Doina Precup

Ofir Nachum

Shixiang Shane Gu

In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is… (see more) uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.

2022-06-27

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

YOUR AUTOREGRESSIVE GENERATIVE MODEL CAN BE BETTER IF YOU TREAT IT AS AN ENERGY-BASED ONE

Yezhen Wang

Tong Che

Bin Li

Kaitao Song

Hengzhi Pei

Yoshua Bengio

Dongsheng Li

2022-06-25

ArXiv (preprint)

doi.org

openreview.net

Mila on Udemy

AI Policy Fellowship Publications

Mila Ventures Launchpad

Publications

Mila on Udemy

AI Policy Fellowship Publications

Mila Ventures Launchpad

Popular keywords:

Publications