Publications

Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules

Murray Shanahan

Michael Mozer

Robust perception relies on both bottom-up and top-down signals. Bottom-up signals consist of what's directly observed through sensation. To… (voir plus)p-down signals consist of beliefs and expectations based on past experience and short-term memory, such as how the phrase `peanut butter and~...' will be completed. The optimal combination of bottom-up and top-down information remains an open question, but the manner of combination must be dynamic and both context and task dependent. To effectively utilize the wealth of potential top-down information available, and to prevent the cacophony of intermixed signals in a bidirectional architecture, mechanisms are needed to restrict information flow. We explore deep recurrent neural net architectures in which bottom-up and top-down signals are dynamically combined using attention. Modularity of the architecture further restricts the sharing and communication of information. Together, attention and modularity direct information flow, which leads to reliable performance improvements in perceptual and language tasks, and in particular improves robustness to distractions and noisy data. We demonstrate on a variety of benchmarks in language modeling, sequential image classification, video prediction and reinforcement learning that the \emph{bidirectional} information flow can improve results over strong baselines.

2019-12-31

ICML (publié)

doi.org

proceedings.mlr.press

Learning Graph Structure With A Finite-State Automaton Layer

Daniel D. Johnson

Hugo Larochelle

Daniel Tarlow

2019-12-31

Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (publié)

arxiv.org

Learning Long-term Dependencies Using Cognitive Inductive Biases in Self-attention RNNs

Attention and self-attention mechanisms, inspired by cognitive processes, are now central to state-of-the-art deep learning on sequential ta… (voir plus)sks. However, most recent progress hinges on heuristic approaches that rely on considerable memory and computational resources that scale poorly. In this work, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. We use simple numerical experiments to demonstrate that this mechanism helps enable recurrent systems on generalization and transfer learning tasks. Based on our results, we propose a concrete direction of research to improve scalability and generalization of attentive recurrent networks.

2019-12-31

(publié)

www.semanticscholar.org

Learning the Arrow of Time for Problems in Reinforcement Learning

Nasim Rahaman

Steffen Wolf

Anirudh Goyal

Roman Remme

Yoshua Bengio

2019-12-31

ICLR (publié)

openreview.net

Measuring Systematic Generalization in Neural Proof Generation with Transformers

Nicolas Gontier

Koustuv Sinha

Siva Reddy

Christopher Pal

We are interested in understanding how well Transformer language models (TLMs) can perform reasoning tasks when trained on knowledge encoded… (voir plus) in the form of natural language. We investigate their systematic generalization abilities on a logical reasoning task in natural language, which involves reasoning over relationships between entities grounded in first-order logical proofs. Specifically, we perform soft theorem-proving by leveraging TLMs to generate natural language proofs. We test the generated proofs for logical consistency, along with the accuracy of the final inference. We observe length-generalization issues when evaluated on longer-than-trained sequences. However, we observe TLMs improve their generalization performance after being exposed to longer, exhaustive proofs. In addition, we discover that TLMs are able to generalize better using backward-chaining proofs compared to their forward-chaining counterparts, while they find it easier to generate forward chaining proofs. We observe that models that are not trained to generate proofs are better at generalizing to problems based on longer proofs. This suggests that Transformers have efficient internal reasoning strategies that are harder to interpret. These results highlight the systematic generalization behavior of TLMs in the context of logical reasoning, and we believe this work motivates deeper inspection of their underlying reasoning strategies.

2019-12-31

NeurIPS (publié)

doi.org

arxiv.org

MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining

Zhi Wen

Xing Han Lu

Siva Reddy

2019-12-31

arXiv (prépublication)

doi.org

arxiv.org

Medical Imaging with Deep Learning: MIDL 2020 -- Short Paper Track

Tal Arbel

Ismail Ben Ayed

Marleen de Bruijne

Maxime Descoteaux

Hervé Lombaert

Chris Pal

This compendium gathers all the accepted extended abstracts from the Third International Conference on Medical Imaging with Deep Learning (M… (voir plus)IDL 2020), held in Montreal, Canada, 6-9 July 2020. Note that only accepted extended abstracts are listed here, the Proceedings of the MIDL 2020 Full Paper Track are published in the Proceedings of Machine Learning Research (PMLR).

2019-12-31

arXiv (prépublication)

doi.org

arxiv.org

Meta Attention Networks: Meta Learning Attention To Modulate Information Between Sparsely Interacting Recurrent Modules

Kanika Madan

Nan Rosemary Ke

Anirudh Goyal

Yoshua Bengio

Decomposing knowledge into interchangeable pieces promises a generalization advantage when, at some level of representation, the learner is … (voir plus)likely to be faced with situations requiring novel combinations of existing pieces of knowledge or computation. We hypothesize that such a decomposition of knowledge is particularly relevant for higher levels of representation as we see this at work in human cognition and natural language in the form of systematicity or systematic generalization. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs, as well as its reward function are stationary and can be re-used across tasks and changes in distribution. As the learner is confronted with variations in experiences, the attention selects which modules should be adapted and the parameters of those selected modules are adapted fast, while the parameters of attention mechanisms are updated slowly as meta-parameters. We ﬁnd that both the meta-learning and the modular aspects of the proposed system greatly help achieve faster learning in experiments with reinforcement learning setup involving navigation in a partially observed grid world.

2019-12-31

(publié)

www.semanticscholar.org

A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms

Yoshua Bengio

Tristan Deleu

Nasim Rahaman

Nan Rosemary Ke

Sébastien Lachapelle

Olexa Bilaniuk

Anirudh Goyal

Christopher Pal

We propose to meta-learn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional change… (voir plus)s, e.g. due to interventions, actions of agents and other sources of non-stationarities. We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately. This leads to sparse expected gradients and a lower effective number of degrees of freedom needing to be relearned while adapting to the change. It motivates using the speed of adaptation to a modified distribution as a meta-learning objective. We demonstrate how this can be used to determine the cause-effect relationship between two observed variables. The distributional changes do not need to correspond to standard interventions (clamping a variable), and the learner has no direct knowledge of these interventions. We show that causal structures can be parameterized via continuous variables and learned end-to-end. We then explore how these ideas could be used to also learn an encoder that would map low-level observed variables to unobserved causal variables leading to faster adaptation out-of-distribution, learning a representation space where one can satisfy the assumptions of independent mechanisms and of small and sparse changes in these mechanisms due to actions and non-stationarities.

2019-12-31

ICLR (publié)

doi.org

openreview.net

Modeling Route Choice with Real-Time Information: Comparing the Recursive and Non-Recursive Approaches

Xinlian Yu

Tien Mai

Jing Ding-Mastera

Song Gao

Emma Frejinger

Transportation systems are inherently uncertain due to disruptions such as bad weather, incident and the randomness of traveler’s choices.… (voir plus) Real-time information allows travelers to adapt to actual traffic conditions and potentially mitigate the adverse effect of uncertainty. We study the routing policy choice problems in a stochastic time-dependent (STD) network. A routing policy is defined as a decision rule applied at the end of each link that maps the realized traffic condition to the decision on the link to take next. Two types of routing policy choice models are formulated with perfect online information (POI): recursive logit model and non-recursive logit model. In the non-recursive model, a choice set of routing policies between an origin-destination (OD) pair is generated, and a probabilistic choice is modeled at the origin, while the choice of the next link at each link is a deterministic execution of the chosen routing policy. In the recursive model, the probabilistic choice of the next link is modeled at each link, following the framework of dynamic discrete choice models. The difference between the two models results from the interplay of two sources of stochasticity, i.e., nature’s probability and choice probability. The two models are equivalent when either source of stochasticity is removed, that is, in a deterministic network (as shown in Fosgerau et al., 2013) or with deterministic choice. We use an illustrative example to explore the difference between the two models when both sources of stochasticity exist, and find that when a route has state-wise stochastic dominance over the other, the recursive model predicts more extreme choice probabilities. The relation can go either way when the two routes are non-dominated. We further compare the two models in terms of computational efficiency in estimation and prediction, and flexibility in systematic utility specification and modeling correlation.

2019-12-31

(publié)

www.semanticscholar.org

Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

Baihan Lin

Guillermo Cecchi

Djallel Bouneffouf

Jenna Reinen

Irina Rish

2019-12-31

HBAI@IJCAI (publié)

doi.org

Myeloarchitecture gradients in the human insula: Histological underpinnings and association to intrinsic functional connectivity

Jessica Royer

Casey Paquola

Sara Larivière

Reinder Vos de Wael

Shahin Tavakol

Alexander J. Lowe

Oualid Benkarim

Alan C. Evans

Danilo Bzdok

Jonathan Smallwood

Birgit Frauscher

Boris C. Bernhardt

2019-12-31

NeuroImage (publié)

doi.org

Fondateur en résidence Mila Ventures

TRAIL : IA responsable pour les professionnels et les leaders

Avantage IA : productivité dans la fonction publique

Publications

Fondateur en résidence Mila Ventures

TRAIL : IA responsable pour les professionnels et les leaders

Avantage IA : productivité dans la fonction publique

Mots-clés populaires:

Publications