Publications

On-line Adaptative Curriculum Learning for GANs
Thang Doan
Joao Monteiro
Isabela Albuquerque
R Devon Hjelm
Generative Adversarial Networks (GANs) can successfully approximate a probability distribution and produce realistic samples. However, open questions such as sufficient convergence conditions and mode collapse still persist. In this paper, we build on existing work in the area by proposing a novel framework for training the generator against an ensemble of discriminator networks, which can be seen as a one-student/multiple-teachers setting. We formalize this problem within the full-information adversarial bandit framework, where we evaluate the capability of an algorithm to select mixtures of discriminators for providing the generator with feedback during learning. To this end, we propose a reward function which reflects the progress made by the generator and dynamically update the mixture weights allocated to each discriminator. We also draw connections between our algorithm and stochastic optimization methods and then show that existing approaches using multiple discriminators in the literature can be recovered from our framework. We argue that less expressive discriminators are smoother and have a general coarse-grained view of the mode map, which forces the generator to cover a wide portion of the data distribution support. On the other hand, highly expressive discriminators ensure sample quality. Finally, experimental results show that our approach improves sample quality and diversity over existing baselines by effectively learning a curriculum. These results also support the claim that weaker discriminators have higher entropy, improving mode coverage. Keywords: multiple discriminators, curriculum learning, multiple resolutions discriminators, multi-armed bandits, generative adversarial networks, smooth discriminators, multi-discriminator gan training, multiple experts.
Towards Non-Saturating Recurrent Units for Modelling Long-Term Dependencies
Modelling long-term dependencies is a challenge for recurrent neural networks. This is primarily due to the fact that gradients vanish during training, as the sequence length increases. Gradients can be attenuated by transition operators and are attenuated or dropped by activation functions. Canonical architectures like LSTM alleviate this issue by skipping information through a memory mechanism. We propose a new recurrent architecture (Non-saturating Recurrent Unit; NRU) that relies on a memory mechanism but forgoes both saturating activation functions and saturating gates, in order to further alleviate vanishing gradients. In a series of synthetic and real world tasks, we demonstrate that the proposed model is the only model that performs among the top 2 models across all tasks with and without long-term dependencies, when compared against a range of other architectures.
Towards Understanding Generalization in Gradient-Based Meta-Learning
Christopher Pal
In this work we study generalization of neural networks in gradient-based meta-learning by analyzing various properties of the objective landscapes. We experimentally demonstrate that as meta-training progresses, the meta-test solutions, obtained by adapting the meta-train solution of the model to new tasks via a few steps of gradient-based fine-tuning, become flatter, lower in loss, and further away from the meta-train solution. We also show that those meta-test solutions become flatter even as generalization starts to degrade, thus providing experimental evidence against the correlation between generalization and flat minima in the paradigm of gradient-based meta-learning. Furthermore, we provide empirical evidence that generalization to new tasks is correlated with the coherence between their adaptation trajectories in parameter space, measured by the average cosine similarity between task-specific trajectory directions, starting from the same meta-train solution. We also show that coherence of meta-test gradients, measured by the average inner product between the task-specific gradient vectors evaluated at the meta-train solution, is also correlated with generalization. Based on these observations, we propose a novel regularizer for MAML and provide experimental evidence for its effectiveness.
Weakly-supervised Knowledge Graph Alignment with Adversarial Learning
This paper studies aligning knowledge graphs from different sources or languages. Most existing methods train supervised methods for the alignment, which usually require a large number of aligned knowledge triplets. However, such a large number of aligned knowledge triplets may not be available or may be expensive to obtain in many domains. Therefore, in this paper we propose to study aligning knowledge graphs in a fully unsupervised or weakly supervised fashion, i.e., without or with only a few aligned triplets. We propose an unsupervised framework to align the entity and relation embeddings of different knowledge graphs with an adversarial learning framework. Moreover, a regularization term which maximizes the mutual information between the embeddings of different knowledge graphs is used to mitigate the problem of mode collapse when learning the alignment functions. Such a framework can be further seamlessly integrated with existing supervised methods by utilizing a limited number of aligned triples as guidance. Experimental results on multiple datasets prove the effectiveness of our proposed approach in both the unsupervised and the weakly-supervised settings.
Self-supervised Learning of Distance Functions for Goal-Conditioned Reinforcement Learning
Srinivas Venkattaramanujam
Thang Doan
Goal-conditioned policies are used in order to break down complex reinforcement learning (RL) problems by using subgoals, which can be defined either in state space or in a latent feature space. This can increase the efficiency of learning by using a curriculum, and also enables simultaneous learning and generalization across goals. A crucial requirement of goal-conditioned policies is to be able to determine whether the goal has been achieved. Having a notion of distance to a goal is thus a crucial component of this approach. However, it is not straightforward to come up with an appropriate distance, and in some tasks, the goal space may not even be known a priori. In this work we learn, in a self-supervised approach, a distance-to-goal estimate which is computed in terms of the number of actions that would need to be carried out. Our method solves complex tasks without prior domain knowledge in the online setting in three different scenarios in the context of goal-conditioned policies: (a) the goal space is the same as the state space; (b) the goal space is given but an appropriate distance is unknown; and (c) the state space is accessible, but only a subset of the state space represents desired goals, and this subset is known a priori. We also propose a goal-generation mechanism as a secondary contribution.
A Cross-Domain Transferable Neural Coherence Model
Peng Xu
Hamidreza Saghir
Jin Sung Kang
Teng Long
Avishek Joey Bose
Yanshuai Cao
Jackie Chi Kit Cheung
Coherence is an important aspect of text quality and is crucial for ensuring its readability. One important limitation of existing coherence models is that training on one domain does not easily generalize to unseen categories of text. Previous work advocates for generative models for cross-domain generalization, because for discriminative models, the space of incoherent sentence orderings to discriminate against during training is prohibitively large. In this work, we propose a local discriminative neural model with a much smaller negative sampling space that can efficiently learn against incorrect orderings. The proposed coherence model is simple in structure, yet it significantly outperforms previous state-of-the-art methods on a standard benchmark dataset on the Wall Street Journal corpus, as well as in multiple new challenging settings of transfer to unseen categories of discourse on Wikipedia articles.
EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing
Mehdi Rezagholizadeh
Jackie CK Cheung
We present the first sentence simplification model that learns explicit edit operations (ADD, DELETE, and KEEP) via a neural programmer-interpreter approach. Most current neural sentence simplification systems are variants of sequence-to-sequence models adopted from machine translation. These methods learn to simplify sentences as a byproduct of the fact that they are trained on complex-simple sentence pairs. By contrast, our neural programmer-interpreter is directly trained to predict explicit edit operations on targeted parts of the input sentence, resembling the way that humans perform simplification and revision. Our model outperforms previous state-of-the-art neural sentence simplification models (without external knowledge) by large margins on three benchmark text simplification corpora in terms of SARI (+0.95 WikiLarge, +1.89 WikiSmall, +1.41 Newsela), and is judged by humans to produce overall better and simpler output sentences.
Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study
Neural generative models have become increasingly popular when building conversational agents. They offer flexibility, can be easily adapted to new domains, and require minimal domain engineering. A common criticism of these systems is that they seldom understand or use the available dialog history effectively. In this paper, we take an empirical approach to understanding how these models use the available dialog history by studying the sensitivity of the models to artificially introduced unnatural changes or perturbations to their context at test time. We experiment with 10 different types of perturbations on 4 multi-turn dialog datasets and find that commonly used neural dialog architectures like recurrent and transformer-based seq2seq models are rarely sensitive to most perturbations such as missing or reordering utterances, shuffling words, etc. We also open-source our code, which we believe will serve as a useful diagnostic tool for evaluating dialog systems in the future.
Adversarial Computation of Optimal Transport Maps
Jennifer She*
Amjad Almahairi
Sai Rajeswar
Computing optimal transport maps between high-dimensional and continuous distributions is a challenging problem in optimal transport (OT). Generative adversarial networks (GANs) are powerful generative models which have been successfully applied to learn maps across high-dimensional domains. However, little is known about the nature of the map learned with a GAN objective. To address this problem, we propose a generative adversarial model in which the discriminator's objective is the
Investigating Biases in Textual Entailment Datasets
Understanding logical relationships between sentences is an important task in language understanding. To aid progress on this task, researchers have collected datasets for training and evaluating machine learning systems. However, as in the crowdsourced Visual Question Answering (VQA) task, some biases in the data inevitably occur. In our experiments, we find that performing classification on just the hypotheses of the SNLI dataset yields an accuracy of 64%. We analyze the extent of the bias in the SNLI and MultiNLI datasets, discuss its implications, and propose a simple method to reduce the biases in the datasets.
Information matrices and generalization
Valentin Thomas
Fabian Pedregosa
Nicolas Roux
This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address the question of whether C is a reasonable approximation to F, as is commonly assumed.
Anomaly Detection with Joint Representation Learning of Content and Connection
Social media sites are becoming a key factor in politics. These platforms are easy to manipulate for the purpose of distorting the information space to confuse and distract voters. Past works to identify disruptive patterns have mostly focused on analyzing the content of tweets. In this study, we jointly embed the information from both user-posted content and a user's follower network to detect groups of densely connected users in an unsupervised fashion. We then investigate these dense sub-blocks of users to flag anomalous behavior. In our experiments, we study the tweets related to the upcoming 2019 Canadian Elections, and observe a set of densely connected users engaging in local politics in different provinces and exhibiting troll-like behavior.