Publications

Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Accuracy
Alex Lamb
Vikas Verma
Juho Kannala
Adversarial robustness has become a central goal in deep learning, both in theory and practice. However, successful methods to improve adver… (see more)sarial robustness (such as adversarial training) greatly hurt generalization performance on the clean data. This could have a major impact on how adversarial robustness affects real world systems (i.e. many may opt to forego robustness if it can improve performance on the clean data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases clean test error from 5.8% to 16.7%, whereas with our Interpolated adversarial training we retain adversarial robustness while achieving a clean test error of only 6.5%. With our technique, the relative error increase for the robust model is reduced from 187.9% to just 12.1%.
Learning Brain Dynamics from Calcium Imaging with Coupled van der Pol and LSTM
Germán Abrevaya
Aleksandr Y. Aravkin
Guillermo Cecchi
James Kozloski
Pablo Polosecki
Peng Zheng
Silvina Ponce Dawson
Juliana Y. Rhee
David Daniel Cox
Many real-world data sets, especially in biology, are produced by complex nonlinear dynamical systems. In this paper, we focus on brain calc… (see more)ium imaging (CaI) of different organisms (zebrafish and rat), aiming to build a model of joint activation dynamics in large neuronal populations, including the whole brain of zebrafish. We propose a new approach for capturing dynamics of temporal SVD components that uses the coupled (multivariate) van der Pol (VDP) oscillator, a nonlinear ordinary differential equation (ODE) model describing neural activity, with a new parameter estimation technique that combines variable projection optimization and stochastic search. We show that the approach successfully handles nonlinearities and hidden state variables in the coupled VDP. The approach is accurate, achieving 0.82 to 0.94 correlation between the actual and model-generated components, and interpretable, as VDP’s coupling matrix reveals anatomically meaningful positive (excitatory) and negative (inhibitory) interactions across different brain subsystems corresponding to spatial SVD components. Moreover, VDP is comparable to (or sometimes better than) recurrent neural networks (LSTM) for (short-term) prediction of future brain activity; VDP needs less parameters to train, which was a plus on our small training data. Finally, the overall best predictive method, greatly outperforming both VDP and LSTM in shortand long-term predicitve settings on both datasets, was the new hybrid VDP-LSTM approach that used VDP to simulate large domain-specific dataset for LSTM pretraining; note that simple LSTM data-augmentation via noisy versions of training data was much less effective.
Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference
Matthew D Riemer
Ignacio Cases
Robert Ajemian
Miao Liu
Yuhai Tu
Gerald Tesauro
Lack of performance when it comes to continual learning over non-stationary distributions of data remains a major challenge in scaling neura… (see more)l network learning to more human realistic settings. In this work we propose a new conceptualization of the continual learning problem in terms of a temporally symmetric trade-off between transfer and interference that can be optimized by enforcing gradient alignment across examples. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization based meta-learning. This method learns parameters that make interference based on future gradients less likely and transfer based on future gradients more likely. We conduct experiments across continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments show that the gap between the performance of MER and baseline algorithms grows both as the environment gets more non-stationary and as the fraction of the total experiences stored gets smaller.
LF-PPL: A Low-Level First Order Probabilistic Programming Language for Non-Differentiable Models
Yuanshuo Zhou
Bradley Gram-Hansen
Tobias Kohn
Tom Rainforth
Hongseok Yang
We develop a new Low-level, First-order Probabilistic Programming Language~(LF-PPL) suited for models containing a mix of continuous, discre… (see more)te, and/or piecewise-continuous variables. The key success of this language and its compilation scheme is in its ability to automatically distinguish parameters the density function is discontinuous with respect to, while further providing runtime checks for boundary crossings. This enables the introduction of new inference engines that are able to exploit gradient information, while remaining efficient for models which are not everywhere differentiable. We demonstrate this ability by incorporating a discontinuous Hamiltonian Monte Carlo (DHMC) inference engine that is able to deliver automated and efficient inference for non-differentiable models. Our system is backed up by a mathematical formalism that ensures that any model expressed in this language has a density with measure zero discontinuities to maintain the validity of the inference engine.
Reducing the variance in online optimization by transporting past gradients
Sébastien M. R. Arnold
Pierre-Antoine Manzagol
Reza Babanezhad Harikandeh
Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing pas… (see more)t gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness due to using past gradients. We propose to correct this staleness using the idea of implicit gradient transport (IGT) which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal.
La science, un droit humain. Mettre en œuvre le principe d’une science participative, équitable et accessible à tous
The Termination Critic
Anna Harutyunyan
Will Dabney
Diana Borsa
Nicolas Heess
Remi Munos
In this work, we consider the problem of autonomously discovering behavioral abstractions, or options, for reinforcement learning agents. We… (see more) propose an algorithm that focuses on the termination function, as opposed to - as is common - the policy. The termination function is usually trained to optimize a control objective: an option ought to terminate if another has better value. We offer a different, information-theoretic perspective, and propose that terminations should focus instead on the compressibility of the option’s encoding - arguably a key reason for using abstractions.To achieve this algorithmically, we leverage the classical options framework, and learn the option transition model as a “critic” for the termination function. Using this model, we derive gradients that optimize the desired criteria. We show that the resulting options are non-trivial, intuitively meaningful, and useful for learning.
Toward Requirements Specification for Machine-Learned Components
Mona Rahimi
Sahar Kokaly
Marsha Chechik
In current practice, the behavior of Machine-Learned Components (MLCs) is not sufficiently specified by the predefined requirements. Instead… (see more), they "learn" existing patterns from the available training data, and make predictions for unseen data when deployed. On the surface, their ability to extract patterns and to behave accordingly is specifically useful for hard-to-specify concepts in certain safety critical domains (e.g., the definition of a pedestrian in a pedestrian detection component in a vehicle). However, the lack of requirements specifications on their behaviors makes further software engineering tasks challenging for such components. This is especially concerning for tasks such as safety assessment and assurance. In this position paper, we call for more attention from the requirements engineering community on supporting the specification of requirements for MLCs in safety critical domains. Towards that end, we propose an approach to improve the process of requirements specification in which an MLC is developed and operates by explicitly specifying domain-related concepts. Our approach extracts a universally accepted benchmark for hard-to-specify concepts (e.g., "pedestrian") and can be used to identify gaps in the associated dataset and the constructed machine-learned model.
Building a Neural Semantic Parser from a Domain Ontology
Jianpeng Cheng
Mirella Lapata
Semantic parsing is the task of converting natural language utterances into machine interpretable meaning representations which can be execu… (see more)ted against a real-world environment such as a database. Scaling semantic parsing to arbitrary domains faces two interrelated challenges: obtaining broad coverage training data effectively and cheaply; and developing a model that generalizes to compositional utterances and complex intentions. We address these challenges with a framework which allows to elicit training data from a domain ontology and bootstrap a neural parser which recursively builds derivations of logical forms. In our framework meaning representations are described by sequences of natural language templates, where each template corresponds to a decomposed fragment of the underlying meaning representation. Although artificial, templates can be understood and paraphrased by humans to create natural utterances, resulting in parallel triples of utterances, meaning representations, and their decompositions. These allow us to train a neural semantic parser which learns to compose rules in deriving meaning representations. We crowdsource training data on six domains, covering both single-turn utterances which exhibit rich compositionality, and sequential utterances where a complex task is procedurally performed in steps. We then develop neural semantic parsers which perform such compositional tasks. In general, our approach allows to deploy neural semantic parsers quickly and cheaply from a given domain ontology.
Clustering-Oriented Representation Learning with Attractive-Repulsive Loss
Kian Kenyon-Dean
Andre Cianflone
Lucas Caccia
The standard loss function used to train neural network classifiers, categorical cross-entropy (CCE), seeks to maximize accuracy on the trai… (see more)ning data; building useful representations is not a necessary byproduct of this objective. In this work, we propose clustering-oriented representation learning (COREL) as an alternative to CCE in the context of a generalized attractive-repulsive loss framework. COREL has the consequence of building latent representations that collectively exhibit the quality of natural clustering within the latent space of the final hidden layer, according to a predefined similarity function. Despite being simple to implement, COREL variants outperform or perform equivalently to CCE in a variety of scenarios, including image and news article classification using both feed-forward and convolutional neural networks. Analysis of the latent spaces created with different similarity functions facilitates insights on the different use cases COREL variants can satisfy, where the Cosine-COREL variant makes a consistently clusterable latent space, while Gaussian-COREL consistently obtains better classification accuracy than CCE.
Learning Typed Entailment Graphs with Global Soft Constraints
Mohammad Javad Hosseini
Nathanael Chambers
Xavier R. Holt
Shay B. Cohen
Mark Johnson
Mark Steedman
This paper presents a new method for learning typed entailment graphs from text. We extract predicate-argument structures from multiple-sour… (see more)ce news corpora, and compute local distributional similarity scores to learn entailments between predicates with typed arguments (e.g., person contracted disease). Previous work has used transitivity constraints to improve local decisions, but these constraints are intractable on large graphs. We instead propose a scalable method that learns globally consistent similarity scores based on new soft constraints that consider both the structures across typed entailment graphs and inside each graph. Learning takes only a few hours to run over 100K predicates and our results show large improvements over local similarity scores on two entailment data sets. We further show improvements over paraphrases and entailments from the Paraphrase Database, and prior state-of-the-art entailment graphs. We show that the entailment graphs improve performance in a downstream task.
Contextual Bandits for Adapting Treatment in a Mouse Model of de Novo Carcinogenesis
Charis Achilleos
Demetris C Iacovides
Katerina Strati
Georgios D. Mitsis
In this work, we present a specific case study where we aim to design effective treatment allocation strategies and validate these using a m… (see more)ouse model of skin cancer. Collecting data for modelling treatments effectiveness on animal models is an expensive and time consuming process. Moreover, acquiring this information during the full range of disease stages is hard to achieve with a conventional random treatment allocation procedure, as poor treatments cause deterioration of subject health. We therefore aim to design an adaptive allocation strategy to improve the efficiency of data collection by allocating more samples for exploring promising treatments. We cast this application as a contextual bandit problem and introduce a simple and practical algorithm for exploration-exploitation in this framework. The work builds on a recent class of approaches for non-contextual bandits that relies on subsampling to compare treatment options using an equivalent amount of information. On the technical side, we extend the subsampling strategy to the case of bandits with context, by applying subsampling within Gaussian Process regression. On the experimental side, preliminary results using 10 mice with skin tumours suggest that the proposed approach extends by more than 50% the subjects life duration compared with baseline strategies: no treatment, random treatment allocation, and constant chemotherapeutic agent. By slowing the tumour growth rate, the adaptive procedure gathers information about treatment effectiveness on a broader range of tumour volumes, which is crucial for eventually deriving sequential pharmacological treatment strategies for cancer.