Publications

Automatic differentiation in ML: Where we are and where we should be going
Bart van Merriënboer
Olivier Breuleux
Arnaud Bergeron
Pascal Lamblin
We review the current state of automatic differentiation (AD) for array programming in machine learning (ML), including the different approa… (see more)ches such as operator overloading (OO) and source transformation (ST) used for AD, graph-based intermediate representations for programs, and source languages. Based on these insights, we introduce a new graph-based intermediate representation (IR) which specifically aims to efficiently support fully-general AD for array programming. Unlike existing dataflow programming representations in ML frameworks, our IR naturally supports function calls, higher-order functions and recursion, making ML models easier to implement. The ability to represent closures allows us to perform AD using ST without a tape, making the resulting derivative (adjoint) program amenable to ahead-of-time optimization using tools from functional language compilers, and enabling higher-order derivatives. Lastly, we introduce a proof of concept compiler toolchain called Myia which uses a subset of Python as a front end.
Challenging Conventional Segmentation Evaluation Metrics Focal Pathology ( Lesion and Tumour ) Segmentation from Patient Images
Frank-Wolfe Splitting via Augmented Lagrangian Method
Minimizing a function over an intersection of convex sets is an important task in optimization that is often much more challenging than mini… (see more)mizing it over each individual constraint set. While traditional methods such as Frank-Wolfe (FW) or proximal gradient descent assume access to a linear or quadratic oracle on the intersection, splitting techniques take advantage of the structure of each sets, and only require access to the oracle on the individual constraints. In this work, we develop and analyze the Frank-Wolfe Augmented Lagrangian (FW-AL) algorithm, a method for minimizing a smooth function over convex compact sets related by a "linear consistency" constraint that only requires access to a linear minimization oracle over the individual constraints. It is based on the Augmented Lagrangian Method (ALM), also known as Method of Multipliers, but unlike most existing splitting methods, it only requires access to linear (instead of quadratic) minimization oracles. We use recent advances in the analysis of Frank-Wolfe and the alternating direction method of multipliers algorithms to prove a sublinear convergence rate for FW-AL over general convex compact sets and a linear convergence rate for polytopes.
A Hierarchical Neural Attention-based Text Classifier
Koustuv Sinha
Yue Dong
Derek Ruths
Deep neural networks have been displaying superior performance over traditional supervised classifiers in text classification. They learn to… (see more) extract useful features automatically when sufficient amount of data is presented. However, along with the growth in the number of documents comes the increase in the number of categories, which often results in poor performance of the multiclass classifiers. In this work, we use external knowledge in the form of topic category taxonomies to aide the classification by introducing a deep hierarchical neural attention-based classifier. Our model performs better than or comparable to state-of-the-art hierarchical models at significantly lower computational cost while maintaining high interpretability.
How can we do better ? Pitfalls in biomedical challenge design and how to address them
Annika Reinke
Matthias Eisenmann
Sinan Onogur
Marko Stankovic
Patrick Scholz
Hrvoje Bogunovic
Andrew P. Bradley
Aaron
Carass
Carolin Feldmann
Alejandro F. Frangi
Peter M. Full
Bram Ginneken Van
Ginneken
Allan Hanbury
Katrin Honauer
Michal Kozubek
Adam Bennett
Landman … (see 22 more)
Keno März
Oskar Maier
Klaus Maier-Hein
Bjoern Menze
Henning Müller
Peter F. Neher
Wiro Niessen
Nasir Rajpoot
Catherine Gregory
Sharp
Korsuk Sirinukunwattana
Stefanie Speidel
Christian Stock
Danail
Stoyanov
Abdel Aziz Taha
F. V. D. Sommen
Ching-Wei Wang
Marc-André Weber
Guoyan Zheng
Pierre Jannin
Lena Maier-Hein
Since the first MICCAI grand challenge was organized in 2007 [1], the impact of biomedical image analysis challenges on both the research fi… (see more)eld as well as on individual careers has been steadily growing. For example, the acceptance of a journal article today often depends on the performance of a new algorithm being assessed against the state-ofthe-art work on publicly available challenge datasets. Furthermore, the results are also important for the individuals scientific careers as well as the potential that algorithms can be translated into clinical practice. Yet, while the publication of papers in scientific journals and prestigious conferences, such as MICCAI, undergoes strict quality control, the design and organization of challenges do not. To investigate the effect of common practice, we have formed an international initiative dedicated to analyzing and improving a variety of aspects related to biomedical challenge design, execution and reporting [2]. In the first part of our abstract presentation at LABELS workshop, we are going to present some of the major pitfalls related to biomedical image analysis challenges today. Specifically, we will look at the following research questions: RQ1: How robust are challenge rankings? What is the effect of – the specific test cases used? – the specific metric variant(s) applied? – the rank aggregation method chosen (e.g. aggregation of metric values with the mean vs median)? ? Shared first/senior authors.
Learning Graph Weighted Models on Pictures
Philip Amortila
Graph Weighted Models (GWMs) have recently been proposed as a natural generalization of weighted automata over strings and trees to arbitrar… (see more)y families of labeled graphs (and hypergraphs). A GWM generically associates a labeled graph with a tensor network and computes a value by successive contractions directed by its edges. In this paper, we consider the problem of learning GWMs defined over the graph family of pictures (or 2-dimensional words). As a proof of concept, we consider regression and classification tasks over the simple Bars & Stripes and Shifting Bits picture languages and provide an experimental study investigating whether these languages can be learned in the form of a GWM from positive and negative examples using gradient-based methods. Our results suggest that this is indeed possible and that investigating the use of gradient-based methods to learn picture series and functions computed by GWMs over other families of graphs could be a fruitful direction.
Nash equilibria for integer programming games
João Pedro Pedroso
In this paper, we develop algorithmic approaches for a recently defined class of games, the integer programming games. Two general methods t… (see more)o approximate an equilibrium are presented and enhanced in order to improve their practical efficiency. Their performance is analysed through computational experiments in a knapsack game and a competitive lot-sizing game. To the best of our knowledge, this is the first time that equilibria computation methods for general integer programming games are build and computationally tested.
Negative eigenvalues of the Hessian in deep neural networks
Guillaume Alain
Pierre-Antoine Manzagol
The loss function of deep networks is known to be non-convex but the precise nature of this nonconvexity is still an active area of research… (see more). In this work, we study the loss landscape of deep networks through the eigendecompositions of their Hessian matrix. In particular, we examine how important the negative eigenvalues are and the benefits one can observe in handling them appropriately.
Nonlinear Weighted Finite Automata
Weighted finite automata (WFA) can expressively model functions defined over strings but are inherently linear models. Given the recent succ… (see more)esses of nonlinear models in machine learning, it is natural to wonder whether extending WFA to the nonlinear setting would be beneficial. In this paper, we propose a novel model of neural network based nonlinear WFA model (NL-WFA) along with a learning algorithm. Our learning algorithm is inspired by the spectral learning algorithm for WFA and relies on a nonlinear decomposition of the so-called Hankel matrix, by means of an auto-encoder network. The expressive power of NL-WFA and the proposed learning algorithm are assessed on both synthetic and real world data, showing that NL-WFA can lead to smaller model sizes and infer complex grammatical structures from data.
Optimizing Home Energy Management and Electric Vehicle Charging with Reinforcement Learning
Di Wu
Vincent Francois-Lavet
Benoit Boulet
Smart grids are advancing the management efficiency and security of power grids with the integration of energy storage, distributed controll… (see more)ers, and advanced meters. In particular, with the increasing prevalence of residential automation devices and distributed renewable energy generation, residential energy management is now drawing more attention. Meanwhile, the increasing adoption of electric vehicle (EV) brings more challenges and opportunities for smart residential energy management. This paper formalizes energy management for the residential home with EV charging as a Markov Decision Process and proposes reinforcement learning (RL) based control algorithms to address it. The objective of the proposed algorithms is to minimize the long-term operating cost. We further use a recurrent neural network (RNN) to model the electricity demand as a preprocessing step. Both the RNN prediction and latent representations are used as additional state features for the RL based control algorithms. Experiments on real-world data show that the proposed algorithms can significantly reduce the operating cost and peak power consumption compared to baseline control algorithms.
Streaming kernel regression with provably adaptive mean, variance, and regularization
Odalric-Ambrym Maillard
We consider the problem of streaming kernel regression, when the observations arrive sequentially and the goal is to recover the underlying … (see more)mean function, assumed to belong to an RKHS. The variance of the noise is not assumed to be known. In this context, we tackle the problem of tuning the regularization parameter adaptively at each time step, while maintaining tight confidence bounds estimates on the value of the mean function at each point. To this end, we first generalize existing results for finite-dimensional linear regression with fixed regularization and known variance to the kernel setup with a regularization parameter allowed to be a measurable function of past observations. Then, using appropriate self-normalized inequalities we build upper and lower bound estimates for the variance, leading to Bersntein-like concentration bounds. The later is used in order to define the adaptive regularization. The bounds resulting from our technique are valid uniformly over all observation points and all time steps, and are compared against the literature with numerical experiments. Finally, the potential of these tools is illustrated by an application to kernelized bandits, where we revisit the Kernel UCB and Kernel Thompson Sampling procedures, and show the benefits of the novel adaptive kernel tuning strategy.
Temporal Regularization for Markov Decision Process
Several applications of Reinforcement Learning suffer from instability due to high variance. This is especially prevalent in high dimensiona… (see more)l domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.