Portrait of Pascal Vincent

Pascal Vincent

Core Industry Member
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Facebook AI Research (FAIR) Montréal
Research Topics
Representation Learning
Deep Learning

Biography

Pascal Vincent is a research scientist at Meta (FAIR, Fundamental AI Research), an associate professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal, a founding member of Mila – Quebec Artificial Intelligence Institute, and an associate fellow of the Canadian Institute for Advanced Research (CIFAR, Learning in Machines & Brains program).

His research on the principles and algorithms of representation learning has led him to develop several foundational ideas that have become key ingredients in the success of deep learning methods. Among his most influential work, he is a co-author of the seminal paper on neural language models, "A Neural Probabilistic Language Model" (Bengio et al., 2003), which laid the groundwork for all language models based on artificial neural networks. His work on denoising autoencoders (Vincent et al., 2008, 2010) was the first to propose the pretext task of filling in artificially introduced blanks in order to learn useful representations in any modality, a precursor of what is now called self-supervised learning. In 2011 he developed the denoising score matching principle (P. Vincent, "A connection between score matching and denoising autoencoders", Neural Computation, 2011), which is now routinely used to train diffusion-based generative models. His current research focuses on new theories and algorithms for representation learning to enable robust out-of-distribution generalization.
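For readers unfamiliar with the technique, a standard statement of the denoising score matching objective (the usual textbook form, not a formula taken from this page) trains a score model \(\psi(\cdot;\theta)\) to match the score of a noising distribution \(q_\sigma(\tilde{x} \mid x)\):
\[
J_{\mathrm{DSM}}(\theta) \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}},\; \tilde{x} \sim q_\sigma(\tilde{x} \mid x)} \left[ \tfrac{1}{2} \left\lVert \psi(\tilde{x}; \theta) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\rVert^2 \right].
\]
For Gaussian corruption \(q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)\) the target score is \((x - \tilde{x})/\sigma^2\), so minimizing this objective amounts to learning to denoise, which is the training signal behind diffusion-based generative models.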

Current Students

PhD - UdeM
Principal supervisor:
Collaborating Alumni

Publications

Enhanced Biomedical Knowledge Discovery From Unstructured Text Using Contextual Embeddings
Extracting knowledge from large, unstructured text corpora presents a challenge. Recently, authors have utilized unsupervised, static word embeddings to uncover "latent knowledge" contained within domain-specific scientific corpora. Here semantic-similarity measures between representations of concepts, objects or entities were used to predict relationships, which were later verified using physical methods. Static language models have recently been surpassed at most downstream tasks by massively pre-trained, contextual language models like BERT. Some have postulated that contextualized embeddings potentially yield word representations superior to static ones for knowledge-discovery purposes. In an effort to address this question, two biomedically-trained BERT models (BioBERT, SciBERT) were used to encode n = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus (Coronavirus Open Research Dataset). The n representations for the words of interest were subsequently extracted and then aggregated to yield static-equivalent word representations. These words belonged to the vocabularies of intrinsic benchmarking tools for the biomedical domain (Bio-SimVerb and Bio-SimLex), which assess quality of word representations using semantic-similarity and relatedness measures. Using intrinsic benchmarking tasks, the feasibility of using contextualized word representations for knowledge discovery tasks can be assessed: word representations that better encode described reality are expected to perform better (i.e. closer to domain experts). As postulated, BERT embeddings outperform static counterparts.
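A minimal sketch of the aggregation step described in this abstract: contextual token embeddings of a target word are averaged over the n sentences that contain it to yield a static-equivalent vector. It assumes the Hugging Face transformers library and a generic BERT checkpoint; the checkpoint and function names are illustrative, not taken from the paper.
    import torch
    from transformers import AutoTokenizer, AutoModel

    # Illustrative checkpoint; the paper uses biomedical variants (BioBERT, SciBERT).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    def static_equivalent_embedding(word, sentences):
        """Average the contextual embeddings of `word` over sentences that contain it."""
        vectors = []
        word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        for sent in sentences:
            enc = tokenizer(sent, return_tensors="pt", truncation=True)
            with torch.no_grad():
                hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
            ids = enc["input_ids"][0].tolist()
            # Locate the subword span of `word` and average its token embeddings.
            for i in range(len(ids) - len(word_ids) + 1):
                if ids[i:i + len(word_ids)] == word_ids:
                    vectors.append(hidden[i:i + len(word_ids)].mean(dim=0))
                    break
        return torch.stack(vectors).mean(dim=0) if vectors else None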
Accounting for Variance in Machine Learning Benchmarks
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show a counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator at a 51 times reduction in compute cost. Building on these results, we study the error rate of detecting improvements, on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
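As a toy illustration of why averaging over several sources of variation matters when comparing two pipelines, the sketch below runs hypothetical pipelines over many seeds and reports the performance gap together with its standard error; train_and_eval is a synthetic stand-in, not a real benchmark or the paper's estimator.
    import numpy as np

    # Hypothetical evaluation hook: trains a pipeline with the given random seed
    # (standing in for data split, initialization and augmentation randomness)
    # and returns a test metric.
    def train_and_eval(algo, seed):
        rng = np.random.default_rng(seed)
        base = {"A": 0.80, "B": 0.79}[algo]        # fake "true" performance
        return base + rng.normal(scale=0.02)        # fake run-to-run variance

    def compare(algo_a, algo_b, n_seeds=20):
        """Compare two pipelines while averaging over sources of variation."""
        diffs = np.array([train_and_eval(algo_a, s) - train_and_eval(algo_b, s)
                          for s in range(n_seeds)])
        mean, sem = diffs.mean(), diffs.std(ddof=1) / np.sqrt(n_seeds)
        return mean, sem  # report the gap with its uncertainty, not a single run

    print(compare("A", "B"))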
Cooperative Semi-Supervised Transfer Learning of Machine Reading Comprehension
Pretrained language models have significantly improved the performance of downstream language understanding tasks, including extractive question answering, by providing high-quality contextualized word embeddings. However, training question answering models still requires large amounts of annotated data for specific domains. In this work, we propose a cooperative, self-play learning framework, REGEX, for automatically generating more non-trivial question-answer pairs to improve model performance. REGEX is built upon a masked answer extraction task with an interactive learning environment containing an answer entity REcognizer, a question Generator, and an answer EXtractor. Given a passage with a masked entity, the generator generates a question around the entity, and the extractor is trained to extract the masked entity with the generated question and raw texts. The framework allows the training of question generation and answering models on any text corpora without annotation. We further leverage a reinforcement learning technique to reward generating high-quality questions and to improve the answer extraction model's performance. Experiment results show that REGEX outperforms the state-of-the-art (SOTA) pretrained language models and transfer learning approaches on standard question-answering benchmarks, and yields the new SOTA performance under given model size and transfer learning settings.
Tackling Situated Multi-Modal Task-Oriented Dialogs with a Single Transformer Model
The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 challenge aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e. visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems usually tackle each subtask separately, we propose a jointly learned encoder-decoder that performs all four subtasks at once for efficiency. Moreover, we handle the multi-modality of the challenge by representing visual objects as special tokens whose joint embedding is learned via auxiliary tasks. This approach won the MM-Coref and response retrieval subtasks and was nominated runner-up for the remaining subtasks using a single unified model. In particular, our model achieved 81.5% MRR, 71.2% R@1, 95.0% R@5, 98.2% R@10, and 1.9 mean rank in the response retrieval task, setting a high bar for the state-of-the-art result in the SIMMC 2.0 track of the Dialog Systems Technology Challenge 10 (DSTC10).
Implicit Regularization in Deep Learning: A View from Function Space
We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a possible regularization effect induced by a dynamical alignment of the neural tangent features introduced by Jacot et al., along a small number of task-relevant directions. By extrapolating a new analysis of Rademacher complexity bounds in linear models, we propose and study a new heuristic complexity measure for neural networks which captures this phenomenon, in terms of sequences of tangent kernel classes along the learning trajectories.
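For context, the neural tangent features and tangent kernel mentioned above are usually defined as follows (standard definitions from Jacot et al., stated here for convenience):
\[
\phi_\theta(x) \;=\; \nabla_\theta f_\theta(x), \qquad k_\theta(x, x') \;=\; \big\langle \phi_\theta(x),\, \phi_\theta(x') \big\rangle,
\]
so that, to first order around the parameters \(\theta\), the network behaves like a linear model with feature map \(\phi_\theta\), and it is the alignment of these features with task-relevant directions that the proposed complexity measure tracks.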
Stochastic Neural Network with Kronecker Flow
Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and better than the baselines.
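A small numerical sketch (background material, not code from the paper) of the Kronecker-structured linear map such a flow builds on: applying A ⊗ B to vec(X) is the same as computing B X Aᵀ, and the log-determinant factorizes over the two small matrices, which is what keeps the parameterization cheap in high dimensions.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 3, 4
    A = rng.normal(size=(n, n))          # small invertible factor
    B = rng.normal(size=(m, m))          # small invertible factor
    X = rng.normal(size=(m, n))          # noise reshaped as a matrix

    # (A kron B) vec(X) == vec(B X A^T), with vec = column stacking
    lhs = np.kron(A, B) @ X.flatten(order="F")
    rhs = (B @ X @ A.T).flatten(order="F")
    assert np.allclose(lhs, rhs)

    # log|det(A kron B)| = m*log|det A| + n*log|det B|: the flow's log-det is cheap
    logdet_full = np.linalg.slogdet(np.kron(A, B))[1]
    logdet_fact = m * np.linalg.slogdet(A)[1] + n * np.linalg.slogdet(B)[1]
    assert np.allclose(logdet_full, logdet_fact)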
Stochastic Hamiltonian Gradient Methods for Smooth Games
The success of adversarial formulations in machine learning has brought renewed motivation for smooth games. In this work, we focus on the class of stochastic Hamiltonian methods and provide the first convergence guarantees for certain classes of stochastic smooth games. We propose a novel unbiased estimator for the stochastic Hamiltonian gradient descent (SHGD) and highlight its benefits. Using tools from the optimization literature we show that SHGD converges linearly to the neighbourhood of a stationary point. To guarantee convergence to the exact solution, we analyze SHGD with a decreasing step-size and we also present the first stochastic variance reduced Hamiltonian method. Our results provide the first global non-asymptotic last-iterate convergence guarantees for the class of stochastic unconstrained bilinear games and for the more general class of stochastic games that satisfy a "sufficiently bilinear" condition, notably including some non-convex non-concave problems. We supplement our analysis with experiments on stochastic bilinear and sufficiently bilinear games, where our theory is shown to be tight, and on simple adversarial machine learning formulations.
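As background, a toy deterministic sketch of Hamiltonian gradient descent on an unconstrained bilinear game (the simplest setting covered by the guarantees above; this is not the paper's stochastic estimator): the update descends on H(x, y) = ½‖v(x, y)‖², where v is the game's gradient field.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    A = rng.normal(size=(d, d))
    x, y = rng.normal(size=d), rng.normal(size=d)

    # Bilinear game min_x max_y x^T A y. Its gradient field is
    #   v(x, y) = (A y, -A^T x), and the Hamiltonian is H = 0.5 * ||v||^2.
    # Plain simultaneous gradient descent-ascent cycles or diverges here;
    # descending on H converges to the equilibrium (0, 0).
    eta = 0.01
    for _ in range(2000):
        grad_H_x = A @ A.T @ x      # d/dx of 0.5*||A^T x||^2
        grad_H_y = A.T @ A @ y      # d/dy of 0.5*||A y||^2
        x -= eta * grad_H_x
        y -= eta * grad_H_y

    print(np.linalg.norm(x), np.linalg.norm(y))  # both shrink toward 0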
An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation
Batch normalization has been widely used to improve optimization in deep neural networks. While the uncertainty in batch statistics can act as a regularizer, using these dataset statistics specific to the training set impairs generalization in certain tasks. Recently, alternative methods for normalizing feature activations in neural networks have been proposed. Among them, group normalization has been shown to yield similar, in some domains even superior performance to batch normalization. All these methods utilize a learned affine transformation after the normalization operation to increase representational power. Methods used in conditional computation define the parameters of these transformations as learnable functions of conditioning information. In this work, we study whether and where the conditional formulation of group normalization can improve generalization compared to conditional batch normalization. We evaluate performances on the tasks of visual question answering, few-shot learning, and conditional image generation.
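A minimal sketch of a conditional group normalization layer of the kind studied here, assuming PyTorch; the per-channel scale and shift are predicted from a conditioning vector rather than learned as fixed parameters (module and layer names are illustrative, not the paper's implementation).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConditionalGroupNorm(nn.Module):
        """Group normalization whose affine parameters depend on conditioning information."""
        def __init__(self, num_groups, num_channels, cond_dim):
            super().__init__()
            self.num_groups = num_groups
            # Predict per-channel gamma and beta from the conditioning vector.
            self.to_affine = nn.Linear(cond_dim, 2 * num_channels)

        def forward(self, x, cond):
            # Normalize without any affine transform, then apply the conditional one.
            h = F.group_norm(x, self.num_groups)
            gamma, beta = self.to_affine(cond).chunk(2, dim=-1)
            gamma = gamma[:, :, None, None]  # broadcast over spatial dimensions
            beta = beta[:, :, None, None]
            return (1.0 + gamma) * h + beta  # start close to the identity transform

    # Example: 8 groups over 32 channels, conditioned on a 16-dimensional embedding.
    layer = ConditionalGroupNorm(8, 32, 16)
    out = layer(torch.randn(4, 32, 7, 7), torch.randn(4, 16))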
Iteratively unveiling new regions of interest in Deep Learning models
Tess Berthier
Lisa Di Jorio
Recent advances in deep learning have been transforming the landscape in many domains. However, understanding the predictions of a deep network remains a challenge, which is especially sensitive in health care domains where interpretability is key. Techniques that rely on saliency maps, highlighting the regions of an image that influence the classifier's decision the most, are often used for that purpose. However, gradient fluctuations make saliency maps noisy and thus difficult to interpret at a human level. Moreover, models tend to focus on one particular influential region of interest (ROI) in the image, even though other regions might be relevant for the decision. We propose a new framework that refines those saliency maps to generate segmentation masks over the ROI on the initial image. In a second contribution, we propose to apply those masks over the original inputs, then evaluate our classifier on the masked inputs to identify previously overlooked ROI. This iterative procedure allows us to emphasize new regions of interest by extracting meaningful information from the saliency maps.
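A rough sketch of the kind of iterative masking loop the abstract describes, with classify and saliency as placeholders for any classifier and any saliency-map method; the quantile thresholding used here is a crude stand-in for the paper's mask-refinement step, not its actual procedure.
    import numpy as np

    def iterative_roi_discovery(image, classify, saliency, n_rounds=3, threshold=0.8):
        """Repeatedly mask out the most salient region and re-run the classifier
        to surface regions of interest the model would otherwise overlook."""
        current = image.copy()
        masks = []
        for _ in range(n_rounds):
            smap = saliency(current)                     # per-pixel importance scores
            mask = smap >= np.quantile(smap, threshold)  # crude segmentation of the top ROI
            masks.append(mask)
            current = np.where(mask, 0.0, current)       # hide this ROI from the model
            _ = classify(current)                        # decision now driven by other regions
        return masks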
Theano: A Python framework for fast computation of mathematical expressions
Rami Al-Rfou
Amjad Almahairi
Christof Angermüller
Frédéric Bastien
Justin S. Bayer
A. Belikov
A. Belopolsky
J. Bergstra
Josh Bleecher Snyder
Paul F. Christiano
Marc-Alexandre Côté
Myriam Côté
Julien Demouth
Sander Dieleman
Mélanie Ducoffe
Ziye Fan
Mathieu Germain
Ian J. Goodfellow
Matthew Graham
Balázs Hidasi
Arjun Jain
Sébastien Jean
Kai Jia
Mikhail V. Korobov
Vivek Kulkarni
Pascal Lamblin
Eric P. Larsen
S. Lee
Simon-mark Lefrancois
J. Livezey
Cory R. Lorenz
Jeremiah L. Lowin
Qianli M. Ma
R. McGibbon
Mehdi Mirza
Alberto Orlandi
Colin Raffel
Daniel Renshaw
Matthew David Rocklin
Markus Dr. Roth
Peter Sadowski
John Salvatier
Jan Schlüter
John D. Schulman
Gabriel Schwartz
Iulian V. Serban
Samira Shabanian
Sigurd Spieckermann
S. Subramanyam
Gijs van Tulder
Joseph P. Turian
Sebastian Urban
Dustin J. Webb
M. Willson
Lijun Xue
Theano is a Python library that allows one to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers, especially in the machine learning community, and has shown steady performance improvements. Theano has been actively and continuously developed since 2008; multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.
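To illustrate the define-optimize-evaluate workflow the abstract refers to, here is a minimal example of classic Theano usage (a basic sketch, not taken from the article).
    import numpy as np
    import theano
    import theano.tensor as T

    # Define a symbolic expression over multi-dimensional arrays...
    x = T.dmatrix("x")
    w = T.dvector("w")
    y = T.nnet.sigmoid(T.dot(x, w))   # element-wise logistic of a matrix-vector product
    g = T.grad(y.sum(), w)            # symbolic gradient with respect to w

    # ...let Theano optimize and compile it (for CPU or GPU)...
    f = theano.function([x, w], [y, g])

    # ...then evaluate it on ordinary NumPy arrays.
    out, grad = f(np.random.randn(3, 4), np.random.randn(4))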