
Dmitriy Serdyuk

Alumni

Publications

Accounting for Variance in Machine Learning Benchmarks
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in light of this variance. We show a counter-intuitive result: adding more sources of variation to an imperfect estimator brings it closer to the ideal estimator, at a 51-fold reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
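One way to act on this is to treat every source of variation as random and compare the two algorithms over paired runs. The sketch below is a minimal illustration, assuming that `scores_a[i]` and `scores_b[i]` were obtained with the same randomly drawn data split, seed, and hyperparameter sample. It estimates the probability that A outperforms B, plus a percentile-bootstrap interval for that estimate. The function names and the rule that ties count as half a win are hypothetical choices, not taken from the paper.

```python
import random
from statistics import mean


def probability_of_improvement(scores_a, scores_b):
    """Estimate P(A > B) from paired runs (ties count as half a win).

    Hypothetical helper: each pair (scores_a[i], scores_b[i]) is assumed to
    share one random draw of data split, initialization and hyperparameters.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need the same non-zero number of paired runs")
    wins = [1.0 if a > b else 0.5 if a == b else 0.0
            for a, b in zip(scores_a, scores_b)]
    return mean(wins)


def bootstrap_interval(scores_a, scores_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for P(A > B), resampling whole pairs."""
    rng = random.Random(seed)
    n = len(scores_a)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        estimates.append(probability_of_improvement(
            [scores_a[i] for i in idx], [scores_b[i] for i in idx]))
    estimates.sort()
    lo = estimates[int(alpha / 2 * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

For example, `probability_of_improvement([0.9, 0.8, 0.85, 0.7], [0.8, 0.8, 0.80, 0.75])` gives 0.625 (two wins, one tie, one loss). Resampling whole pairs keeps A and B tied to the same draw of the sources of variation.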
Multimodal Audio-textual Architecture for Robust Spoken Language Understanding
Yongqiang Wang
Christian Fuegen
Anuj Kumar
Baiyang Liu
Edwin Simonnet
Sahar Ghannay
Nathalie Camelin
Tandem spoken language understanding (SLU) systems suffer from the so-called automatic speech recognition (ASR) error propagation problem. Additionally, as the ASR is optimized to extract only the linguistic content and not the semantics, relevant semantic cues might be left out of its transcripts. In this work, we propose a multimodal language understanding (MLU) architecture to mitigate these problems. Our solution is based on two compact unidirectional long short-term memory (LSTM) models that encode speech and text information. A fusion layer is also used to fuse audio and text embeddings. Two fusion strategies are explored: a simple concatenation of these embeddings and a cross-modal attention mechanism that learns the contribution of each modality. The first approach proved to be the optimal solution to robustly extract semantic information from audio-textual data. We found that attention is less effective at test time when the text modality is corrupted. Our model is evaluated on three SLU datasets, and robustness is tested using ASR outputs from three off-the-shelf ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem for all datasets.
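The two fusion strategies can be contrasted in a few lines. This is a minimal pure-Python sketch under simplifying assumptions: each modality is already reduced to one fixed-size embedding (standing in for the LSTM encoders' outputs), and the attention variant scores each modality with a single hypothetical `query` vector. That is a simplified modality-weighting form, not the paper's exact cross-modal attention layer.

```python
import math


def concat_fusion(audio_emb, text_emb):
    """Strategy 1: fuse by concatenating the two embeddings."""
    return list(audio_emb) + list(text_emb)


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(audio_emb, text_emb, query):
    """Strategy 2 (simplified): weight each modality by a relevance score.

    `query` is a hypothetical stand-in for learned parameters; each modality
    is scored by its dot product with the query, and the scores are softmax-
    normalized. Both embeddings must share the query's dimension.
    """
    if len(audio_emb) != len(text_emb) or len(query) != len(audio_emb):
        raise ValueError("embeddings and query must share one dimension")
    scores = [sum(q * x for q, x in zip(query, emb))
              for emb in (audio_emb, text_emb)]
    w_audio, w_text = softmax(scores)
    fused = [w_audio * a + w_text * t for a, t in zip(audio_emb, text_emb)]
    return fused, (w_audio, w_text)
```

Concatenation passes both embeddings on unchanged, while the attention variant returns a convex combination whose weights are computed from the inputs themselves.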
Twin Regularization for online speech recognition
Mirco Ravanelli
Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, which is known to be very helpful for robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models with context windows that include some future frames. This introduces a latency that depends on the number of look-ahead features employed. This paper explores a different approach, based on estimating the future rather than waiting for it. Our technique encourages the hidden representations of a unidirectional recurrent network to embed some useful information about the future. Inspired by a recently proposed technique called Twin Networks, we add a regularization term that forces forward hidden states to be as close as possible to cotemporal backward ones, computed by a "twin" neural network running backwards in time. Experiments conducted on a number of datasets, recurrent architectures, input features, and acoustic conditions show the effectiveness of this approach. One important advantage is that our method introduces no additional computation at test time compared to standard unidirectional recurrent networks.
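The regularization term can be sketched as a squared L2 distance summed over time steps. The sketch assumes the forward and backward hidden states have already been computed and aligned so that index `t` holds cotemporal states. `project` is a hypothetical optional map applied to the forward state before the comparison (identity by default), and `weight` is the coefficient with which the penalty would be added to the usual training loss. Only training uses the backward network, so dropping the penalty at inference leaves the unidirectional model unchanged.

```python
def twin_penalty(forward_states, backward_states, weight=1.0, project=None):
    """Squared L2 distance between cotemporal forward and backward states.

    forward_states[t] and backward_states[t] are hidden-state vectors for the
    same time step t. `project` (hypothetical) maps a forward state before
    the comparison; it defaults to the identity.
    """
    if len(forward_states) != len(backward_states):
        raise ValueError("sequences must have the same length")
    project = project or (lambda h: h)
    total = 0.0
    for h_f, h_b in zip(forward_states, backward_states):
        g = project(h_f)
        total += sum((x - y) ** 2 for x, y in zip(g, h_b))
    return weight * total
```

In training this would appear as something like `loss = task_loss + twin_penalty(h_fwd, h_bwd, weight=0.1)`, where `task_loss`, `h_fwd`, and `h_bwd` are hypothetical placeholders.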
MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation
Stylianos Ioannis Mimilakis
Gerald Schuller
Tuomas Virtanen
The monaural singing voice separation task focuses on predicting the singing voice from a single-channel music mixture signal. Current state-of-the-art (SOTA) results in monaural singing voice separation are obtained with deep-learning-based methods. In this work, we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and enhance it with Twin Networks, a technique to regularize a recurrent generative network using a backward-running copy of the network. We evaluate our method on the Demixing Secret Dataset and obtain improvements of 0.37 dB in signal-to-distortion ratio (SDR) and 0.23 dB in signal-to-interference ratio (SIR) compared to previous SOTA results.
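A masker-style separator estimates the source by filtering the mixture. As an assumption for illustration, here is the generic element-wise time-frequency masking step, with magnitude spectrograms stored as lists of frames and a mask of the same shape produced elsewhere by a network. The Denoiser stage and the Twin Networks regularizer are omitted, and `apply_mask` is a hypothetical name.

```python
def apply_mask(mixture_mag, mask):
    """Element-wise (time-frequency) masking of a magnitude spectrogram.

    mixture_mag and mask are lists of frames, each frame a list of
    frequency-bin values; both must have the same shape. The result is the
    estimated magnitude spectrogram of the target source (e.g. the voice).
    """
    if len(mixture_mag) != len(mask) or any(
            len(x) != len(m) for x, m in zip(mixture_mag, mask)):
        raise ValueError("mixture and mask must have the same shape")
    return [[x * m for x, m in zip(frame_x, frame_m)]
            for frame_x, frame_m in zip(mixture_mag, mask)]
```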
Towards End-to-end Spoken Language Understanding
Yongqiang Wang
Christian Fuegen
Anuj Kumar
Baiyang Liu
A spoken language understanding system is traditionally designed as a pipeline of several components. First, the audio signal is processed by an automatic speech recognizer to produce a transcription or n-best hypotheses. From the recognition results, a natural language understanding system classifies the text into structured data, such as domain, intent, and slots, for downstream consumers such as dialog systems and hands-free applications. These components are usually developed and optimized independently. In this paper, we present our study of an end-to-end learning system for spoken language understanding. With this unified approach, we can infer the semantic meaning directly from audio features without the intermediate text representation. This study showed that the trained model can achieve reasonably good results and that it can capture semantic attention directly from the audio features.
Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations
Deep networks have achieved impressive results across a variety of important tasks. However, a known weakness is their failure to perform well when evaluated on data that differ from the training distribution, even if these differences are very small, as is the case with adversarial examples. We propose Fortified Networks, a simple transformation of existing networks, which fortifies the hidden layers in a deep network by identifying when the hidden states are off the data manifold and mapping these hidden states back to parts of the data manifold where the network performs well. Our principal contribution is to show that fortifying these hidden states improves the robustness of deep networks. Our experiments (i) demonstrate improved robustness to standard adversarial attacks in both black-box and white-box threat models; (ii) suggest that our improvements are not primarily due to the gradient masking problem; and (iii) show the advantage of performing this fortification in the hidden layers instead of the input space.
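A fortified layer can be pictured as a denoising autoencoder inserted between hidden layers: corrupt the hidden state, encode and decode it, pass the reconstruction on, and add a reconstruction penalty to the training loss. The sketch below assumes plain linear encoder and decoder matrices and Gaussian corruption for brevity. All names are hypothetical, and the paper's exact autoencoder and loss terms may differ.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fortified_layer(hidden, encoder, decoder, noise_std=0.1, rng=None):
    """Denoising-autoencoder pass over one hidden-state vector.

    Returns (reconstruction, loss). The reconstruction replaces `hidden` as
    the input to the next layer; loss = ||decoder(encoder(hidden + noise)) - hidden||^2
    is meant to be added to the training objective.
    """
    rng = rng or random.Random(0)
    noisy = [h + rng.gauss(0.0, noise_std) for h in hidden]  # corrupt the hidden state
    code = matvec(encoder, noisy)                             # encode
    reconstruction = matvec(decoder, code)                    # decode back to hidden space
    loss = sum((r - h) ** 2 for r, h in zip(reconstruction, hidden))
    return reconstruction, loss
```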
Twin Networks: Matching the Future for Sequence Generation
Nan Rosemary Ke
Adam Trischler
Christopher Pal
We propose a simple technique for encouraging generative RNNs to plan ahead. We train a "backward" recurrent network to generate a given sequence in reverse order, and we encourage states of the forward model to predict cotemporal states of the backward model. The backward network is used only during training, and plays no role during sampling or inference. We hypothesize that our approach eases modeling of long-term dependencies by implicitly forcing the forward states to hold information about the longer-term future (as contained in the backward states). We show empirically that our approach achieves 9% relative improvement for a speech recognition task, and achieves significant improvement on a COCO caption generation task.
Twin Networks: Using the Future as a Regularizer
Nan Rosemary Ke
Christopher Pal
Being able to model long-term dependencies in sequential data, such as text, has been among the long-standing challenges of recurrent neural networks (RNNs). This issue is closely related to the absence of explicit planning in current RNN architectures: RNNs are trained only to predict the next token given the previous ones. In this paper, we introduce a simple way of encouraging RNNs to plan for the future. To accomplish this, we introduce an additional neural network that is trained to generate the sequence in reverse order, and we require closeness between the states of the forward RNN and the backward RNN that predict the same token. At each step, the states of the forward RNN are required to match the future information contained in the backward states. We hypothesize that this approach eases the modeling of long-term dependencies, thus helping to generate more globally consistent samples. The model trained with conditional generation for a speech recognition task achieved a 12% relative improvement (CER of 6.7 compared to a baseline of 7.6).
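The reported 12% is a relative reduction in character error rate (CER), computed from the two CER values given in the abstract:

```latex
\[
\frac{\mathrm{CER}_{\text{baseline}} - \mathrm{CER}_{\text{twin}}}{\mathrm{CER}_{\text{baseline}}}
= \frac{7.6 - 6.7}{7.6} = \frac{0.9}{7.6} \approx 0.118 \approx 12\%
\]
```

The absolute gain is therefore 0.9 CER points; the subscript "twin" is only a label for the regularized model.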
Theano: A Python framework for fast computation of mathematical expressions
Rami Al-Rfou
Amjad Almahairi
Christof Angermueller
Frédéric Bastien
Justin Bayer
Anatoly Belikov
Alexander Belopolsky
Josh Bleecher Snyder
Pierre-Luc Carrier
Paul Christiano
Myriam Côté
Yann N. Dauphin
Julien Demouth
Sander Dieleman
Ziye Fan
Mathieu Germain
Matt Graham
Balázs Hidasi
Arjun Jain
Kai Jia
Mikhail Korobov
Vivek Kulkarni
Pascal Lamblin
Eric Larsen
Sean Lee
Simon Lefrancois
Jesse A. Livezey
Cory Lorenz
Jeremiah Lowin
Qianli Ma
Robert T. McGibbon
Mehdi Mirza
Alberto Orlandi
Christopher Pal
Colin Raffel
Daniel Renshaw
Matthew Rocklin
Adriana Romero
Markus Roth
Peter Sadowski
John Salvatier
Jan Schlüter
John Schulman
Gabriel Schwartz
Iulian Vlad Serban
Samira Shabanian
Sigurd Spieckermann
S. Ramana Subramanyam
Gijs van Tulder
Sebastian Urban
Dustin J. Webb
Matthew Willson
Lijun Xue
Theano is a Python library that allows users to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers, especially in the machine learning community, and has shown steady performance improvements. Theano has been actively and continuously developed since 2008; multiple frameworks have been built on top of it, and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.
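The define, optimize, evaluate workflow described above can be illustrated with a toy symbolic graph. This is a hypothetical sketch, not Theano's API: expressions are nested tuples, the "optimizer" only folds constants and removes `x + 0` and `x * 1`, and "compiling" means optimizing once and returning a callable. Theano's real graph optimizations and code generation go far beyond this.

```python
# Toy expression graph: nodes are tuples ("var", name), ("const", value),
# ("add", left, right) or ("mul", left, right). Hypothetical, not Theano's API.


def var(name):
    return ("var", name)


def const(value):
    return ("const", value)


def add(a, b):
    return ("add", a, b)


def mul(a, b):
    return ("mul", a, b)


def optimize(node):
    """Rewrite the graph bottom-up: fold constants, drop x + 0 and x * 1."""
    kind = node[0]
    if kind in ("var", "const"):
        return node
    a, b = optimize(node[1]), optimize(node[2])
    if a[0] == "const" and b[0] == "const":
        return const(a[1] + b[1] if kind == "add" else a[1] * b[1])
    identity = const(0.0 if kind == "add" else 1.0)
    if b == identity:
        return a
    if a == identity:
        return b
    return (kind, a, b)


def evaluate(node, env):
    """Evaluate the graph, taking variable values from env."""
    kind = node[0]
    if kind == "var":
        return env[node[1]]
    if kind == "const":
        return node[1]
    left, right = evaluate(node[1], env), evaluate(node[2], env)
    return left + right if kind == "add" else left * right


def compile_function(expr):
    """Optimize once, then return a callable over the optimized graph."""
    graph = optimize(expr)
    return lambda **env: evaluate(graph, env)
```

For instance, `optimize(add(mul(var("x"), const(1.0)), mul(const(2.0), const(3.0))))` reduces to `("add", ("var", "x"), ("const", 6.0))`, so the compiled function does one addition per call.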
Task Loss Estimation for Sequence Prediction