Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study
Mehil B. Shah
Mohammad Masudur Rahman
Context: Deep learning has achieved remarkable progress in various domains. However, like any software system, deep learning systems contain bugs, some of which can have severe impacts, as evidenced by crashes involving autonomous vehicles. Despite substantial advancements in deep learning techniques, little research has focused on reproducing deep learning bugs, which is an essential step for their resolution. Existing literature suggests that only 3% of deep learning bugs are reproducible, underscoring the need for further research. Objective: This paper examines the reproducibility of deep learning bugs. We identify edit actions and useful information that could improve the reproducibility of deep learning bugs. Method: First, we construct a dataset of 668 deep learning bugs from Stack Overflow and GitHub across three frameworks and 22 architectures. Second, out of the 668 bugs, we select 165 bugs using stratified sampling and attempt to determine their reproducibility. While reproducing these bugs, we identify edit actions and useful information for their reproduction. Third, we use the Apriori algorithm to identify useful information and edit actions required to reproduce specific types of bugs. Finally, we conduct a user study involving 22 developers to assess the effectiveness of our findings in real-life settings. Results: We successfully reproduced 148 out of 165 bugs attempted. We identified ten edit actions and five useful types of component information that can help us reproduce deep learning bugs. With the help of our findings, the developers were able to reproduce 22.92% more bugs and reduce their reproduction time by 24.35%. Conclusions: Our research addresses the critical issue of deep learning bug reproducibility. Practitioners and researchers can leverage our findings to improve deep learning bug reproducibility.
Are LLMs Robust for Spoken Dialogues?
Seyed Mahed Mousavi
Gabriel Roccabruna
Simone Alghisi
Massimo Rizzoli
Giuseppe Riccardi
Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks, including dialogue state tracking and end-to-end response generation. Nevertheless, most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations. Consequently, the robustness of the developed models to spoken interactions is unknown. In this work, we have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets. Due to the lack of proper spoken dialogue datasets, we have automatically transcribed a development set of spoken dialogues with a state-of-the-art ASR engine. We have characterized the ASR-error types and their distributions and simulated these errors in a large dataset of dialogues. We report the intrinsic (perplexity) and extrinsic (human evaluation) performance of fine-tuned GPT-2 and T5 models in two subtasks of response generation and dialogue state tracking, respectively. The results show that LLMs are not robust to spoken noise by default; however, fine-tuning/training such models on a proper dataset of spoken TODs can result in more robust performance.
A primer on the use of machine learning to distil knowledge from data in biological psychiatry.
Thomas P. Quinn
Jonathan L. Hess
Victoria S. Marshe
Michelle M. Barnett
Anne-Christin Hauschild
Malgorzata Maciukiewicz
Samar S. M. Elsheikh
Xiaoyu Men
Emanuel Schwarz
Michael S. Breen
Eric J. Barnett
Yanli Zhang-James
Mehmet Eren Ahsen
Han Cao
Junfang Chen
Jiahui Hou
Asif Salekin
Ping-I Lin
Kristin K. Nicodemus
Andreas Meyer-Lindenberg
Isabelle Bichindaritz
Stephen V. Faraone
Murray J. Cairns
Gaurav Pandey
Daniel J. Müller
Stephen J. Glatt
AITA: AI trustworthiness assessment
Bertrand Braunschweig
Stefan Buijsman
Faicel Chamroukhi
Fredrik Heintz
Juliette Mattioli
Maximilian Poretschkin
Bag of Tricks for Fully Test-Time Adaptation
Saypraseuth Mounsaveng
Florent Chiaroni
Malik Boudiaf
Ismail Ben Ayed
Fully Test-Time Adaptation (TTA), which aims at adapting models to data drifts, has recently attracted wide interest. Numerous tricks and techniques have been proposed to ensure robust learning on arbitrary streams of unlabeled data. However, assessing the true impact of each individual technique and obtaining a fair comparison still constitutes a significant challenge. To help consolidate the community’s knowledge, we present a categorization of selected orthogonal TTA techniques, including small batch normalization, stream rebalancing, reliable sample selection, and network confidence calibration. We meticulously dissect the effect of each approach on different scenarios of interest. Through our analysis, we shed light on trade-offs induced by those techniques between accuracy, the computational power required, and model complexity. We also uncover the synergy that arises when combining techniques and are able to establish new state-of-the-art results.
A Column Generation Scheme for Distributionally Robust Multi-Item Newsvendor Problems
Shanshan Wang
This paper studies a distributionally robust multi-item newsvendor problem, where the demand distribution is unknown but specified with a general event-wise ambiguity set. Using the event-wise affine decision rules, we can obtain a conservative approximation formulation of the problem, which can typically be further reformulated as a linear program. In order to efficiently solve the resulting large-scale linear program, we develop a column generation-based decomposition scheme and improve its computational efficiency by exploiting a special column selection strategy and stopping early based on a Karush-Kuhn-Tucker condition test. Focusing on the Wasserstein ambiguity set and the event-wise mean absolute deviation set, a computational study demonstrates both the computational efficiency of the proposed algorithm, which significantly outperforms a commercial solver and a Benders decomposition method, and the out-of-sample superiority of distributionally robust solutions relative to their sample average approximation counterparts. History: Accepted by Nicola Secomandi, Area Editor for Stochastic Models & Reinforcement Learning. Funding: This work was supported by the Natural Sciences and Engineering Research Council of Canada [492997-2016, RGPIN-2016-05208], the National Natural Science Foundation of China [71972012], Alliance de recherche numérique du Canada, and Canada Research Chairs [CRC-2018-00105]. It was also supported by Groupe d’études et de recherche en analyse des décisions (GERAD). Finally, this research was enabled in part by support provided by Digital Research Alliance of Canada (https://alliancecan.ca/en). Supplemental Material: The software that supports the findings of this study is available within the paper and its supplemental information (https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2022.0010) as well as from the IJOC GitHub software repository (https://github.com/INFORMSJoC/2022.0010).
The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/.
Dataset Difficulty and the Role of Inductive Bias
Devin Kwok
Nikhil Anand
Jonathan Frankle
Motivated by the goals of dataset pruning and defect identification, a growing body of methods has been developed to score individual examples within a dataset. These methods, which we call "example difficulty scores", are typically used to rank or categorize examples, but the consistency of rankings between different training runs, scoring methods, and model architectures is generally unknown. To determine how example rankings vary due to these random and controlled effects, we systematically compare different formulations of scores over a range of runs and model architectures. We find that scores largely share the following traits: they are noisy over individual runs of a model, strongly correlated with a single notion of difficulty, and reveal examples that range from being highly sensitive to insensitive to the inductive biases of certain model architectures. Drawing from statistical genetics, we develop a simple method for fingerprinting model architectures using a few sensitive examples. These findings guide practitioners in maximizing the consistency of their scores (e.g. by choosing appropriate scoring methods, number of runs, and subsets of examples), and establish comprehensive baselines for evaluating scores in the future.
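As an illustration of what "consistency of rankings between training runs" can mean in practice, the following minimal sketch computes a Spearman rank correlation between the difficulty scores produced by two runs. It is not the authors' method: the function names, the toy score values, and the no-ties assumption are all hypothetical choices made for this example.

```python
def ranks(scores):
    """Return the rank (0 = lowest score) of each example; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two tie-free score lists of equal length."""
    n = len(scores_a)
    ra, rb = ranks(scores_a), ranks(scores_b)
    # Sum of squared rank differences, then the classic closed-form formula.
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical difficulty scores for the same four examples from two runs.
run_1 = [0.1, 0.4, 0.35, 0.8]
run_2 = [0.2, 0.3, 0.5, 0.9]
rho = spearman(run_1, run_2)
```

A value of `rho` near 1 would indicate that the two runs rank examples almost identically, while values near 0 would suggest the per-run scores are too noisy to compare directly.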
Domain Generalization by Rejecting Extreme Augmentations
Masih Aminbeidokhti
Fidel A. Guerrero Peña
Heitor Rapela Medeiros
Thomas Dubail
Eric Granger
GPS-SSL: Guided Positive Sampling to Inject Prior Into Self-Supervised Learning
Aarash Feizi
Randall Balestriero
Arantxa Casanova
We propose Guided Positive Sampling Self-Supervised Learning (GPS-SSL), a general method to inject a priori knowledge into Self-Supervised Learning (SSL) positive sample selection. Current SSL methods leverage Data-Augmentations (DA) for generating positive samples and incorporating prior knowledge; an incorrect or too weak DA will drastically reduce the quality of the learned representation. GPS-SSL proposes instead to design a metric space where Euclidean distances become a meaningful proxy for semantic relationship. In that space, it is now possible to generate positive samples from nearest neighbor sampling. Any prior knowledge can now be embedded into that metric space independently from the employed DA. Owing to its simplicity, GPS-SSL is applicable to any SSL method, e.g. SimCLR or BYOL. A key benefit of GPS-SSL is in reducing the pressure in tailoring strong DAs. For example, GPS-SSL reaches 85.58% on Cifar10 with weak DA while the baseline only reaches 37.51%. We therefore move a step forward towards the goal of making SSL less reliant on DA. We also show that even when using strong DAs, GPS-SSL outperforms the baselines on under-studied domains. We evaluate GPS-SSL along with multiple baseline SSL methods on numerous downstream datasets from different domains when the models use strong or minimal data augmentations. We hope that GPS-SSL will open new avenues in studying how to inject a priori knowledge into SSL in a principled manner.
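The nearest-neighbor positive sampling idea described in this abstract can be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the GPS-SSL implementation: the function name, the parameter `k`, and the 2-D embedding values are hypothetical, and the embeddings are simply assumed to already live in a metric space where Euclidean distance tracks semantic similarity.

```python
import math
import random


def nearest_neighbor_positives(embeddings, k=1, rng=None):
    """For each anchor index, sample one positive index among its k nearest neighbors.

    `embeddings` is a list of equal-length float vectors (at least two points).
    """
    rng = rng or random.Random(0)
    positives = []
    for i, anchor in enumerate(embeddings):
        # Rank every other point by Euclidean distance to the anchor.
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: math.dist(anchor, embeddings[j]))
        # Pick uniformly among the k closest candidates.
        positives.append(rng.choice(others[:k]))
    return positives


# Toy embeddings forming two tight clusters (hypothetical values).
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
pairs = nearest_neighbor_positives(points, k=1)
```

With `k=1` each anchor is paired with its single closest neighbor, so positives stay within the same cluster; a larger `k` would trade some semantic tightness for more diverse positives.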
HalluciDet: Hallucinating RGB Modality for Person Detection Through Privileged Information
Heitor Rapela Medeiros
Fidel A. Guerrero Peña
Masih Aminbeidokhti
Thomas Dubail
Eric Granger
A powerful way to adapt a visual recognition model to a new domain is through image translation. However, common image translation approaches only focus on generating data from the same distribution as the target domain. Given a cross-modal application, such as pedestrian detection from aerial images, with a considerable shift in data distribution between infrared (IR) and visible (RGB) images, a translation focused on generation might lead to poor performance as the loss focuses on irrelevant details for the task. In this paper, we propose HalluciDet, an IR-RGB image translation model for object detection. Instead of focusing on reconstructing the original image on the IR modality, it seeks to reduce the detection loss of an RGB detector, and therefore avoids the need to access RGB data. This model produces a new image representation that enhances objects of interest in the scene and greatly improves detection performance. We empirically compare our approach against state-of-the-art methods for image translation and for fine-tuning on IR, and show that our HalluciDet improves detection accuracy in most cases by exploiting the privileged information encoded in a pre-trained RGB detector. Code: https://github.com/heitorrapela/HalluciDet.