Foutse Khomh

Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering

Mehil B. Shah

Mohammad Masudur Rahman

Deep learning (DL) techniques have achieved significant success in various software engineering tasks (e.g., code completion by Copilot). However, DL systems are prone to bugs from many sources, including training data. Existing literature suggests that bugs in training data are highly prevalent, but little research has focused on understanding their impacts on the models used in software engineering tasks. In this paper, we address this research gap through a comprehensive empirical investigation focused on three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based. Using state-of-the-art baselines, we compare the models trained on clean datasets with those trained on datasets with quality issues and without proper preprocessing. By analysing the gradients, weights, and biases from neural networks under training, we identify the symptoms of data quality and preprocessing issues. Our analysis reveals that quality issues in code data cause biased learning and gradient instability, whereas problems in text data lead to overfitting and poor generalisation of models. On the other hand, quality issues in metric data result in exploding gradients and model overfitting, and inadequate preprocessing exacerbates these effects across all three data types. Finally, we demonstrate the validity and generalizability of our findings using six new datasets. Our research provides a better understanding of the impact and symptoms of data bugs in software engineering datasets. Practitioners and researchers can leverage these findings to develop better monitoring systems and data-cleaning methods to help detect and resolve data bugs in deep learning systems.

2024-11-19

arXiv (preprint)

Fault Localization in Deep Learning-based Software: A System-level Approach

Mohammad Mehdi Morovati

Amin Nikanjam

Over the past decade, Deep Learning (DL) has become an integral part of our daily lives. This surge in DL usage has heightened the need for developing reliable DL software systems. Given that fault localization is a critical task in reliability assessment, researchers have proposed several fault localization techniques for DL-based software, primarily focusing on faults within the DL model. While the DL model is central to DL components, there are other elements that significantly impact the performance of DL components. As a result, fault localization methods that concentrate solely on the DL model overlook a large portion of the system. To address this, we introduce FL4Deep, a system-level fault localization approach considering the entire DL development pipeline to effectively localize faults across the DL-based systems. In an evaluation using 100 faulty DL scripts, FL4Deep outperformed four previous approaches in terms of accuracy for three out of six DL-related faults, including issues related to data (84%), mismatched libraries between training and deployment (100%), and loss function (69%). Additionally, FL4Deep demonstrated superior precision and recall in fault localization for five categories of faults including three mentioned fault types in terms of accuracy, plus insufficient training iteration and activation function.

2024-11-12

arXiv (preprint)

Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study

Doriane Olewicki

Léuson M. P. Da Silva

Suhaib Mujahid

Arezou Amini

Benjamin Mah

Marco Castelluccio

Sarra Habchi

Bram Adams

We conduct a large-scale empirical user study in a live setup to evaluate the acceptance of LLM-generated comments and their impact on the review process. This user study was performed in two organizations, Mozilla (which has its codebase available as open source) and Ubisoft (fully closed-source). Inside their usual review environment, participants were given access to RevMate, an LLM-based assistive tool suggesting generated review comments using an off-the-shelf LLM with Retrieval Augmented Generation to provide extra code and review context, combined with LLM-as-a-Judge, to auto-evaluate the generated comments and discard irrelevant cases. Based on more than 587 patch reviews provided by RevMate, we observed that 8.1% and 7.2%, respectively, of LLM-generated comments were accepted by reviewers in each organization, while 14.6% and 20.5% other comments were still marked as valuable as review or development tips. Refactoring-related comments are more likely to be accepted than Functional comments (18.2% and 18.6% compared to 4.8% and 5.2%). The extra time spent by reviewers to inspect generated comments or edit accepted ones (36/119), yielding an overall median of 43s per patch, is reasonable. The accepted generated comments are as likely to yield future revisions of the revised patch as human-written comments (74% vs 73% at chunk-level).

2024-11-11

arXiv (preprint)

Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study

Mehil B. Shah

Mohammad Masudur Rahman

2024-11-09

Empirical Software Engineering (published)

Towards Optimizing SQL Generation via LLM Routing

Mohammadhossein Malekpour

Nour Shaheen

Amine Mhedhbi

Text-to-SQL enables users to interact with databases through natural language, simplifying access to structured data. Although highly capable large language models (LLMs) achieve strong accuracy for complex queries, they incur unnecessary latency and dollar cost for simpler ones. In this paper, we introduce the first LLM routing approach for Text-to-SQL, which dynamically selects the most cost-effective LLM capable of generating accurate SQL for each query. We present two routing strategies (score- and classification-based) that achieve accuracy comparable to the most capable LLM while reducing costs. We design the routers for ease of training and efficient inference. In our experiments, we highlight a practical and explainable accuracy-cost trade-off on the BIRD dataset.

2024-11-06

arXiv (preprint)

Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study

Doriane Olewicki

Leuson Da Silva

Suhaib Mujahid

Arezou Amini

Benjamin Mah

Marco Castelluccio

Sarra Habchi

Bram Adams

2024-11-01

arXiv (published)

Tracing Optimization for Performance Modeling and Regression Detection

Kaveh Shahedi

Heng Li

Maxime Lamothe

Software performance modeling plays a crucial role in developing and maintaining software systems. A performance model analytically describes the relationship between the performance of a system and its runtime activities. This process typically examines various aspects of a system's runtime behavior, such as the execution frequency of functions or methods, to forecast performance metrics like program execution time. By using performance models, developers can predict expected performance and thereby effectively identify and address unexpected performance regressions when actual performance deviates from the model's predictions. One common and precise method for capturing performance behavior is software tracing, which involves instrumenting the execution of a program, either at the kernel level (e.g., system calls) or application level (e.g., function calls). However, due to the nature of tracing, it can be highly resource-intensive, making it impractical for production environments where resources are limited. In this work, we propose statistical approaches to reduce tracing overhead by identifying and excluding performance-insensitive code regions, particularly application-level functions, from tracing while still building accurate performance models that can capture performance degradations. By selecting an optimal set of functions to be traced, we can construct optimized performance models that achieve an R² score of up to 99% and, sometimes, outperform full tracing models (models using non-optimized tracing data), while significantly reducing the tracing overhead by more than 80% in most cases. Our optimized performance models can also capture performance regressions in our studied programs effectively, demonstrating their usefulness in real-world scenarios. Our approach is fully automated, making it ready to be used in production environments with minimal human effort.

2024-11-01

arXiv (published)

Doctoral Symposium Committee

Anthony Cleve

Christian Lange

Silvia Breu

Manar H. Alalfi

Mario Luca Bernardi

Cornelia Boldyreff

Marco D'Ambros

Simon Denier

Natalia Dragan

Ekwa Duala-Ekoko

Fausto Fasano

Adnane Ghannem

Carmine Gravino

Maen Hammad

Imed Hammouda

Salima Hassaine

Yue Jia

Zhen Ming (Jack) Jiang

Adam Kiezun

Jay Kothari

Jonathan Memaitre

Naouel Moha

Rocco Oliveto

Denys Poshyvanyk

Michele Risi

Giuseppe Scanniello

Bonita Sharif

Andrew Sutton

Anis Yousefi

Eugenio Zimeo

Manar H. Alalfi, Mario Luca Bernardi, Cornelia Boldyreff, Anthony Cleve, Marco D'Ambros, Simon Denier, Natalia Dragan, Ekwa Duala-Ekoko, Fausto Fasano, Adnane Ghannem, Carmine Gravino, Maen Hammad, Imed Hammouda, Salima Hassaine, Yue Jia, Zhen Ming Jiang, Foutse Khomh, Adam Kiezun, Jay Kothari, Jonathan Memaitre, Naouel Moha, Rocco Oliveto, Denys Poshyvanyk, Michele Risi, Giuseppe Scanniello, Bonita Sharif, Andrew Sutton, Anis Yousefi, Eugenio Zimeo

2024-10-28

2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW) (published)
