Foutse Khomh

Associate Academic Member
Canada CIFAR AI Chair
Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Research Topics
Learning to Program
Reinforcement Learning
Deep Learning
Data Mining
Generative Models
Distributed Systems
Natural Language Processing

Biography

Foutse Khomh is a Full Professor of Software Engineering at Polytechnique Montréal, a Canada CIFAR AI Chair in Trustworthy Machine Learning Software Systems, and an FRQ-IVADO Research Chair in Software Quality Assurance for Machine Learning Applications.

He received a PhD in Software Engineering from the Université de Montréal in 2011, with an Award of Excellence. He also received the CS-Can/Info-Can Outstanding Young Computer Science Researcher Prize in 2019. His research interests include software maintenance and evolution, the engineering of machine learning systems, cloud engineering, and dependable and trustworthy AI/ML.

His work has earned four Ten-Year Most Influential Paper awards and six Best/Distinguished Paper awards. He has served on the steering committees of several conferences: SANER (as chair), MSR, PROMISE, ICPC (as chair), and ICSME (as vice-chair). He initiated and co-organized the Software Engineering for Machine Learning Applications (SEMLA) symposium and the Release Engineering (RELENG) workshop series.

He is a co-founder of the NSERC CREATE SE4AI project (A Training Program on the Development, Deployment, and Servicing of Artificial Intelligence-based Software Systems) and one of the principal investigators of the Dependable Explainable Learning (DEEL) project. He is also a co-founder of Confiance IA Québec, Quebec's initiative on trustworthy AI. He serves on the editorial boards of several international software engineering journals (including IEEE Software, EMSE, and JSEP) and is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE).

Current Students

Postdoctoral Fellow - Polytechnique
PhD - Polytechnique
PhD - Polytechnique
Master's (Research) - Polytechnique
Master's (Research) - Polytechnique
Master's (Research) - Polytechnique
Master's (Research) - Polytechnique

Publications

Toward Debugging Deep Reinforcement Learning Programs with RLExplorer
Deep reinforcement learning (DRL) has shown success in diverse domains such as robotics, computer games, and recommendation systems. However, like any other software system, DRL-based software systems are susceptible to faults that pose unique challenges for debugging and diagnosis. These faults often result in unexpected behavior without explicit failures or error messages, making debugging difficult and time-consuming. Therefore, automating the monitoring and diagnosis of DRL systems is crucial to alleviate the burden on developers. In this paper, we propose RLExplorer, the first fault diagnosis approach for DRL-based software systems. RLExplorer automatically monitors training traces and runs diagnosis routines based on properties of the DRL learning dynamics to detect the occurrence of DRL-specific faults. It then logs the results of these diagnoses as warnings that cover theoretical concepts, recommended practices, and potential solutions to the identified faults. We conducted two sets of evaluations to assess RLExplorer. Our first evaluation of faulty DRL samples from Stack Overflow revealed that our approach can effectively diagnose real faults in 83% of the cases. Our second evaluation of RLExplorer with 15 DRL experts/developers showed that (1) RLExplorer could identify 3.6 times more defects than manual debugging and (2) RLExplorer is easily integrated into DRL applications.
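The trace-diagnosis idea described above can be sketched in a few lines. The function below is a hypothetical illustration, not RLExplorer's actual implementation; the function name and thresholds are assumptions. It scans logged episode returns and policy entropies from a training run and emits warnings for two common DRL-specific symptoms:

```python
# Hypothetical sketch in the spirit of the approach above (not
# RLExplorer's actual API; names and thresholds are assumptions).

def diagnose_trace(returns, entropies, window=10, entropy_floor=0.01):
    """Return a list of warning strings for a DRL training trace."""
    warnings = []
    if len(returns) >= 2 * window:
        early = sum(returns[:window]) / window
        late = sum(returns[-window:]) / window
        if late <= early:  # no improvement between early and late training
            warnings.append("stagnating returns: the agent may not be learning")
    if entropies and entropies[-1] < entropy_floor:  # near-deterministic policy
        warnings.append("entropy collapse: exploration was lost prematurely")
    return warnings
```

RLExplorer's actual routines are grounded in properties of DRL learning dynamics and, unlike this sketch, attach explanations and suggested fixes to each warning.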
Triage Software Update Impact via Release Notes Classification
Solomon Berhe
Vanessa Kan
Omhier Khan
Nathan Pader
Ali Zain Farooqui
Marc Maynard
Validation of Vigilance Decline Capability in a Simulated Test Environment: A Preliminary Step Towards Neuroadaptive Control
Andra Mahu
Amandeep Singh
Florian Tambon
Benoit Ouellette
Jean-François Delisle
Tanya Paul
Alexandre Marois
Philippe Doyon-Poulin
Vigilance is the ability to sustain attention. It is crucial in tasks such as piloting and driving. However, cognitive performance often falters with prolonged tasks, leading to reduced efficiency, slower reactions, and increased error likelihood. Identifying and addressing diminished vigilance is essential for enhancing driving safety. Neuro-physiological indicators have shown promising results for monitoring vigilance, paving the way for neuroadaptive control of vigilance. In fact, the collection of vigilance-related physiological markers could allow, using neuroadaptive intelligent systems, a real-time adaptation of tasks or the presentation of countermeasures to prevent errors that would ensue from such hypovigilant situations. Before reaching this goal, one must however collect valid data truly representative of hypovigilance which, in turn, can be used to develop prediction models of the vigilant state. This study serves as a proof of concept to assess the validity of a testbed to induce and measure vigilance decline through a simulated test environment, validating controlled induction, and evaluating its impact on participants' performance and subjective experiences. In total, 28 participants (10 females, 18 males) aged 18 to 35 (M = 23.75 years) were recruited. All participants held valid driving licenses and had corrected-to-normal vision. Data collection involved the Psychomotor Vigilance Task (PVT), the Karolinska Sleepiness Scale (KSS), and the Stanford Sleepiness Scale (SSS), along with specialized neuro-physiological equipment: Enobio 8 EEG, Empatica E4, Polar H10, and Tobii Nano Pro eye tracker. Notably, this study is limited to reporting the results of the PVT, KSS, and SSS, with the aim of assessing the effectiveness of the test setup. Participants self-reported their loss of vigilance by pressing a marker on the steering wheel.
To induce hypovigilance, participants drove an automatic car in a low-traffic, monotonous environment for 60 minutes, featuring empty fields of grass and desert, employing specific in-game procedures. The driving task included instructions for lane-keeping, indicator usage, and maintaining speeds of up to 80 km/h, with no traffic lights or stop signs present. Experiments were conducted before lunch, between 9 am and 12 pm, ensuring maximum participant alertness, with instructions to abstain from caffeine, alcohol, nicotine, and cannabis on the experiment day. Results showed that the mean reaction time (RT) increased from 257.7 ms before driving to 276.8 ms after driving, t = 4.82, p < .0001, d = -0.61, whereas the median RT changed from 246.07 ms to 260.89 ms, t = 3.58, p = 0.0013, d = -0.53, indicating a statistically significant alteration in participants' psychomotor performance. The mean number of minor lapses in attention (RT > 500 ms) on the PVT increased from 1.11 before driving to 1.67 after driving, but this change was not statistically significant, t = 1.66, p = 0.11, d = -0.28. The KSS showed a considerable rise in sleepiness, with a mean of 4.11 (rather alert) before driving increasing to 5.96 (some signs of sleepiness) after driving, t = 5.65, p < .0001, d = -1.04. Similarly, the SSS demonstrated an increase in mean values from 2.57 (able to concentrate) before driving to 3.96 (somewhat foggy) after driving, t = 8.42, p < .0001, d = -1.20, signifying an increased perception of sleepiness following the driving activity. Lastly, the mean time of the first marker press was 17:38 minutes (SD = 9:47 minutes), indicating that the self-reported loss of vigilance occurred during the first 30 minutes of the driving task. The observed increase in PVT reaction time aligns with the declined alertness reported on both the KSS and SSS, suggesting a consistent decline in vigilance and alertness post-driving.
In conclusion, the study underscores the effectiveness and validity of the simulated test environment in inducing vigilance decline, providing valuable insights into the impact on both objective and subjective measures. At the same time, the research sets the stage for exploring neuroadaptive control strategies, aiming to enhance task performance and safety. Ultimately, this will contribute to the development of a non-invasive artificial intelligence system capable of detecting vigilance states in extreme/challenging environments, e.g. for pilots and drivers.
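For readers unfamiliar with the statistics reported above, the before/after comparisons use paired t-tests with Cohen's d as the effect size. A minimal sketch of that computation (our illustration, not the authors' analysis code; note the paper reports d with the opposite sign convention) could look like:

```python
import math

# Paired t statistic and Cohen's d for matched before/after samples
# (illustrative; real analyses would also compute p-values from the
# t distribution with n - 1 degrees of freedom).

def paired_t_and_d(before, after):
    """Return (t, d) for paired before/after measurements."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    sd_diff = math.sqrt(var_diff)
    t = mean_diff / (sd_diff / math.sqrt(n))  # paired t statistic
    cohens_d = mean_diff / sd_diff            # standardized mean difference
    return t, cohens_d
```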
An empirical study of testing machine learning in the wild
Moses Openja
Armstrong Foundjem
Zhen Ming
Mouna Abidi
Ahmed E. Hassan
Recently, machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems. Due to their inductive nature, ensuring the quality of these systems remains a significant challenge for the research community. Unlike traditional software built deductively by writing explicit rules, ML/DL systems infer rules from training data. Recent research in ML/DL quality assurance has adapted concepts from traditional software testing, such as mutation testing, to improve reliability. However, it is unclear whether these proposed testing techniques are adopted in practice, or whether new testing strategies have emerged from real-world ML deployments; there is little empirical evidence about the testing strategies actually used. To fill this gap, we perform the first fine-grained empirical study on ML testing in the wild to identify the ML properties being tested, the testing strategies, and their implementation throughout the ML workflow. We conducted a mixed-methods study to understand ML software testing practices. We analyzed test files and cases from 11 open-source ML/DL projects on GitHub. Using open coding, we manually examined the testing strategies, tested ML properties, and implemented testing methods to understand their practical application in building and releasing ML/DL software systems. Our findings reveal several key insights: (1) the most common testing strategies, accounting for less than 40%, are Grey-box and White-box methods, such as Negative Testing, Oracle Approximation, and Statistical Testing; (2) a wide range of 17 ML properties are tested, of which only 20% to 30% are frequently tested, including Consistency, Correctness, and Efficiency; (3) Bias and Fairness is tested more in Recommendation systems, while Security & Privacy is tested more in Computer Vision (CV) systems, Application Platforms, and Natural Language Processing (NLP) systems.
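As an illustration of one of the strategies named above, an Oracle Approximation test checks a model's outputs against an approximate reference within a tolerance when an exact oracle is unavailable. The sketch below is our own example, not taken from the study:

```python
import math

# Illustrative "oracle approximation" test: outputs are compared to an
# approximate reference function within a tolerance.

def approx_oracle_test(model_fn, inputs, reference_fn, tol=1e-2):
    """Assert that model_fn stays within tol of reference_fn on inputs."""
    for x in inputs:
        deviation = abs(model_fn(x) - reference_fn(x))
        assert deviation <= tol, f"deviation {deviation:.2e} at input {x}"

# Example: a truncated Taylor series stands in for the "model" under
# test, with math.sin as the approximate oracle on small inputs.
approx_oracle_test(lambda x: x - x ** 3 / 6, [0.0, 0.1, 0.2], math.sin, tol=1e-3)
```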
Detection and evaluation of bias-inducing features in machine learning
Harnessing Predictive Modeling and Software Analytics in the Age of LLM-Powered Software Development (Invited Talk)
Bug Characterization in Machine Learning-based Systems
Mohammad Mehdi Morovati
Amin Nikanjam
Florian Tambon
Z. Jiang
The rapid growth of applying Machine Learning (ML) in different domains, especially in safety-critical areas, increases the need for reliable ML components, i.e., software components operating based on ML. Understanding the characteristics of bugs and the maintenance challenges in ML-based systems can help developers of these systems identify where to focus maintenance and testing efforts, by giving insights into the most error-prone components, the most common bugs, etc. In this paper, we investigate the characteristics of bugs in ML-based software systems and the difference between ML and non-ML bugs from the maintenance viewpoint. We extracted 447,948 GitHub repositories that used one of the three most popular ML frameworks, i.e., TensorFlow, Keras, and PyTorch. After multiple filtering steps, we selected the top 300 repositories with the highest number of closed issues. We manually investigated the extracted repositories to exclude non-ML-based systems. Our investigation involved a manual inspection of 386 sampled issues reported in the identified ML-based systems to indicate whether they affect ML components or not. Our analysis shows that nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components. Next, we thoroughly examined 109 identified ML bugs to identify their root causes and symptoms, and to calculate their required fixing time. The results also revealed that ML bugs have significantly different characteristics compared to non-ML bugs in terms of the complexity of bug-fixing (number of commits, changed files, and changed lines of code). Based on our results, fixing ML bugs is more costly and ML components are more error-prone, compared to non-ML bugs and non-ML components respectively. Hence, paying significant attention to the reliability of ML components is crucial in ML-based systems.
A Machine Learning Based Approach to Detect Machine Learning Design Patterns
Weitao Pan
Hironori Washizaki
Nobukazu Yoshioka
Yoshiaki Fukazawa
Yann-Gaël Guéhéneuc
As machine learning expands to various domains, the demand for reusable solutions to similar problems increases. Machine learning design patterns are reusable solutions to design problems of machine learning applications. They can significantly enhance programmers' productivity in programming that requires machine learning algorithms. Given the critical role of machine learning design patterns, their automated detection becomes equally vital. However, identifying design patterns manually can be time-consuming and error-prone. We propose an approach to detect their occurrences in Python files. Our approach uses the Abstract Syntax Tree (AST) of Python files to build a corpus of data and trains a refined Text-CNN model to automatically identify machine learning design patterns. We empirically validate our approach by conducting an exploratory study to detect four common machine learning design patterns: Embedding, Multilabel, Feature Cross, and Hashed Feature. We manually labeled 450 Python code files containing these design patterns from repositories of projects on GitHub. Our approach achieves accuracy values ranging from 80% to 92% for each of the four patterns.
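To hint at how AST-based feature extraction might work, the sketch below (our illustration, not the paper's implementation) uses Python's standard ast module to collect dotted call names from source code; such tokens are the kind of lexical features a Text-CNN classifier could consume, e.g., embedding-related API calls hinting at the Embedding pattern:

```python
import ast

# Collect dotted names of function calls (e.g., "tf.feature_column.
# embedding_column") from a Python source string, via its AST.

def extract_call_names(source):
    """Return the dotted names of all function calls in the source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func, parts = node.func, []
            while isinstance(func, ast.Attribute):  # unwind attribute chain
                parts.append(func.attr)
                func = func.value
            if isinstance(func, ast.Name):
                parts.append(func.id)
            names.append(".".join(reversed(parts)))
    return names
```

Parsing only requires valid syntax, so source referencing uninstalled libraries can still be mined this way.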
A large-scale exploratory study of android sports apps in the google play store
Bhagya Chembakottu
Heng Li
Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow
Florian Tambon
Amin Nikanjam
Le An
Giuliano Antoniol
Deep Learning (DL) frameworks are now widely used, simplifying the creation of complex models as well as their integration into various applications, even for non-DL experts. However, like any other software, they are prone to bugs. This paper deals with the subcategory of bugs named silent bugs: they lead to wrong behavior but do not cause system crashes or hangs, nor show an error message to the user. Such bugs are even more dangerous in DL applications and frameworks due to the "black-box" and stochastic nature of these systems (the end user cannot understand how the model makes decisions). This paper presents the first empirical study of Keras and TensorFlow silent bugs and their impact on users' programs. We extracted closed issues related to Keras from the TensorFlow GitHub repository. Out of the 1,168 issues that we gathered, 77 were reproducible silent bugs affecting users' programs. We categorized the bugs based on their effects on users' programs and the components where the issues occurred, using information from the issue reports. We then derived a threat level for each of the issues, based on the impact they had on users' programs. To assess the relevance of the identified categories and the impact scale, we conducted an online survey with 103 DL developers. The participants generally agreed with the significant impact of silent bugs in DL libraries and acknowledged our findings (i.e., the categories of silent bugs and the proposed impact scale). Finally, leveraging our analysis, we provide a set of guidelines to facilitate safeguarding against such bugs in DL frameworks.
Assessing the Security of GitHub Copilot's Generated Code - A Targeted Replication Study
Vahid Majdinasab
Michael Joshua Bishop
Shawn Rasheed
Amjed Tahir
Studying the characteristics of AIOps projects on GitHub
Roozbeh Aghili
Heng Li