Foutse Khomh

Associate Academic Member
Canada CIFAR AI Chair
Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Research Topics
Learning to Program
Reinforcement Learning
Deep Learning
Data Mining
Generative Models
Distributed Systems
Natural Language Processing

Biography

Foutse Khomh is a full professor of software engineering at Polytechnique Montréal. He holds a Canada CIFAR AI Chair in trustworthy machine learning software systems and an FRQ-IVADO Research Chair in software quality assurance for machine learning applications.

He received a PhD in software engineering from the Université de Montréal in 2011, with an award of excellence. He also received the CS-Can/Info-Can Outstanding Young Computer Science Researcher Prize in 2019. His research covers software maintenance and evolution, the engineering of machine learning systems, cloud engineering, and dependable and trustworthy AI/machine learning.

His work has received four ten-year Most Influential Paper awards and six Best/Distinguished Paper awards. He has also served on the steering committees of several conferences and meetings: SANER (as chair), MSR, PROMISE, ICPC (as chair), and ICSME (as vice-chair). He initiated and co-organized the Software Engineering for Machine Learning Applications (SEMLA) symposium and the Release Engineering (RELENG) workshop series.

He is a co-founder of the NSERC CREATE SE4AI project (A Training Program on the Development, Deployment, and Servicing of Artificial Intelligence-based Software Systems) and one of the principal investigators of the Dependable Explainable Learning (DEEL) project. He is also a co-founder of Quebec's initiative on trustworthy AI (Confiance IA Québec). He serves on the editorial boards of several international software engineering journals (including IEEE Software, EMSE, and JSEP) and is a senior member of the Institute of Electrical and Electronics Engineers (IEEE).

Current Students

Alumni Collaborator - Polytechnique
PhD - Polytechnique
PhD - Polytechnique
Postdoctorate - Polytechnique
Co-supervisor:
Research Master's - Polytechnique
Research Master's - Polytechnique
Research Master's - Polytechnique

Publications

Evaluating machine learning-driven intrusion detection systems in IoT: Performance and energy consumption
Saeid Jamshidi
Kawser Wazed Nafi
Amin Nikanjam
MLOps, LLMOps, FMOps, and Beyond
Chakkrit Tantithamthavorn
Fabio Palomba
Joselito Joey Chua
Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code
Vahid Majdinasab
Amin Nikanjam
Code auditing ensures that the developed code adheres to standards, regulations, and copyright protection by verifying that it does not contain code from protected sources. The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing. The dataset for training these models is mainly collected from publicly available sources. This raises the issue of intellectual property infringement as developers' codes are already included in the dataset. Therefore, auditing code developed using LLMs is challenging, as it is difficult to reliably assert if an LLM used during development has been trained on specific copyrighted codes, given that we do not have access to the training datasets of these models. Given the non-disclosure of the training datasets, traditional approaches such as code clone detection are insufficient for asserting copyright infringement. To address this challenge, we propose a new approach, TraWiC; a model-agnostic and interpretable method based on membership inference for detecting code inclusion in an LLM's training dataset. We extract syntactic and semantic identifiers unique to each program to train a classifier for detecting code inclusion. In our experiments, we observe that TraWiC is capable of detecting 83.87% of codes that were used to train an LLM. In comparison, the prevalent clone detection tool NiCad is only capable of detecting 47.64%. In addition to its remarkable performance, TraWiC has low resource overhead in contrast to pair-wise clone detection that is conducted during the auditing process of tools like CodeWhisperer reference tracker, across thousands of code snippets.
Editorial: Special Issue on Software Engineering and AI for Data Quality
Andreas Metzger
Phu Nguyen
Sagar Sen
This editorial summarizes the content of the Special Issue on Software Engineering and AI for Data Quality of the Journal of Data and Information Quality (JDIQ).
Leveraging Data Characteristics for Bug Localization in Deep Learning Programs
Ruchira Manke
Mohammad Wardat
Hridesh Rajan
Deep Learning (DL) is a class of machine learning algorithms that are used in a wide variety of applications. Like any software system, DL programs can have bugs. To support bug localization in DL programs, several tools have been proposed in the past. As most of the bugs that occur due to improper model structure known as structural bugs lead to inadequate performance during training, it is challenging for developers to identify the root cause and address these bugs. To support bug detection and localization in DL programs, in this paper, we propose Theia, which detects and localizes structural bugs in DL programs. Unlike the previous works, Theia considers the training dataset characteristics to automatically detect bugs in DL programs developed using two deep learning libraries, Keras and PyTorch. Since training the DL models is a time-consuming process, Theia detects these bugs at the beginning of the training process and alerts the developer with informative messages containing the bug's location and actionable fixes which will help them to improve the structure of the model. We evaluated Theia on a benchmark of 40 real-world buggy DL programs obtained from Stack Overflow. Our results show that Theia successfully localizes 57/75 structural bugs in 40 buggy programs, whereas NeuraLint, a state-of-the-art approach capable of localizing structural bugs before training localizes 17/75 bugs.
Continuously Learning Bug Locations
Paulina Stevia Nouwou Mindom
Léuson M. P. Da Silva
Amin Nikanjam
Automatically locating buggy changesets associated with bug reports is crucial in the software development process. Deep Learning (DL)-based techniques show promising results by leveraging structural information from the code and learning links between changesets and bug reports. However, since source code associated with changesets evolves, the performance of such models tends to degrade over time due to concept drift. Aiming to address this challenge, in this paper, we evaluate the potential of using Continual Learning (CL) techniques in multiple sub-tasks setting for bug localization (each of which operates on either stationary or non-stationary data), comparing it against a bug localization technique that leverages the BERT model, a deep reinforcement learning-based technique that leverages the A2C algorithm, and a DL-based function-level interaction model for semantic bug localization. Additionally, we enhanced the CL techniques by using logistic regression to identify and integrate the most significant bug-inducing factors. Our empirical evaluation across seven widely used software projects shows that CL techniques perform better than DL-based techniques by up to 61% in terms of Mean Reciprocal Rank (MRR), 44% in terms of Mean Average Precision (MAP), 83% in terms of top@1, 56% in terms of top@5, and 66% in terms of top@10 metrics in non-stationary setting. Further, we show that the CL techniques we studied are effective at localizing changesets relevant to a bug report while being able to mitigate catastrophic forgetting across the studied tasks and require up to 5x less computational effort during training. Our findings demonstrate the potential of adopting CL for bug localization in non-stationary settings, and we hope it helps to improve bug localization activities in Software Engineering using CL techniques.