Foutse Khomh

Associate Academic Member
Canada CIFAR AI Chair
Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Research Topics
Data Mining
Deep Learning
Distributed Systems
Generative Models
Learning to Program
Natural Language Processing
Reinforcement Learning

Biography

Foutse Khomh is a full professor of software engineering at Polytechnique Montréal, a Canada CIFAR AI Chair – Trustworthy Machine Learning Software Systems, and an FRQ-IVADO Research Chair in Software Quality Assurance for Machine Learning Applications. Khomh completed a PhD in software engineering at Université de Montréal in 2011, for which he received an Award of Excellence. He was also awarded a CS-Can/Info-Can Outstanding Young Computer Science Researcher Prize in 2019.

His research interests include software maintenance and evolution, machine learning systems engineering, cloud engineering, and dependable and trustworthy ML/AI. His work has received four Ten-year Most Influential Paper (MIP) awards and six Best/Distinguished Paper Awards. He has served on the steering committee of numerous organizations in software engineering, including SANER (chair), MSR, PROMISE, ICPC (chair), and ICSME (vice-chair). He initiated and co-organized Polytechnique Montréal’s Software Engineering for Machine Learning Applications (SEMLA) symposium and the RELENG (release engineering) workshop series.

Khomh co-founded the NSERC CREATE SE4AI: A Training Program on the Development, Deployment and Servicing of Artificial Intelligence-based Software Systems, and is a principal investigator for the DEpendable Explainable Learning (DEEL) project.

He also co-founded Confiance IA, a Quebec consortium focused on building trustworthy AI, and is on the editorial board of multiple international software engineering journals, including IEEE Software, EMSE and JSEP. He is a senior member of IEEE.

Current Students

Master's Research - Polytechnique Montréal
Master's Research - Polytechnique Montréal
PhD - Polytechnique Montréal
PhD - Polytechnique Montréal
Postdoctorate - Polytechnique Montréal
Master's Research - Polytechnique Montréal
PhD - Polytechnique Montréal

Publications

Reinforcement Learning Informed Evolutionary Search for Autonomous Systems Testing
Dmytro Humeniuk
Giuliano Antoniol
Evolutionary search-based techniques are commonly used for testing autonomous robotic systems. However, these approaches often rely on computationally expensive simulator-based models for test scenario evaluation. To improve the computational efficiency of the search-based testing, we propose augmenting the evolutionary search (ES) with a reinforcement learning (RL) agent trained using surrogate rewards derived from domain knowledge. In our approach, known as RIGAA (Reinforcement learning Informed Genetic Algorithm for Autonomous systems testing), we first train an RL agent to learn useful constraints of the problem and then use it to produce a certain part of the initial population of the search algorithm. By incorporating an RL agent into the search process, we aim to guide the algorithm towards promising regions of the search space from the start, enabling more efficient exploration of the solution space. We evaluate RIGAA on two case studies: maze generation for an autonomous ant robot and road topology generation for an autonomous vehicle lane keeping assist system. In both case studies, RIGAA converges faster to fitter solutions and produces a better test suite (in terms of average test scenario fitness and diversity). RIGAA also outperforms the state-of-the-art tools for vehicle lane keeping assist system testing, such as AmbieGen and Frenetic.
An empirical study of testing machine learning in the wild
Moses Openja
Armstrong Foundjem
Zhen Ming (Jack) Jiang
Zhenyou Jiang
Mouna Abidi
Ahmed E. Hassan
Background: Recently, machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems. Due to their inductive nature, ensuring the quality of these systems remains a significant challenge for the research community. Traditionally, software systems were constructed deductively, by writing explicit rules that govern the behavior of the system as program code. However, ML/DL systems infer rules from training data (i.e., they are generated inductively). Recent research in ML/DL quality assurance has adapted concepts from traditional software testing, such as mutation testing, to improve reliability. However, it is unclear if these proposed testing techniques are adopted in practice, or if new testing strategies have emerged from real-world ML deployments. There is little empirical evidence about the testing strategies. Aims: To fill this gap, we perform the first fine-grained empirical study on ML testing in the wild to identify the ML properties being tested, the testing strategies, and their implementation throughout the ML workflow. Method: We conducted a mixed-methods study to understand ML software testing practices. We analyzed test files and cases from 11 open-source ML/DL projects on GitHub. Using open coding, we manually examined the testing strategies, tested ML properties, and implemented testing methods to understand their practical application in building and releasing ML/DL software systems. Results: Our findings reveal several key insights: (1) The most common testing strategies, accounting for less than 40%, are Grey-box and White-box methods, such as Negative Testing, Oracle Approximation, and Statistical Testing. (2) A wide range of 17 ML properties are tested, out of which only 20% to 30% are frequently tested, including Consistency, Correctness, and Efficiency. (3) Bias and Fairness is more tested in Recommendation (6%) and CV (3.9%) systems, while Security & Privacy is tested in CV (2%), Application Platforms (0.9%), and NLP (0.5%). (4) We identified 13 types of testing methods, such as Unit Testing, Input Testing, and Model Testing. Conclusions: This study sheds light on the current adoption of software testing techniques and highlights gaps and limitations in existing ML testing practices.
DeepCodeProbe: Towards Understanding What Models Trained on Code Learn
Vahid Majdinasab
Amin Nikanjam
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs
Sylvain Kouemo Ngassom
Arghavan Moradi Dakhel
Florian Tambon
Design smells in multi-language systems and bug-proneness: a survival analysis
Mouna Abidi
Md Saidur Rahman
Moses Openja
A Context-Driven Approach for Co-Auditing Smart Contracts with The Support of GPT-4 code interpreter
Mohamed Salah Bouafif
Chen Zheng
Ilham Qasse
Ed Zulkoski
Mohammad Hamdaqa
The surge in the adoption of smart contracts necessitates rigorous auditing to ensure their security and reliability. Manual auditing, although comprehensive, is time-consuming and heavily reliant on the auditor's expertise. With the rise of Large Language Models (LLMs), there is growing interest in leveraging them to assist auditors in the auditing process (co-auditing). However, the effectiveness of LLMs in smart contract co-auditing is contingent upon the design of the input prompts, especially in terms of context description and code length. This paper introduces a novel context-driven prompting technique for smart contract co-auditing. Our approach employs three techniques for context scoping and augmentation, encompassing code scoping to chunk long code into self-contained code segments based on code inter-dependencies, assessment scoping to enhance context description based on the target assessment goal, thereby limiting the search space, and reporting scoping to force a specific format for the generated response. Through empirical evaluations on publicly available vulnerable contracts, our method demonstrated a detection rate of 96% for vulnerable functions, outperforming the native prompting approach, which detected only 53%. To assess the reliability of our prompting approach, manual analysis of the results was conducted by expert auditors from our partner, Quantstamp, a world-leading smart contract auditing company. The experts' analysis indicates that, in unlabeled datasets, our proposed approach enhances the proficiency of the GPT-4 code interpreter in detecting vulnerabilities.
GIST: Generated Inputs Sets Transferability in Deep Learning
Florian Tambon
Giuliano Antoniol
PathOCL: Path-Based Prompt Augmentation for OCL Generation with GPT-4
Seif Abukhalaf
Mohammad Hamdaqa
The rapid progress of AI-powered programming assistants, such as GitHub Copilot, has facilitated the development of software applications. These assistants rely on large language models (LLMs), which are foundation models (FMs) that support a wide range of tasks related to understanding and generating language. LLMs have demonstrated their ability to express UML model specifications using formal languages like the Object Constraint Language (OCL). However, the context size of the prompt is limited by the number of tokens an LLM can process. This limitation becomes significant as the size of UML class models increases. In this study, we introduce PathOCL, a novel path-based prompt augmentation technique designed to facilitate OCL generation. PathOCL addresses the limitations of LLMs, specifically their token processing limit and the challenges posed by large UML class models. PathOCL is based on the concept of chunking, which selectively augments the prompts with a subset of UML classes relevant to the English specification. Our findings demonstrate that PathOCL, compared to augmenting the complete UML class model (UML-Augmentation), generates a higher number of valid and correct OCL constraints using the GPT-4 model. Moreover, the average prompt size crafted using PathOCL significantly decreases when scaling the size of the UML class models.
Characterizing and Classifying Developer Forum Posts with their Intentions
Xingfang Wu
Eric Laufer
Heng Li
Santhosh Srinivasan
Jayden Luo
With the rapid growth of the developer community, the amount of posts on online technical forums has been growing rapidly, which poses difficulties for users to filter useful posts and find important information. Tags provide a concise feature dimension for users to locate their interested posts and for search engines to index the most relevant posts according to the queries. However, most tags are only focused on the technical perspective (e.g., program language, platform, tool). In most cases, forum posts in online developer communities reveal the author's intentions to solve a problem, ask for advice, share information, etc. The modeling of the intentions of posts can provide an extra dimension to the current tag taxonomy. By referencing previous studies and learning from industrial perspectives, we create a refined taxonomy for the intentions of technical forum posts. Through manual labeling and analysis on a sampled post dataset extracted from online forums, we understand the relevance between the constitution of posts (code, error messages) and their intentions. Furthermore, inspired by our manual study, we design a pre-trained transformer-based model to automatically predict post intentions. The best variant of our intention prediction framework, which achieves a Micro F1-score of 0.589, Top 1-3 accuracy of 62.6% to 87.8%, and an average AUC of 0.787, outperforms the state-of-the-art baseline approach. Our characterization and automated classification of forum posts regarding their intentions may help forum maintainers or third-party tool developers improve the organization and retrieval of posts on technical forums. We have released our annotated dataset and codes in our supplementary material package.
Towards a Reliable French Speech Recognition Tool for an Automated Diagnosis of Learning Disabilities
Jihene Rezgui
Félix Jobin
Younes Kechout
Christine Turgeon
Dyslexia, characterized by severe challenges in reading and spelling acquisition, presents a substantial barrier to proficient literacy, resulting in significantly reduced reading speed (2 to 3 times slower) and diminished text comprehension. With a prevalence ranging from 5% to 10% in the population, early intervention by speech and language pathologists (SLPs) can mitigate dyslexia's effects, but the diagnosis bottleneck impedes timely support. To address this, we propose leveraging machine learning tools to expedite the diagnosis process, focusing on automating phonetic transcription, a critical step in dyslexia assessment. We investigated the practicality of two model configurations utilizing Google's speech-to-text API with children's speech in evaluation scenarios and compared their results against transcriptions crafted by experts. The first configuration focuses on Google API's speech-to-text while the second integrates Phonemizer, a text-to-phonemes tool based on a dictionary. Analysis of the results indicates that our Google-Phonemizer model yields reading accuracies comparable to those computed from human-made transcriptions, offering promise for clinical application. These findings underscore the potential of AI-driven solutions to enhance dyslexia diagnosis efficiency, paving the way for improved accessibility to vital SLP services.
Mining Action Rules for Defect Reduction Planning
Khouloud Oueslati
Gabriel Laberge
Maxime Lamothe
Defect reduction planning plays a vital role in enhancing software quality and minimizing software maintenance costs. By training a black box machine learning model and "explaining" its predictions, explainable AI for software engineering aims to identify the code characteristics that impact maintenance risks. However, post-hoc explanations do not always faithfully reflect what the original model computes. In this paper, we introduce CounterACT, a Counterfactual ACTion rule mining approach that can generate defect reduction plans without black-box models. By leveraging action rules, CounterACT provides a course of action that can be considered as a counterfactual explanation for the class (e.g., buggy or not buggy) assigned to a piece of code. We compare the effectiveness of CounterACT with the original action rule mining algorithm and six established defect reduction approaches on 9 software projects. Our evaluation is based on (a) overlap scores between proposed code changes and actual developer modifications; (b) improvement scores in future releases; and (c) the precision, recall, and F1-score of the plans. Our results show that, compared to competing approaches, CounterACT's explainable plans achieve higher overlap scores at the release level (median 95%) and commit level (median 85.97%), and they offer a better trade-off between precision and recall (median F1-score 88.12%). Finally, we venture beyond planning and explore leveraging Large Language Models (LLMs) for generating code edits from our generated plans. Our results show that suggested LLM code edits supported by our plans are actionable and are more likely to pass relevant test cases than vanilla LLM code recommendations.
Generative AI in Software Engineering Must Be Human-Centered: The Copenhagen Manifesto
Daniel Russo
Sebastian Baltes
Niels van Berkel
Paris Avgeriou
Fabio Calefato
Beatriz Cabrero-Daniel
Gemma Catolino
Jürgen Cito
Neil Ernst
Thomas Fritz
Hideaki Hata
Reid Holmes
Maliheh Izadi
Mikkel Baun Kjærgaard
Grischa Liebel
Alberto Lluch Lafuente
Stefano Lambiase
Walid Maalej
Gail Murphy
Nils Brede Moe
Gabrielle O'Brien
Elda Paja
Mauro Pezzè
John Stouby Persson
Rafael Prikladnicki
Paul Ralph
Martin P. Robillard
Thiago Rocha Silva
Klaas-Jan Stol
Margaret-Anne Storey
Viktoria Stray
Paolo Tell
Christoph Treude
Bogdan Vasilescu