Foutse Khomh

Associate Academic Member
Canada CIFAR AI Chair
Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Research Topics
Data Mining
Deep Learning
Distributed Systems
Generative Models
Learning to Program
Natural Language Processing
Reinforcement Learning

Biography

Foutse Khomh is a full professor of software engineering at Polytechnique Montréal, a Canada CIFAR AI Chair – Trustworthy Machine Learning Software Systems, and an FRQ-IVADO Research Chair in Software Quality Assurance for Machine Learning Applications. Khomh completed a PhD in software engineering at Université de Montréal in 2011, for which he received an Award of Excellence. He was also awarded a CS-Can/Info-Can Outstanding Young Computer Science Researcher Prize in 2019.

His research interests include software maintenance and evolution, machine learning systems engineering, cloud engineering, and dependable and trustworthy ML/AI. His work has received four Ten-Year Most Influential Paper (MIP) Awards and six Best/Distinguished Paper Awards. He has served on the steering committees of numerous organizations in software engineering, including SANER (chair), MSR, PROMISE, ICPC (chair), and ICSME (vice-chair). He initiated and co-organized Polytechnique Montréal’s Software Engineering for Machine Learning Applications (SEMLA) symposium and the RELENG (release engineering) workshop series.

Khomh co-founded the NSERC CREATE SE4AI: A Training Program on the Development, Deployment and Servicing of Artificial Intelligence-based Software Systems, and is a principal investigator for the DEpendable Explainable Learning (DEEL) project.

He also co-founded Confiance IA, a Quebec consortium focused on building trustworthy AI, and is on the editorial board of multiple international software engineering journals, including IEEE Software, EMSE and JSEP. He is a senior member of IEEE.

Current Students

Master's Research - Polytechnique Montréal
PhD - Polytechnique Montréal
PhD - Polytechnique Montréal
Master's Research - Polytechnique Montréal
Postdoctorate - Polytechnique Montréal
Co-supervisor :
Postdoctorate - Polytechnique Montréal
Master's Research - Polytechnique Montréal
PhD - Polytechnique Montréal
Master's Research - Polytechnique Montréal

Publications

Introducing v0.5 of the AI Safety Benchmark from MLCommons
Bertie Vidgen
Adarsh Agrawal
Ahmed M. Ahmed
Victor Akinwande
Namir Al-nuaimi
Najla Alfaraj
Elie Alhajjar
Lora Aroyo
Trupti Bavalatti
Borhane Blili-Hamelin
K. Bollacker
Rishi Bommasani
Marisa Ferrara Boston
Siméon Campos
Kal Chakra
Canyu Chen
Cody Coleman
Zacharie Delpierre Coudert
Leon Strømberg Derczynski
Debojyoti Dutta
Ian Eisenberg
James R. Ezick
Heather Frase
Brian Fuller
Ram Gandikota
Agasthya Gangavarapu
Ananya Gangavarapu
James Gealy
Rajat Ghosh
James Goel
Usman Gohar
Sujata Goswami
Scott A. Hale
Wiebke Hutiri
Joseph Marvin Imperial
Surgan Jandial
Nicholas C. Judd
Felix Juefei-Xu
Bhavya Kailkhura
Hannah Rose Kirk
Kevin Klyman
Chris Knotz
Michael Kuchnik
Shachi H. Kumar
Chris Lengerich
Bo Li
Zeyi Liao
Eileen Peters Long
Victor Lu
Yifan Mai
Priyanka Mary Mammen
Kelvin Manyeki
Sean McGregor
Virendra Mehta
Shafee Mohammed
Emanuel Moss
Lama Nachman
Dinesh Jinenhally Naganna
Amin Nikanjam
Besmira Nushi
Luis Oala
Iftach Orr
Alicia Parrish
Çigdem Patlak
William Pietri
Forough Poursabzi-Sangdeh
Eleonora Presani
Fabrizio Puletti
Paul Rottger
Saurav Sahay
Tim Santos
Nino Scherrer
Alice Schoenauer Sebag
Patrick Schramowski
Abolfazl Shahbazi
Vin Sharma
Xudong Shen
Vamsi Sistla
Leonard Tang
Davide Testuggine
Vithursan Thangarasa
Elizabeth A Watkins
Rebecca Weiss
Christopher A. Welty
Tyler Wilbers
Adina Williams
Carole-Jean Wu
Poonam Yadav
Xianjun Yang
Yi Zeng
Wenhui Zhang
Fedor Zhdanov
Jiacheng Zhu
Percy Liang
Peter Mattson
Joaquin Vanschoren
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
Tackling the XAI Disagreement Problem with Regional Explanations
Gabriel Laberge
Yann Batiste Pequignot
Mario Marchand
PathOCl: Path-Based Prompt Augmentation for OCL Generation with GPT-4
Seif Abukhalaf
Mohammad Hamdaqa
The rapid progress of AI-powered programming assistants, such as GitHub Copilot, has facilitated the development of software applications. These assistants rely on large language models (LLMs), which are foundation models (FMs) that support a wide range of tasks related to understanding and generating language. LLMs have demonstrated their ability to express UML model specifications using formal languages like the Object Constraint Language (OCL). However, the context size of the prompt is limited by the number of tokens an LLM can process. This limitation becomes significant as the size of UML class models increases. In this study, we introduce PathOCL, a novel path-based prompt augmentation technique designed to facilitate OCL generation. PathOCL addresses the limitations of LLMs, specifically their token processing limit and the challenges posed by large UML class models. PathOCL is based on the concept of chunking, which selectively augments the prompts with a subset of UML classes relevant to the English specification. Our findings demonstrate that PathOCL, compared to augmenting the complete UML class model (UML-Augmentation), generates a higher number of valid and correct OCL constraints using the GPT-4 model. Moreover, the average prompt size crafted using PathOCL significantly decreases when scaling the size of the UML class models.
Machine Learning Robustness: A Primer
Houssem Ben Braiek
This chapter explores the foundational concept of robustness in Machine Learning (ML) and its integral role in establishing trustworthiness in Artificial Intelligence (AI) systems. The discussion begins with a detailed definition of robustness, portraying it as the ability of ML models to maintain stable performance across varied and unexpected environmental conditions. ML robustness is dissected through several lenses: its complementarity with generalizability; its status as a requirement for trustworthy AI; its adversarial vs non-adversarial aspects; its quantitative metrics; and its indicators such as reproducibility and explainability. The chapter delves into the factors that impede robustness, such as data bias, model complexity, and the pitfalls of underspecified ML pipelines. It surveys key techniques for robustness assessment from a broad perspective, including adversarial attacks, encompassing both digital and physical realms. It covers non-adversarial data shifts and nuances of Deep Learning (DL) software testing methodologies. The discussion progresses to explore amelioration strategies for bolstering robustness, starting with data-centric approaches like debiasing and augmentation. Further examination includes a variety of model-centric methods such as transfer learning, adversarial training, and randomized smoothing. Lastly, post-training methods are discussed, including ensemble techniques, pruning, and model repairs, emerging as cost-effective strategies to make models more resilient against the unpredictable. This chapter underscores the ongoing challenges and limitations in estimating and achieving ML robustness by existing approaches. It offers insights and directions for future research on this crucial concept, as a prerequisite for trustworthy AI systems.
Bugs in Large Language Models Generated Code: An Empirical Study
Florian Tambon
Arghavan Moradi Dakhel
Amin Nikanjam
Michel C. Desmarais
Giuliano Antoniol
Large Language Models (LLMs) for code have gained significant attention recently. They can generate code in different programming languages based on provided prompts, fulfilling a long-lasting dream in Software Engineering (SE), i.e., automatic code generation. Similar to human-written code, LLM-generated code is prone to bugs, and these bugs have not yet been thoroughly examined by the community. Given the increasing adoption of LLM-based code generation tools (e.g., GitHub Copilot) in SE activities, it is critical to understand the characteristics of bugs contained in code generated by LLMs. This paper examines a sample of 333 bugs collected from code generated using three leading LLMs (i.e., CodeGen, PanGu-Coder, and Codex) and identifies the following 10 distinctive bug patterns: Misinterpretations, Syntax Error, Silly Mistake, Prompt-biased code, Missing Corner Case, Wrong Input Type, Hallucinated Object, Wrong Attribute, Incomplete Generation, and Non-Prompted Consideration. The bug patterns are presented in the form of a taxonomy. The identified bug patterns are validated using an online survey with 34 LLM practitioners and researchers. The surveyed participants generally asserted the significance and prevalence of the bug patterns. Researchers and practitioners can leverage these findings to develop effective quality assurance techniques for LLM-generated code. This study sheds light on the distinctive characteristics of LLM-generated code.
Assessing the Security of GitHub Copilot Generated Code - A Targeted Replication Study
Vahid Majdinasab
Michael Joshua Bishop
Shawn Rasheed
Arghavan Moradi Dakhel
Amjed Tahir
Deep Learning Model Reuse in the HuggingFace Community: Challenges, Benefit and Trends
Mina Taraghi
Gianolli Dorcelus
Armstrong Foundjem
Florian Tambon
The ubiquity of large-scale Pre-Trained Models (PTMs) is on the rise, sparking interest in model hubs, and dedicated platforms for hosting PTMs. Despite this trend, a comprehensive exploration of the challenges that users encounter and how the community leverages PTMs remains lacking. To address this gap, we conducted an extensive mixed-methods empirical study by focusing on discussion forums and the model hub of HuggingFace, the largest public model hub. Based on our qualitative analysis, we present a taxonomy of the challenges and benefits associated with PTM reuse within this community. We then conduct a quantitative study to track model-type trends and model documentation evolution over time. Our findings highlight prevalent challenges such as limited guidance for beginner users, struggles with model output comprehensibility in training or inference, and a lack of model understanding. We also identified interesting trends among models where some models maintain high upload rates despite a decline in topics related to them. Additionally, we found that despite the introduction of model documentation tools, its quantity has not increased over time, leading to difficulties in model comprehension and selection among users. Our study sheds light on new challenges in reusing PTMs that were not reported before and we provide recommendations for various stakeholders involved in PTM reuse.
Refining GPT-3 Embeddings with a Siamese Structure for Technical Post Duplicate Detection
Xingfang Wu
Heng Li
Nobukazu Yoshioka
Hironori Washizaki
Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code
Vahid Majdinasab
Amin Nikanjam
Code auditing ensures that the developed code adheres to standards, regulations, and copyright protection by verifying that it does not contain code from protected sources. The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing. The dataset for training these models is mainly collected from publicly available sources. This raises the issue of intellectual property infringement as developers’ codes are already included in the dataset. Therefore, auditing code developed using LLMs is challenging, as it is difficult to reliably assert if an LLM used during development has been trained on specific copyrighted codes, given that we do not have access to the training datasets of these models. Given the non-disclosure of the training datasets, traditional approaches such as code clone detection are insufficient for asserting copyright infringement. To address this challenge, we propose a new approach, TraWiC, a model-agnostic and interpretable method based on membership inference for detecting code inclusion in an LLM’s training dataset. We extract syntactic and semantic identifiers unique to each program to train a classifier for detecting code inclusion. In our experiments, we observe that TraWiC is capable of detecting 83.87% of codes that were used to train an LLM. In comparison, the prevalent clone detection tool NiCad is only capable of detecting 47.64%. In addition to its remarkable performance, TraWiC has low resource overhead in contrast to pair-wise clone detection that is conducted during the auditing process of tools like CodeWhisperer reference tracker, across thousands of code snippets.