Arghavan Moradi Dakhel

Refactoring with LLMs: Bridging Human Expertise and Machine Understanding

Yonnel Chen Kuang Piao

Jean Carlors Paul

Leuson Da Silva

Mohammad Hamdaqa

2025-10-04

ArXiv (preprint)

Bugs in Large Language Models Generated Code: An Empirical Study

Florian Tambon

Amin Nikanjam

Michel C. Desmarais

Giuliano Antoniol

2025-02-13

Empirical Software Engineering (published)

Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs

Sylvain Kouemo Ngassom

Florian Tambon

2024-07-10

Proceedings of the 1st ACM International Conference on AI-Powered Software (published)

Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs

Sylvain Kouemo Ngassom

Florian Tambon

LLM-based assistants, such as GitHub Copilot and ChatGPT, have the potential to generate code that fulfills a programming task described in a natural language description, referred to as a prompt. The widespread accessibility of these assistants enables users with diverse backgrounds to generate code and integrate it into software projects. However, studies show that code generated by LLMs is prone to bugs and may miss various corner cases in task specifications. Presenting such buggy code to users can impact their reliability and trust in LLM-based assistants. Moreover, significant efforts are required by the user to detect and repair any bug present in the code, especially if no test cases are available. In this study, we propose a self-refinement method aimed at improving the reliability of code generated by LLMs by minimizing the number of bugs before execution, without human intervention, and in the absence of test cases. Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code. These VQs target various nodes within the Abstract Syntax Tree (AST) of the initial code, which have the potential to trigger specific types of bug patterns commonly found in LLM-generated code. Finally, our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code. Our evaluation, based on programming tasks in the CoderEval dataset, demonstrates that our proposed method outperforms state-of-the-art methods by decreasing the number of targeted errors in the code between 21% and 62% and improving the number of executable code instances to 13%.

2024-05-22

ArXiv (preprint)

Bugs in Large Language Models Generated Code: An Empirical Study

Florian Tambon

Amin Nikanjam

Michel C. Desmarais

Giuliano Antoniol

Large Language Models (LLMs) for code have gained significant attention recently. They can generate code in different programming languages based on provided prompts, fulfilling a long-lasting dream in Software Engineering (SE), i.e., automatic code generation. Similar to human-written code, LLM-generated code is prone to bugs, and these bugs have not yet been thoroughly examined by the community. Given the increasing adoption of LLM-based code generation tools (e.g., GitHub Copilot) in SE activities, it is critical to understand the characteristics of bugs contained in code generated by LLMs. This paper examines a sample of 333 bugs collected from code generated using three leading LLMs (i.e., CodeGen, PanGu-Coder, and Codex) and identifies the following 10 distinctive bug patterns: Misinterpretations, Syntax Error, Silly Mistake, Prompt-biased code, Missing Corner Case, Wrong Input Type, Hallucinated Object, Wrong Attribute, Incomplete Generation, and Non-Prompted Consideration. The bug patterns are presented in the form of a taxonomy. The identified bug patterns are validated using an online survey with 34 LLM practitioners and researchers. The surveyed participants generally asserted the significance and prevalence of the bug patterns. Researchers and practitioners can leverage these findings to develop effective quality assurance techniques for LLM-generated code. This study sheds light on the distinctive characteristics of LLM-generated code.

2024-03-13

ArXiv (preprint)

Assessing the Security of GitHub Copilot Generated Code - A Targeted Replication Study

Vahid Majdinasab

Michael Joshua Bishop

Shawn Rasheed

Amjed Tahir

2024-03-12

2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (published)

Assessing the Security of GitHub Copilot's Generated Code - A Targeted Replication Study

Vahid Majdinasab

Michael Joshua Bishop

Shawn Rasheed

Amjed Tahir

2023-11-18

ArXiv (preprint)

Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing

Amin Nikanjam

Vahid Majdinasab

Michel C. Desmarais

One of the critical phases in software development is software testing. Testing helps with identifying potential bugs and reducing maintenance costs. The goal of automated test generation tools is to ease the development of tests by suggesting efficient bug-revealing tests. Recently, researchers have leveraged Large Language Models (LLMs) of code to generate unit tests. While the code coverage of generated tests was usually assessed, the literature has acknowledged that the coverage is weakly correlated with the efficiency of tests in bug detection. To address this limitation, in this paper, we introduce MuTAP for improving the effectiveness of test cases generated by LLMs in terms of revealing bugs by leveraging mutation testing. Our goal is achieved by augmenting prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Programs Under Test (PUTs). We employ different LLMs within MuTAP and evaluate their performance on different benchmarks. Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets. Among these, 17% remained undetected by both the current state-of-the-art fully automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57% on synthetic buggy code, outperforming all other approaches in our evaluation. Our findings suggest that although LLMs can serve as a useful tool to generate test cases, they require specific post-processing steps to enhance the effectiveness of the generated test cases, which may suffer from syntactic or functional errors and may be ineffective in detecting certain types of bugs and testing corner cases of PUTs.

2023-08-31

ArXiv (preprint)

Dev2vec: Representing Domain Expertise of Developers in an Embedding Space

Michel C. Desmarais

2023-07-01

Information and Software Technology (published)

GitHub Copilot AI pair programmer: Asset or Liability?

Vahid Majdinasab

Amin Nikanjam

Michel C. Desmarais

Z. Jiang

Automatic program synthesis is a long-lasting dream in software engineering. Recently, a promising Deep Learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies evaluate the correctness of Copilot solutions and report its issues, more empirical evaluations are necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (i) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (ii) comparing Copilot's proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, like sorting and implementing data structures. In the latter, a dataset of programming problems with human-provided solutions is used. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems; however, some solutions are buggy and non-reproducible. Moreover, Copilot has some difficulties in combining multiple methods to generate a solution. Comparing Copilot to humans, our results show that the correct ratio of humans' solutions is greater than Copilot's suggestions, while the buggy solutions generated by Copilot require less effort to be repaired.

2023-01-01

Journal of Systems and Software (published)