Portrait of Jin Guo

Jin Guo

Associate Academic Member
Assistant Professor, McGill University, School of Computer Science
Research Topics
Human-AI interaction
Human-Centered AI
Human-Computer Interaction (HCI)
Privacy
Responsible AI

Biography

Jin L.C. Guo is an assistant professor at the School of Computer Science, McGill University.

She is interested in using AI techniques to solve software engineering problems. Her recent research focuses on mining domain knowledge from software traceability data and using such knowledge to facilitate automated SE tasks, such as trace retrieval and project Q&A.

Guo completed her PhD at the University of Notre Dame. Prior to that, she worked on image processing and computer vision in Fuji Xerox’s research lab.

Current Students

PhD - McGill University
Master's Research - McGill University
PhD - McGill University
Principal supervisor :
PhD - McGill University
Research Intern - McGill University
Principal supervisor :
PhD - McGill University
Co-supervisor :
Postdoctorate - McGill University
Principal supervisor :
Master's Research - McGill University
Master's Research - McGill University

Publications

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen Manh
Nam Le Hai
Anh T. V. Dau
Anh Minh Nguyen
Khanh Nghiem
Nghi D. Q. Bui
We present The Vault, an open-source dataset of high quality code-text pairs in multiple programming languages for training large language m… (see more)odels to understand and generate code. We propose methods for thoroughly extracting samples that use both rules and deep learning to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. We thoroughly evaluated this dataset and discovered that when used to train common code language models (such as CodeT5, CodeBERT, and CodeGen), it outperforms the same models train on other datasets such as CodeSearchNet. These evaluations included common coding tasks such as code generation, code summarization, and code search. The Vault can be used by researchers and practitioners to train a wide range of big language models that understand code. Alternatively, researchers can use our data cleaning methods and scripts to improve their own datasets. We anticipate that using The Vault to train large language models will improve their ability to understand and generate code, propelling AI research and software development forward. We are releasing our source code and a framework to make it easier for others to replicate our results.