Jin Guo

Associate Academic Member

Assistant Professor, McGill University, School of Computer Science

Research Topics

Human-Centered AI

Responsible AI

Human-AI Interaction

Human-Computer Interaction (HCI)

Privacy


Biography

Jin L.C. Guo received her PhD from the University of Notre Dame. She is interested in applying artificial intelligence techniques to software engineering problems. Her recent research focuses on mining domain knowledge from software traceability data and on using that knowledge to support automated software engineering tasks such as trace retrieval and project question answering. Before her PhD, she worked at the Fuji Xerox research laboratory on image processing and computer vision.

Current Students

PhD - McGill

Research Master's - McGill

Lyu Fuyuan

PhD - McGill

Principal supervisor:

PhD - McGill

Fanny Lasne

Research Intern - McGill

Principal supervisor:

Jackie Cheung


Tamara Paris Paris

PhD - McGill

Co-supervisor:

AJung Moon

Shalaleh Rismani

Postdoc - McGill

Principal supervisor:

Research Master's - McGill


Veronica Xia

Research Master's - McGill

Publications

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Dung Nguyen Manh

Nam Le Hai

Anh T. V. Dau

Anh Minh Nguyen

Khanh Nghiem

Jin Guo

Nghi D. Q. Bui

We present The Vault, an open-source dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We propose methods for thoroughly extracting samples that use both rules and deep learning to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. We thoroughly evaluated this dataset and discovered that when used to train common code language models (such as CodeT5, CodeBERT, and CodeGen), it outperforms the same models trained on other datasets such as CodeSearchNet. These evaluations included common coding tasks such as code generation, code summarization, and code search. The Vault can be used by researchers and practitioners to train a wide range of large language models that understand code. Alternatively, researchers can use our data cleaning methods and scripts to improve their own datasets. We anticipate that using The Vault to train large language models will improve their ability to understand and generate code, propelling AI research and software development forward. We are releasing our source code and a framework to make it easier for others to replicate our results.

2022-12-31

Findings of the Association for Computational Linguistics: EMNLP 2023 (published)

