
Jackie Cheung

Core Academic Member
Canada CIFAR AI Chair
Associate Scientific Director, Mila; Associate Professor, School of Computer Science, McGill University
Consulting Researcher, Microsoft Research
Research Topics
Medical Machine Learning
Deep Learning
Reasoning
Natural Language Processing

Biography

I am an Associate Professor in the School of Computer Science at McGill University and a Consulting Researcher at Microsoft Research.

My group conducts research in natural language processing (NLP), a subfield of artificial intelligence concerned with building computational models of human languages such as English or French. The goal of our research is to develop computational methods for understanding text and speech, in order to generate language that is fluent and appropriate to its context.

In our lab, we study statistical machine learning techniques for analyzing and making predictions about language. Current projects include summarizing fiction, extracting events from text, and adapting language to different genres.

Current Students

PhD - McGill
Co-supervisor:
Postdoctorate - McGill
Research Intern - McGill
PhD - McGill
PhD - McGill
Principal supervisor:
Master's Research - McGill
PhD - McGill
Research Intern - McGill
PhD - McGill
Co-supervisor:
Master's Research - McGill
Master's Research - McGill
Co-supervisor:
Postdoctorate - McGill
Master's Research - McGill
Master's Research - McGill
Research Intern - McGill University
Research Intern - McGill
PhD - McGill
Principal supervisor:
Master's Research - McGill
PhD - McGill
Master's Research - McGill
PhD - McGill
Undergraduate - McGill
PhD - McGill
Research Intern - McGill University
Research Intern - McGill

Publications

Investigating Failures to Generalize for Coreference Resolution Models
Ian Porada
Kaheer Suleman
Adam Trischler
Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized -- i.e., how the theoretical concept of coreference is operationalized in the dataset -- due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with existing differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This breakdown helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models; future work can then explicitly account for those types of coreference that are empirically associated with poor generalization when developing models.
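The per-type breakdown described above can be illustrated with a minimal sketch, assuming coreference links have already been tagged with a type label (generic mention, compound modifier, copula predicate, etc.). The data layout and helper name below are hypothetical illustrations, not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical link records: (mention_pair, coref_type) tuples. The type
# labels mirror the categories discussed in the paper, but the data
# layout is an illustrative assumption.
def per_type_scores(gold_links, pred_links):
    """Break precision/recall down by coreference type."""
    gold_by_type = defaultdict(set)
    pred_by_type = defaultdict(set)
    for pair, ctype in gold_links:
        gold_by_type[ctype].add(pair)
    for pair, ctype in pred_links:
        pred_by_type[ctype].add(pair)

    scores = {}
    for ctype in set(gold_by_type) | set(pred_by_type):
        gold, pred = gold_by_type[ctype], pred_by_type[ctype]
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        scores[ctype] = {"precision": precision, "recall": recall}
    return scores

# Toy usage: a model evaluated on two link types.
gold = [((("the dogs", 3), ("they", 7)), "generic"),
        ((("Mary", 0), ("she", 5)), "pronominal")]
pred = [((("Mary", 0), ("she", 5)), "pronominal")]
print(per_type_scores(gold, pred))
```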
GLIMPSE: Pragmatically Informative Multi-Document Summarization for Scholarly Reviews
Maxime Darrin
Ines Arous
Scientific peer review is essential for the quality of academic publications. However, the increasing number of paper submissions to conferences has strained the reviewing process. This surge poses a burden on area chairs who have to carefully read an ever-growing volume of reviews and discern each reviewer's main arguments as part of their decision process. In this paper, we introduce GLIMPSE, a summarization method designed to offer a concise yet comprehensive overview of scholarly reviews. Unlike traditional consensus-based methods, GLIMPSE extracts both common and unique opinions from the reviews. We introduce novel uniqueness scores based on the Rational Speech Act framework to identify relevant sentences in the reviews. Our method aims to provide a pragmatic glimpse into all reviews, offering a balanced perspective on their opinions. Our experimental results with both automatic metrics and human evaluation show that GLIMPSE generates more discriminative summaries than baseline methods in terms of human evaluation while achieving comparable performance with these methods in terms of automatic metrics.
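As a rough illustration of a Rational-Speech-Act-style uniqueness score (not GLIMPSE's actual model), the sketch below scores a sentence by how confidently a literal listener could identify which review it came from; the smoothed unigram likelihood and the smoothing constant are assumptions made for the example.

```python
import math
from collections import Counter

# A sentence is "unique" to its source review if a literal listener,
# upon reading it, would assign high probability to that review among
# all reviews of the same paper.
def listener_distribution(sentence, reviews, alpha=0.1):
    """P(review | sentence) under a smoothed unigram likelihood (illustrative)."""
    words = sentence.lower().split()
    logps = []
    for review in reviews:
        counts = Counter(review.lower().split())
        total = sum(counts.values())
        logp = sum(math.log((counts[w] + alpha) / (total + alpha * len(words)))
                   for w in words)
        logps.append(logp)
    m = max(logps)
    exps = [math.exp(lp - m) for lp in logps]
    z = sum(exps)
    return [e / z for e in exps]

def uniqueness(sentence, source_idx, reviews):
    """Higher when the sentence points back to its own review."""
    return listener_distribution(sentence, reviews)[source_idx]

reviews = ["The method is novel but the evaluation is weak.",
           "Strong evaluation, but the novelty is limited."]
print(uniqueness("the evaluation is weak", 0, reviews))
```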
When is an Embedding Model More Promising than Another?
Maxime Darrin
Philippe Formont
Ismail Ben Ayed
Embedders play a central role in machine learning, projecting any object into numerical representations that can, in turn, be leveraged to perform various downstream tasks. The evaluation of embedding models typically depends on domain-specific empirical approaches utilizing downstream tasks, primarily because of the lack of a standardized framework for comparison. However, acquiring adequately large and representative datasets for conducting these assessments is not always viable and can prove to be prohibitively expensive and time-consuming. In this paper, we present a unified approach to evaluate embedders. First, we establish theoretical foundations for comparing embedding models, drawing upon the concepts of sufficiency and informativeness. We then leverage these concepts to devise a tractable comparison criterion (information sufficiency), leading to a task-agnostic and self-supervised ranking procedure. We demonstrate experimentally that our approach aligns closely with the capability of embedding models to facilitate various downstream tasks in both natural language processing and molecular biology. This effectively offers practitioners a valuable tool for prioritizing model trials.
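The intuition behind a task-agnostic, self-supervised comparison can be sketched as follows: embedder A is treated as at least as informative as embedder B if B's representations are predictable from A's on unlabeled data. The ridge probe and R^2 aggregation below are illustrative stand-ins, not the information-sufficiency criterion defined in the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def predictability(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Mean cross-validated R^2 of predicting emb_b's dimensions from emb_a."""
    scores = [cross_val_score(Ridge(alpha=1.0), emb_a, emb_b[:, j], cv=3).mean()
              for j in range(emb_b.shape[1])]
    return float(np.mean(scores))

def rank_embedders(embeddings: dict[str, np.ndarray]) -> list[str]:
    """Rank each embedder by how well it predicts all the others."""
    names = list(embeddings)
    totals = {a: np.mean([predictability(embeddings[a], embeddings[b])
                          for b in names if b != a]) for a in names}
    return sorted(names, key=totals.get, reverse=True)

# Toy usage: three hypothetical embedders over the same 200 objects.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 16))
embs = {"model_a": base,
        "model_b": base @ rng.normal(size=(16, 8)),   # linear function of model_a
        "model_c": rng.normal(size=(200, 8))}          # unrelated noise
print(rank_embedders(embs))
```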
Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada
Mehdi Mousavi
Shabnam Shafiee
Jason M Harley
Introduction The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical education, and its performance has been tested for different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). Method Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMPs questions from the CFPC website. Two independent certified family physician reviewers scored AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviewers’ score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. Result According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. Reviewers’ scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of GPT-4 were correct. The GEE analysis confirmed that over five rounds, the likelihood of achieving a higher CFPC score percentage for GPT-4 was 2.31 times more than GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p<0.001). Similarly, the Reviewers’ score percentage for responses provided by GPT-4 over 5 rounds was 2.23 times more likely to exceed that of GPT-3.5.
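For readers unfamiliar with GEE models for repeated measures, the sketch below fits a simplified binomial GEE (rather than the ordinal logistic GEE used in the study) on a hypothetical per-question, per-round data layout using statsmodels; the column names and toy data are assumptions for the example.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical layout: one row per question per round per model, with a
# binary "correct" outcome. The study fits an ordinal logistic GEE on
# score percentages; this simplified binomial GEE only illustrates the
# repeated-measures structure (questions reused across rounds).
df = pd.DataFrame({
    "question_id": [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    "round":       [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "model":       ["gpt35"] * 6 + ["gpt4"] * 6,
    "correct":     [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
})

model = sm.GEE.from_formula(
    "correct ~ C(model, Treatment('gpt35')) + round",
    groups="question_id",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())
# Exponentiating the model coefficient gives an odds ratio analogous in
# spirit to the OR reported in the abstract.
```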
Ensemble Distillation for Unsupervised Constituency Parsing
Behzad Shayegh
Yanshuai Cao
Xiaodan Zhu
Lili Mou
ECBD: Evidence-Centered Benchmark Design for NLP
Yu Lu Liu
Su Lin Blodgett
Jackie Chi Kit Cheung
Q. Vera Liao
Ziang Xiao
Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which datasets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark's measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices -- e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
Balaur: Language Model Pretraining with Lexical Semantic Relations
Andrei Mircea
Qualitative Code Suggestion: A Human-Centric Approach to Qualitative Coding
Qualitative coding is a content analysis method in which researchers read through a text corpus and assign descriptive labels or qualitative codes to passages. It is an arduous and manual process which human-computer interaction (HCI) studies have shown could greatly benefit from NLP techniques to assist qualitative coders. Yet, previous attempts at leveraging language technologies have set up qualitative coding as a fully automatable classification problem. In this work, we take a more assistive approach by defining the task of qualitative code suggestion (QCS) in which a ranked list of previously assigned qualitative codes is suggested from an identified passage. In addition to being user-motivated, QCS integrates previously ignored properties of qualitative coding such as the sequence in which passages are annotated, the importance of rare codes and the differences in annotation styles between coders. We investigate the QCS task by releasing the first publicly available qualitative coding dataset, CVDQuoding, consisting of interviews conducted with women at risk of cardiovascular disease. In addition, we conduct a human evaluation which shows that our systems consistently make relevant code suggestions.
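A minimal baseline for the QCS setting (purely illustrative, not one of the paper's systems) could rank previously assigned codes by TF-IDF similarity between the new passage and the passages already labeled with each code; the example codes and passages below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def suggest_codes(passage, coded_passages):
    """Rank codes by the best similarity between the passage and any
    previously coded passage carrying that code.
    coded_passages: list of (text, code) pairs from earlier annotation."""
    texts = [text for text, _ in coded_passages]
    vectorizer = TfidfVectorizer().fit(texts + [passage])
    passage_vec = vectorizer.transform([passage])
    sims = cosine_similarity(passage_vec, vectorizer.transform(texts))[0]

    best = {}
    for (_, code), sim in zip(coded_passages, sims):
        best[code] = max(best.get(code, 0.0), sim)
    return sorted(best, key=best.get, reverse=True)

history = [("I worry about my blood pressure readings.", "health anxiety"),
           ("My doctor never explains the test results.", "communication gap")]
print(suggest_codes("The nurse did not explain what the numbers meant.", history))
```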
Investigating the Effect of Pre-finetuning BERT Models on NLI Involving Presuppositions
Jad Kabbara
Responsible AI Considerations in Text Summarization Research: A Review of Current Practices
Yu Lu Liu
Meng Cao
Su Lin Blodgett
Adam Trischler
AI and NLP publication venues have increasingly encouraged researchers to reflect on possible ethical considerations, adverse impacts, and other responsible AI issues their work might engender. However, for specific NLP tasks our understanding of how prevalent such issues are, or when and why these issues are likely to arise, remains limited. Focusing on text summarization, a common NLP task largely overlooked by the responsible AI community, we examine research and reporting practices in the current literature. We conduct a multi-round qualitative analysis of 333 summarization papers from the ACL Anthology published between 2020 and 2022. We focus on how, which, and when responsible AI issues are covered, which relevant stakeholders are considered, and mismatches between stated and realized research goals. We also discuss current evaluation practices and consider how authors discuss the limitations of both prior work and their own work. Overall, we find that relatively few papers engage with possible stakeholders or contexts of use, which limits their consideration of potential downstream adverse impacts or other responsible AI issues. Based on our findings, we make recommendations on concrete practices and research directions.
Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages
Rahul Aralikatte
Ziling Cheng
Sumanth Doddapaneni
We present Vārta, a large-scale multilingual dataset for headline generation in Indic languages. This dataset includes 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources. To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available. We use the data collected in a series of experiments to answer important questions related to Indic NLP and multilinguality research in general. We show that the dataset is challenging even for state-of-the-art abstractive models and that they perform only slightly better than extractive baselines. Owing to its size, we also show that the dataset can be used to pretrain strong language models that outperform competitive baselines in both NLU and NLG benchmarks.
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz
Craig Thomson
Ehud Reiter
Gavin Abercrombie
Jose M. Alonso-Moral
Mohammad Arvan
Mark Cieliebak
Elizabeth Clark
Kees Van Deemter
Tanvi Dinkar
Ondrej Dusek
Steffen Eger
Qixiang Fang
Albert Gatt
Dimitra Gkatzia
Javier González-Corbelle
Dirk Hovy
Manuela Hürlimann
Takumi Ito
John D. Kelleher
Filip Klubicka
Huiyuan Lai
Chris van der Lee
Emiel van Miltenburg
Yiru Li
Saad Mahamood
Margot Mieskes
Malvina Nissim
Natalie Paige Parde
Ondřej Plátek
Verena Teresa Rieser
Pablo Mosteiro Romero
Joel Tetreault
Antonio Toral
Xiao-Yi Wan
Leo Wanner
Lewis Joshua Watson
Diyi Yang
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.