Jackie Cheung

Core Academic Member
Canada CIFAR AI Chair
Associate Scientific Director, Mila
Associate Professor, McGill University, School of Computer Science
Consultant Researcher, Microsoft Research
Research Topics
Deep Learning
Medical Machine Learning
Natural Language Processing
Reasoning

Biography

I am an associate professor in the School of Computer Science at McGill University and a consultant researcher at Microsoft Research.

My group investigates natural language processing, an area of AI research that builds computational models of human languages, such as English or French. The goal of our research is to develop computational methods for understanding text and speech and for generating language that is fluent and context-appropriate.

In our lab, we investigate statistical machine learning techniques for analyzing and making predictions about language. Some of my current projects focus on summarizing fiction, extracting events from text, and adapting language across genres.

Current Students

Postdoctorate - McGill University
PhD - McGill University
Co-supervisor :
Postdoctorate - McGill University
Research Intern - McGill University
PhD - McGill University
PhD - McGill University
PhD - McGill University
Principal supervisor :
Master's Research - McGill University
PhD - McGill University
Research Intern - McGill University
PhD - McGill University
Co-supervisor :
Master's Research - McGill University
PhD - McGill University
Co-supervisor :
Postdoctorate - McGill University
Master's Research - McGill University
Master's Research - McGill University
Research Intern - McGill University
Research Intern - McGill University
PhD - McGill University
Principal supervisor :
Master's Research - McGill University
PhD - McGill University
Master's Research - McGill University
PhD - McGill University
PhD - McGill University
Undergraduate - McGill University
PhD - McGill University
Research Intern - McGill University
Research Intern - McGill University

Publications

Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages
Rahul Aralikatte
Ziling Cheng
Sumanth Doddapaneni
We present Vārta, a large-scale multilingual dataset for headline generation in Indic languages. This dataset includes 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources. To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available. We use the data collected in a series of experiments to answer important questions related to Indic NLP and multilinguality research in general. We show that the dataset is challenging even for state-of-the-art abstractive models and that they perform only slightly better than extractive baselines. Owing to its size, we also show that the dataset can be used to pretrain strong language models that outperform competitive baselines in both NLU and NLG benchmarks.
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz
Craig Thomson
Ehud Reiter
Gavin Abercrombie
Jose M. Alonso-Moral
Mohammad Arvan
Mark Cieliebak
Elizabeth Clark
Kees Van Deemter
Tanvi Dinkar
Ondrej Dusek
Steffen Eger
Qixiang Fang
Albert Gatt
Dimitra Gkatzia
Javier González-Corbelle
Dirk Hovy
Manuela Hurlimann
Takumi Ito
John D. Kelleher
Filip Klubicka
Huiyuan Lai
Chris van der Lee
Emiel van Miltenburg
Yiru Li
Saad Mahamood
Margot Mieskes
Malvina Nissim
Natalie Paige Parde
Ondřej Plátek
Verena Teresa Rieser
Pablo Mosteiro Romero
Joel Tetreault
Antonio Toral
Xiao-Yi Wan
Leo Wanner
Lewis Joshua Watson
Diyi Yang
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
Systematic Rectification of Language Models via Dead-end Analysis
Meng Cao
Mehdi Fatemi
Samira Shabanian
With adversarial or otherwise normal prompts, existing large language models (LLM) can be pushed to generate toxic discourses. One way to reduce the risk of LLMs generating undesired discourses is to alter the training of the LLM. This can be very restrictive due to demanding computation requirements. Other methods rely on rule-based or prompt-based token elimination, which are limited as they dismiss future tokens and the overall meaning of the complete discourse. Here, we center detoxification on the probability that the finished discourse is ultimately considered toxic. That is, at each point, we advise against token selections proportional to how likely a finished text from this point will be toxic. To this end, we formally extend the dead-end theory from the recent reinforcement learning (RL) literature to also cover uncertain outcomes. Our approach, called rectification, utilizes a separate but significantly smaller model for detoxification, which can be applied to diverse LLMs as long as they share the same vocabulary. Importantly, our method does not require access to the internal representations of the LLM, but only the token probability distribution at each decoding step. This is crucial as many LLMs today are hosted in servers and only accessible through APIs. When applied to various LLMs, including GPT-3, our approach significantly improves the generated discourse compared to the base LLMs and other techniques in terms of both the overall language and detoxification performance.
Evaluating Dependencies in Fact Editing for Language Models: Specificity and Implication Awareness
Zichao Li
Ines Arous
The potential of using a large language model (LLM) as a knowledge base (KB) has sparked significant interest. To maintain the knowledge acquired by LLMs, we need to ensure that the editing of learned facts respects internal logical constraints, which are known as dependency of knowledge. Existing work on editing LLMs has partially addressed the issue of dependency, when the editing of a fact should apply to its lexical variations without disrupting irrelevant ones. However, they neglect the dependency between a fact and its logical implications. We propose an evaluation protocol with an accompanying question-answering dataset, StandUp, that provides a comprehensive assessment of the editing process considering the above notions of dependency. Our protocol involves setting up a controlled environment in which we edit facts and monitor their impact on LLMs, along with their implications based on If-Then rules. Extensive experiments on StandUp show that existing knowledge editing methods are sensitive to the surface form of knowledge, and that they have limited performance in inferring the implications of edited facts.
How Useful Are Educational Questions Generated by Large Language Models?
Sabina Elkins
Ekaterina Kochmar
Iulian V. Serban
Responsible AI Considerations in Text Summarization Research: A Review of Current Practices
Yu Lu Liu
Meng Cao
Su Lin Blodgett
Adam Trischler
The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources
Akshatha Arodi
Martin Pömsl
Kaheer Suleman
Adam Trischler
Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make inferences using information from multiple sources. An important class of such inferences are those that require both background knowledge, presumably contained in a model’s pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution subtasks that require reasoning over multiple facts. These subtasks differ in terms of which knowledge sources contain the relevant facts. We also introduce subtasks where knowledge is present only at inference time using fictional knowledge. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources. Still, even the best performing models seem to have difficulties with reliably integrating knowledge presented only at inference time.
A Multifaceted Framework to Evaluate Evasion, Content Preservation, and Misattribution in Authorship Obfuscation Techniques
Malik H. Altakrori
Thomas Scialom
Does Pre-training Induce Systematic Inference? How Masked Language Models Acquire Commonsense Knowledge
MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification
Yu Lu Liu
Rachel Bawden
Thomas Scialom
Benoît Sagot
Characterizing Idioms: Conventionality and Contingency
Michaela Socolof
Michael Wagner
Idioms are unlike most phrases in two important ways. First, words in an idiom have non-canonical meanings. Second, the non-canonical meanings of words in an idiom are contingent on the presence of other words in the idiom. Linguistic theories differ on whether these properties depend on one another, as well as whether special theoretical machinery is needed to accommodate idioms. We define two measures that correspond to the properties above, and we show that idioms fall at the expected intersection of the two dimensions, but that the dimensions themselves are not correlated. Our results suggest that introducing special machinery to handle idioms may not be warranted.
Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization
Meng Cao
Yue Dong