Jackie Cheung

Nadhem Benhadjali

Collaborating researcher

Meng (Caden) Cao

PhD - McGill University

Google Scholar

Aishik Chakraborty

PhD - McGill University

Khaoula Chehbouni

PhD - McGill University

Principal supervisor :

Master's Research - McGill University

Maxime Darrin

PhD - McGill University

Co-supervisor :

PhD - McGill University

Google Scholar

Aylin Erman

PhD - McGill University

Co-supervisor :

Dan Poenaru

Ori Ernst

Postdoctorate - McGill University

Master's Research - McGill University

Google Scholar

Jules Gagnon-marchand

Master's Research - McGill University

Sienna Hsu

Research Intern - McGill University University

Zichao Li

PhD - McGill University

Principal supervisor :

Siva Reddy

Caleb Moses

PhD - McGill University

Ian Porada

PhD - McGill University

PhD - McGill University

Sina Salmannia

Undergraduate - McGill University

Cesare Spinoso-Di Piano

PhD - McGill University

Sihui Wei

Undergraduate - McGill University

Xiyuan Zou

Master's Research - McGill University

Samira Abbasgholizadeh-Rahimi

Publications

Qualitative Code Suggestion: A Human-Centric Approach to Qualitative Coding

Cesare Spinoso-Di Piano

Qualitative coding is a content analysis method in which researchers read through a text corpus and assign descriptive labels or qualitative… (see more) codes to passages. It is an arduous and manual process which human-computer interaction (HCI) studies have shown could greatly benefit from NLP techniques to assist qualitative coders. Yet, previous attempts at leveraging language technologies have set up qualitative coding as a fully automatable classification problem. In this work, we take a more assistive approach by defining the task of qualitative code suggestion (QCS) in which a ranked list of previously assigned qualitative codes is suggested from an identified passage. In addition to being user-motivated, QCS integrates previously ignored properties of qualitative coding such as the sequence in which passages are annotated, the importance of rare codes and the differences in annotation styles between coders. We investigate the QCS task by releasing the first publicly available qualitative coding dataset, CVDQuoding, consisting of interviews conducted with women at risk of cardiovascular disease. In addition, we conduct a human evaluation which shows that our systems consistently make relevant code suggestions.

2023-12-01

Findings of the Association for Computational Linguistics: EMNLP 2023 (published)

Systematic Generalization by Finetuning? Analyzing Pretrained Language Models Using Constituency Tests

Aishik Chakraborty

Timothy O'Donnell

Constituents are groups of words that behave as a syntactic unit. Many linguistic phenomena (e.g., question formation, diathesis alternation… (see more)s) require the manipulation and rearrangement of constituents in a sentence. In this paper, we investigate how different finetuning setups affect the ability of pretrained sequence-to-sequence language models such as BART and T5 to replicate constituency tests — transformations that involve manipulating constituents in a sentence. We design multiple evaluation settings by varying the combinations of constituency tests and sentence types that a model is exposed to during finetuning. We show that models can replicate a linguistic transformation on a specific type of sentence that they saw during finetuning, but performance degrades substantially in other settings, showing a lack of systematic generalization. These results suggest that models often learn to manipulate sentences at a surface level unrelated to the constituent-level syntactic structure, for example by copying the first word of a sentence. These results may partially explain the brittleness of pretrained language models in downstream tasks.

2023-12-01

Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (published)

Investigating the Effect of Pre-finetuning BERT Models on NLI Involving Presuppositions

Jad Kabbara

2023-10-07

EMNLP/2023/Conference (published)

Responsible AI Considerations in Text Summarization Research: A Review of Current Practices

Yu Lu Liu

Meng Cao

Su Lin Blodgett

Alexandra Olteanu

Adam Trischler

AI and NLP publication venues have increasingly encouraged researchers to reflect on possible ethical considerations, adverse impacts, and o… (see more)ther responsible AI issues their work might engender. However, for specific NLP tasks our understanding of how prevalent such issues are, or when and why these issues are likely to arise, remains limited. Focusing on text summarization—a common NLP task largely overlooked by the responsible AI community—we examine research and reporting practices in the current literature. We conduct a multi-round qualitative analysis of 333 summarization papers from the ACL Anthology published between 2020–2022. We focus on how, which, and when responsible AI issues are covered, which relevant stakeholders are considered, and mismatches between stated and realized research goals. We also discuss current evaluation practices and consider how authors discuss the limitations of both prior work and their own work. Overall, we find that relatively few papers engage with possible stakeholders or contexts of use, which limits their consideration of potential downstream adverse impacts or other responsible AI issues. Based on our findings, we make recommendations on concrete practices and research directions.

2023-10-07

EMNLP/2023/Conference (published)

Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages

Rahul Aralikatte

Ziling Cheng

Sumanth Doddapaneni

We present V\=arta, a large-scale multilingual dataset for headline generation in Indic languages. This dataset includes 41.8 million news a… (see more)rticles in 14 different Indic languages (and English), which come from a variety of high-quality sources. To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available. We use the data collected in a series of experiments to answer important questions related to Indic NLP and multilinguality research in general. We show that the dataset is challenging even for state-of-the-art abstractive models and that they perform only slightly better than extractive baselines. Owing to its size, we also show that the dataset can be used to pretrain strong language models that outperform competitive baselines in both NLU and NLG benchmarks.

2023-05-10

ArXiv (preprint)

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Anya Belz

Craig Thomson

Ehud Reiter

Gavin Abercrombie

Jose M. Alonso-moral

Mohammad Arvan

Mark Cieliebak

Elizabeth Clark

Kees Van Deemter

Tanvi Dinkar

Ondrej Dusek

Steffen Eger

Qixiang Fang

Albert Gatt

Dimitra Gkatzia

Javier Gonz'alez-Corbelle

Dirk Hovy

Manuela Hurlimann

Takumi Ito … (see 19 more)

John D. Kelleher

Filip Klubicka

Huiyuan Lai

Chris van der Lee

Emiel van Miltenburg

Yiru Li

Saad Mahamood

Margot Mieskes

Malvina Nissim

Natalie Paige Parde

Ondvrej Pl'atek

Verena Teresa Rieser

Pablo Mosteiro Romero

Joel Joel Tetreault

Antonio Toral

Xiao-Yi Wan

Leo Wanner

Lewis Joshua Watson

Diyi Yang

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining wha… (see more)t makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

2023-05-02

ArXiv (preprint)

Unsupervised Layer-wise Score Aggregation for Textual OOD Detection

Maxime DARRIN

Guillaume Staerman

Eduardo Dadalto Câmara Gomes

Pablo Piantanida

Pierre Colombo

2023-02-20

ArXiv (preprint)

Systematic Rectification of Language Models via Dead-end Analysis

Meng Cao

Mehdi Fatemi

Samira Shabanian

With adversarial or otherwise normal prompts, existing large language models (LLM) can be pushed to generate toxic discourses. One way to re… (see more)duce the risk of LLMs generating undesired discourses is to alter the training of the LLM. This can be very restrictive due to demanding computation requirements. Other methods rely on rule-based or prompt-based token elimination, which are limited as they dismiss future tokens and the overall meaning of the complete discourse. Here, we center detoxification on the probability that the finished discourse is ultimately considered toxic. That is, at each point, we advise against token selections proportional to how likely a finished text from this point will be toxic. To this end, we formally extend the dead-end theory from the recent reinforcement learning (RL) literature to also cover uncertain outcomes. Our approach, called rectification, utilizes a separate but significantly smaller model for detoxification, which can be applied to diverse LLMs as long as they share the same vocabulary. Importantly, our method does not require access to the internal representations of the LLM, but only the token probability distribution at each decoding step. This is crucial as many LLMs today are hosted in servers and only accessible through APIs. When applied to various LLMs, including GPT-3, our approach significantly improves the generated discourse compared to the base LLMs and other techniques in terms of both the overall language and detoxification performance.

2023-02-01

ICLR.cc/2023/Conference (poster)

Evaluating Dependencies in Fact Editing for Language Models: Specificity and Implication Awareness

Zichao Li

Ines Arous

Siva Reddy

The potential of using a large language model (LLM) as a knowledge base (KB) has sparked significant interest. To maintain the knowledge acq… (see more)uired by LLMs, we need to ensure that the editing of learned facts respects internal logical constraints, which are known as dependency of knowledge. Existing work on editing LLMs has partially addressed the issue of dependency, when the editing of a fact should apply to its lexical variations without disrupting irrelevant ones. However, they neglect the dependency between a fact and its logical implications. We propose an evaluation protocol with an accompanying question-answering dataset, StandUp, that provides a comprehensive assessment of the editing process considering the above notions of dependency. Our protocol involves setting up a controlled environment in which we edit facts and monitor their impact on LLMs, along with their implications based on If-Then rules. Extensive experiments on StandUp show that existing knowledge editing methods are sensitive to the surface form of knowledge, and that they have limited performance in inferring the implications of edited facts.

2023-01-01

EMNLP (Findings) (published)

How Useful Are Educational Questions Generated by Large Language Models?

Sabina Elkins

Ekaterina Kochmar

Iulian V. Serban

2023-01-01

AIED (Posters/Late Breaking Results/...) (published)

Responsible AI Considerations in Text Summarization Research: A Review of Current Practices

Yu Lu Liu

Meng Cao

Su Lin Blodgett

Alexandra Olteanu

Adam Trischler

2023-01-01

EMNLP (Findings) (published)

The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources

Akshatha Arodi

Martin Pömsl

Kaheer Suleman

Adam Trischler

Alexandra Olteanu

Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make in… (see more)ferences using information from multiple sources. An important class of such inferences are those that require both background knowledge, presumably contained in a model’s pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution subtasks that require reasoning over multiple facts. These subtasks differ in terms of which knowledge sources contain the relevant facts. We also introduce subtasks where knowledge is present only at inference time using fictional knowledge. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources. Still, even the best performing models seem to have difficulties with reliably integrating knowledge presented only at inference time.

2023-01-01

ACL (1) (published)