Jackie Cheung

Ines Arous

Collaborating Alumni - McGill University

PhD - McGill University

Collaborating Alumni - McGill University

Aishik Chakraborty

PhD - McGill University

Khaoula Chehbouni

PhD - McGill University

Principal supervisor :

Master's Research - McGill University

Maxime Darrin

PhD - McGill University

Co-supervisor :

PhD - McGill University

Aylin Erman

PhD - McGill University

Co-supervisor :

Dan Poenaru

Ori Ernst

Collaborating Alumni - McGill University

Master's Research - McGill University

Jie Gao

Collaborating researcher - McGill University

Co-supervisor :

Nikki Lobczowski

Langlois Henri

Master's Research - Paris-Saclay University

Principal supervisor :

Pablo Piantanida

Fanny Jourdan

Postdoctorate - École de technologie supérieure

Principal supervisor :

Pablo Piantanida

Zichao Li

PhD - McGill University

Principal supervisor :

Siva Reddy

Caleb Moses

PhD - McGill University

Sihan Qin

Undergraduate - McGill University

Shalaleh Rismani

Postdoctorate - McGill University

Co-supervisor :

PhD - McGill University

Sina Salmannia

Undergraduate - McGill University

Cesare Spinoso-Di Piano

PhD - McGill University

Michael Yu

Collaborating researcher - McGill University

Co-supervisor :

Nikki Lobczowski

Xiyuan Zou

Master's Research - McGill University

Publications

Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs

Ziling Cheng

Meng Cao

Marc-Antoine Rondeau

The widespread success of LLMs on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term _class-based (mis)generalization_, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model's internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, and (ii) feature selection is governed by two competing circuits (one prioritizing direct query-based reasoning, the other incorporating contextual cues) whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues, a behavior we term _stochastic chameleons_.

2025-09-24

colmweb.org/COLM/2025/Workshop/INTERPLAY (published)

Neither Valid Nor Reliable? Investigating the Use of LLMs as Judges

Khaoula Chehbouni

Mohammed Haddou

Golnoosh Farnadi

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (published)

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Khaoula Chehbouni

Mohammed Haddou

Golnoosh Farnadi

Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, LLMs as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation.

2025-07-25

colmweb.org/COLM/2025/Workshop/SoLaR (poster)

Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Ziling Cheng

Meng Cao

Leila Pishdad

Yanshuai Cao

Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.

2025-07-24

colmweb.org/COLM/2025/Workshop/XLLM-Reason-Plan (published)

Partial Perspectives: How LLMs Handle Logically Inconsistent Knowledge in Reasoning Tasks

Zichao Li

Ines Arous

Most natural language reasoning tasks in the research community assume consistent input knowledge. Nevertheless, real-world scenarios often involve inconsistent information, which might lead to divergent conclusions and is typically associated with varying levels of uncertainty. This raises a key research question: can large language models (LLMs) effectively handle uncertainty in their reasoning process to maximize knowledge consistency? In this paper, we propose a framework for evaluating reasoning over inconsistent knowledge. Our approach models uncertainty via weights of logical rules, leveraging Markov logic networks (MLN), which integrate probabilistic reasoning with first-order logic. This enables us to quantify inconsistencies in knowledge bases, and hence rigorously evaluate LLM reasoning. We introduce two tasks using this framework: 1) QA, which involves answering questions by integrating inconsistent knowledge; and 2) knowledge rectification, where we aim to rectify language models' acquired knowledge to improve consistency. We curate a dataset of 3,000 MLN-formatted knowledge bases to implement these tasks. We evaluate state-of-the-art LLMs on these tasks and highlight their limitations in uncertainty-aware reasoning over inconsistent logical knowledge.

2025-07-07

colmweb.org/COLM/2025/Conference (accepted)

Rethinking Full Finetuning from Pretraining Checkpoints in Active Learning for African Languages

Bonaventure F. P. Dossou

Ines Arous

2025-06-22

aclweb.org/ACL/2025/SRW (poster)

Machine-learning-assisted Preoperative Prediction of Pediatric Appendicitis Severity

Aylin Erman

Julia Ferreira

Waseem Abu Ashour

Elena Guadagno

Etienne St-Louis

Sherif Emil

Dan Poenaru

2025-01-01

Journal of Pediatric Surgery (published)

doi.org

Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset

Yash More

In an effort to mitigate the harms of large language models (LLMs), learning from human feedback (LHF) has been used to steer LLMs towards outputs that are intended to be both less harmful and more helpful. Despite the widespread adoption of LHF in practice, the quality of this feedback and its effectiveness as a safety mitigation technique remain unclear. This study addresses these issues by auditing the widely-used Helpful and Harmless (HH) dataset by Anthropic. Our work includes: (1) a thorough investigation of the dataset's content through both manual and automated evaluation; (2) experiments demonstrating the dataset's impact on models' safety; and (3) an analysis of the 100 most influential papers citing this dataset. Through our audit, we showcase how conceptualization failures and quality issues identified in the HH dataset can create additional harms by leading to disparate safety behaviors across demographic groups. Our findings highlight the need for more nuanced, context-sensitive approaches to safety mitigation in LLMs.

2024-11-12

ArXiv (preprint)

Do LLMs Build World Representations? Probing Through the Lens of State Abstraction

Zichao Li

Yanshuai Cao

2024-09-25

NeurIPS.cc/2024/Conference (poster)