Ziling Cheng

Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs

Meng Cao

Marc-Antoine Rondeau

Jackie CK Cheung

The widespread success of LLMs on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that re… (see more)produce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term _class-based (mis)generalization_, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model's internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits --- one prioritizing direct query-based reasoning, the other incorporating contextual cues --- whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues — what we term _stochastic chameleons_.

2025-09-23

colmweb.org/COLM/2025/Workshop/INTERPLAY (published)

openreview.net

Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Ziling Cheng

Meng Cao

Leila Pishdad

Yanshuai Cao

Jackie CK Cheung

Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for r… (see more)easoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.

2025-07-23

colmweb.org/COLM/2025/Workshop/XLLM-Reason-Plan (published)

openreview.net

Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages

Rahul Aralikatte

Ziling Cheng

Sumanth Doddapaneni

Jackie Chi Kit Cheung

We present Varta, a large-scale multilingual dataset for headline generation in Indic languages. This dataset includes more than 41 million … (see more)pairs of headlines and articles in 14 different Indic languages (and English), which come from a variety of high-quality news sources. To the best of our knowledge, this is the largest collection of curated news articles for Indic languages currently available. We use the collected data in a series of experiments to answer important questions related to Indic NLP and multilinguality research in general. We show that the dataset is challenging even for state-of-the-art abstractive models and that they perform only slightly better than extractive baselines. Owing to its size, we also show that the dataset can be used to pre-train strong language models that outperform competitive baselines in both NLU and NLG benchmarks.

2023-05-09

ArXiv (preprint)

doi.org

arxiv.org

McGill BabyLM Shared Task Submission: The Effects of Data Formatting and Structural Biases

Ziling Cheng

Rahul Aralikatte

Ian Porada

Cesare Spinoso-Di Piano

Jackie CK Cheung

In this study, we describe our submission to the 2023 BabyLM shared-task's strict-small track.Our findings demonstrate the feasibility of tr… (see more)aining high-performing models within the constraints of limited data, computational resources, and time.We provide evidence that the formatting of input can significantly impact downstream performance.Furthermore, the induction of structural biases into the models through the use of part-of-speech trees yields modest benefits.Our most successful model achieves 79% on the BLiMP evaluations and 72% on the SuperGLUE evaluations.

2022-12-31

BabyLM Challenge @ Conference on Computational Natural Language Learning (published)

doi.org

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Ziling Cheng

Publications

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Popular keywords:

Ziling Cheng

Publications