Portrait of Golnoosh Farnadi

Golnoosh Farnadi

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, McGill University, School of Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Visiting Faculty Researcher, Google
Research Topics
Deep Learning
Generative Models

Biography

Golnoosh Farnadi is an assistant professor at the School of Computer Science, McGill University, and an adjunct professor at Université de Montréal. She is a core academic member of Mila – Quebec Artificial Intelligence Institute and holds a Canada CIFAR AI Chair.

Farnadi founded and is a principal investigator of the EQUAL lab at Mila / McGill University. The EQUAL lab (EQuity & EQuality Using AI and Learning algorithms) is a cutting-edge research laboratory dedicated to advancing the fields of algorithmic fairness and responsible AI.

Current Students

PhD - HEC Montréal
Postdoctorate - McGill University
PhD - McGill University
Co-supervisor :
PhD - McGill University
Co-supervisor :
PhD - McGill University
Co-supervisor :
Master's Research - Université de Montréal
Principal supervisor :
PhD - McGill University
Collaborating researcher - UWindsor
PhD - McGill University
Co-supervisor :
PhD - McGill University
Co-supervisor :
Postdoctorate - McGill University
PhD - Université de Montréal
Co-supervisor :
Master's Research - McGill University

Publications

IDP-Bench: Benchmarking ability of LLMs to protect personal information in interdependent privacy contexts
Nicholas Vincent
Héber Hwang Arcolezi
Large language models (LLMs) are becoming widely deployed as personal AI assistants with access to sensitive user data, making privacy a maj… (see more)or challenge for their design and evaluation. Prior work focuses mainly on individual-level risks, overlooking \textbf{interdependent privacy (IDP)}--where one person's data may be revealed by others without their knowledge or consent. We address this gap by introducing \textbf{IDP-Bench}: the first LLM benchmark for IDP scenarios, grounded in the Contextual Integrity (CI) framework. We evaluate eight open-source LLMs on their understanding of IDP scenarios across three levels of IDP reasoning using two LLM judges. Results show strong co-ownership recognition (6/8 models exceed 90\%) but persistent weaknesses in identifying CI parameters (information attribute, primary subject) and IDP-specific parameters such as secondary subjects, where 7/8 models score below 74\%. Models also struggle to judge sharing appropriateness (5/8 scoring below 77\%). While the ability to judge the appropriateness of sharing improves with scale, performance tends to decline in smaller models, and prompt sensitivity remains high on IDP-specific questions--highlighting the need for more targeted study of IDP in LLM privacy research. Data \& code available \href{https://github.com/tisl-lab/Interdependent_Privacy_Bench}{here}.
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
Omar Mahmoud
Aly M. Kassem
Thommen George Karimpanal
Buddhika Laknath Semage
Santu Rana
Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to sp… (see more)ecific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma~3, and Llama~3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.
Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation
Mixture-of-Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual… (see more) setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre-training of an English-centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre-training leads to diffused, language-agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token-level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter-efficient adaptation strategy that updates language-specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance-efficiency trade-off, attaining competitive performance relative to fine-tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre-training and provide practical insights for low-resource multilingual adaptation. Our code is available at https://github.com/aditi184/moe-routing-adaptation.
Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localiz… (see more)ed erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.
DELTA-CROSSCODER: ROBUST CROSSCODER IN NARROW FINE-TUNING REGIMES
Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shar… (see more)ed dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines Dual-K BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across synthetic false facts, emergent misalignment, subliminal learning, and taboo word games (Gemma, LLaMA, Qwen; 1B–7B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, substantially outperforming baselines. Our results demonstrate that narrow fine-tuning induces distinctive, recoverable latent shifts and that crosscoder methods remain powerful tools for model diffing.
Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs
Sophia Osborne
Mira Kandlikar-Bloch
Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gen… (see more)der biases. Chain-of-Thought (CoT) prompting has been proposed as an approach for bias mitigation. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model's internal mechanisms. In this work, we present an investigation of how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques, and qualitative analysis of reasoning outputs. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. While mechanistic analyses reveal clusters of attention heads whose biased behavior is lessened with CoT, gender bias information remains pervasive throughout hidden representations, indicating any improvements from CoT are superficial and fail to transform internal processing of gender bias. A closer inspection of the reasoning chains themselves shows poor quality CoT by which the models dissociate, hallucinate, and evade the present task rather than meaningfully engage with prompt material.
Objective Misalignment in LLM-based Multi Agent Social Deception Game
Large language model–based multi-agent systems have attracted increasing attention for their strong performance in collaborative tasks and… (see more) social simulations. However, these interactive settings also introduce vulnerabilities, as a single agent's hidden goals and misaligned behavior can propagate misleading or malicious information throughout the system. In this work, we study these risks in the context of social deception games. We focus on the Werewolf Game, which requires agents to reason, communicate, and collaborate under asymmetric and incomplete information. We modify the individual objectives of some agents to induce benevolent, individualistic, and malevolent strategies that can make agents depart from the objectives of their own team. We evaluate how objective divergence affects game outcomes, collaboration, and goal satisfaction. Misaligned agents often succeed in achieving their own objectives, with effects amplified by role-based power asymmetries. Qualitative analyses further show that agents remain coherent and adaptive, strategically adjusting their reasoning, communication, voting behavior, and influence on group dynamics. These results indicate that risks in LLM-based multi-agent systems extend beyond collaborative task settings and persist even in environments where competition is structurally expected.
Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes
Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shar… (see more)ed dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines, while matching the Non-SAE-based. Our results demonstrate that crosscoders remain a powerful tool for model diffing.
Position: Auditing Is Not Evaluating; LLM Audit Requires Dynamic, Contextual, Budget-Aware and Reliable Evidence
Auditing large language models (LLMs) is increasingly urgent as these systems are deployed in high-stakes settings, yet existing evaluation … (see more)practices are ill-suited to meet auditing requirements. Directly repurposing standard evaluation tools can yield incomplete or misleading conclusions, e.g. overstating robustness when evidence comes from static prompts rather than adaptive, real-world interactions. This position paper argues that effective LLM audits must instead generate dynamic, context-sensitive, budget-aware, and reliable evidence. To support this position, we analyze how each of these principles can be operationalized through a four-component framework: Auditing Scope, Interactor, Evaluator, and Output. We highlight design requirements, assumptions, limitations and research directions, demonstrating how high-level principles can be translated into concrete, actionable, evidence-based procedures.
Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs
As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts present… (see more)s unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.
Algorithmic Fairness Across Alignment Procedures and Agentic Systems
Zeyu Tang
Awa Dieng
Miriam Rateike
Jamelle Watson-Daniels
Jessica Schrouff
Sanmi Koyejo
AI has transitioned from predictive models to interactive, autonomous agents capable of reasoning, planning, and executing complex goals. As… (see more) the systems increasingly influence social, economic, and scientific decisions, they determine whose interests are represented and whose opportunities are constrained. Ensuring fairness, therefore, is no longer an ethical preference but a practical imperative. As the fairness challenges are fundamentally transformed by advanced AI systems, traditional algorithmic fairness frameworks developed primarily for prediction and/or prediction-based decision-making no longer suffice. This workshop, _Algorithmic Fairness Across Alignment Procedures and Agentic Systems_ (AFAA), emerges at this pivotal moment as a timely forum for rethinking fairness in AI alignment processes and agentic system development. By examining fairness across alignment procedures and agentic systems, this workshop creates a crucial platform for bridging the gap between rapid technical advances in model capabilities and the equally important advances needed in frameworks of algorithmic fairness to govern these powerful systems.
Understanding the role of depth in the neural tangent kernel for overparameterized neural networks