Portrait of Golnoosh Farnadi

Golnoosh Farnadi

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, McGill University, School of Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Visiting Researcher, Google
Research Topics
Deep Learning
Generative Models

Biography

Golnoosh Farnadi is an Assistant Professor at the School of Computer Science, McGill University, and an Adjunct Professor at Université de Montréal. She is a Core Academic Member at Mila - Quebec Artificial Intelligence Institute and holds a Canada CIFAR AI Chair.

Farnadi founded the EQUAL lab at Mila / McGill University, where she is one of the principal investigators. The EQUAL lab (EQuity & EQuality Using AI and Learning algorithms) is a cutting-edge research lab dedicated to advancing the fields of algorithmic fairness and responsible AI.

Current Students

PhD - HEC
Postdoctorate - McGill
PhD - McGill
Co-supervisor:
PhD - McGill
Co-supervisor:
PhD - McGill
Co-supervisor:
Master's Research - UdeM
Principal supervisor:
Research Collaborator - UWindsor
PhD - McGill
Co-supervisor:
Research Collaborator - McGill
Collaborating Alumni - UdeM
Independent Visiting Researcher - McGill University
Research Collaborator - McGill
PhD - McGill
Co-supervisor:
Postdoctorate - McGill
PhD - UdeM
Co-supervisor:
Master's Research - McGill

Publications

Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.
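For intuition, the sketch below illustrates the spirit of the first stage only: attenuating a target-concept direction in a text embedding while leaving the subspace spanned by neighbor concepts untouched. It is a simplified hard-projection variant of the idea, not the paper's spectrally-weighted implementation; the function name, tensor shapes, and QR-based neighbor projection are illustrative assumptions.

```python
# A minimal, hypothetical sketch of neighbor-preserving concept attenuation in a
# text embedding (e.g. from a CLIP text encoder). Not the NLCE implementation.
import torch
import torch.nn.functional as F

def neighbor_aware_erase(prompt_emb: torch.Tensor,
                         target_emb: torch.Tensor,
                         neighbor_embs: torch.Tensor,
                         strength: float = 1.0) -> torch.Tensor:
    """Attenuate the target-concept direction in `prompt_emb` while keeping
    the subspace spanned by neighbor concepts intact.

    prompt_emb:    (d,)   embedding of the prompt to edit
    target_emb:    (d,)   embedding of the concept to erase
    neighbor_embs: (n, d) embeddings of related concepts to preserve
    """
    # Orthonormal basis of the neighbor subspace (directions we must not disturb).
    q, _ = torch.linalg.qr(neighbor_embs.T)              # (d, n)
    # Keep only the part of the target direction that is NOT shared with neighbors,
    # so erasing the target does not also erase its neighbors.
    target_resid = target_emb - q @ (q.T @ target_emb)
    target_dir = F.normalize(target_resid, dim=0)
    # Subtract the (scaled) projection of the prompt onto that residual direction.
    coeff = torch.dot(prompt_emb, target_dir)
    return prompt_emb - strength * coeff * target_dir
```

In the method described above, this kind of embedding modulation is combined with the attention-guided spatial gate and the spatially-gated hard erasure to localize the edit.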
Delta-Crosscoder: Robust Crosscoder in Narrow Fine-Tuning Regimes
Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines Dual-K BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across synthetic false facts, emergent misalignment, subliminal learning, and taboo word games (Gemma, LLaMA, Qwen; 1B–7B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, substantially outperforming baselines. Our results demonstrate that narrow fine-tuning induces distinctive, recoverable latent shifts and that crosscoder methods remain powerful tools for model diffing.
Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs
Sophia Osborne
Mira Kandlikar-Bloch
Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gender biases. Chain-of-Thought (CoT) prompting has been proposed as an approach for bias mitigation. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model's internal mechanisms. In this work, we present an investigation of how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques and qualitative analysis of reasoning outputs. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. While mechanistic analyses reveal clusters of attention heads whose biased behavior is lessened with CoT, gender bias information remains pervasive throughout hidden representations, indicating that any improvements from CoT are superficial and fail to transform internal processing of gender bias. A closer inspection of the reasoning chains themselves shows poor-quality CoT in which the models dissociate, hallucinate, and evade the present task rather than meaningfully engage with the prompt material.
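As a rough illustration of the kind of benchmark comparison described above, the sketch below scores how often a model prefers a stereotypical over an anti-stereotypical completion, with and without a chain-of-thought instruction prepended. The prompt wording, dataset fields, and scoring rule are assumptions for illustration, not the paper's protocol, and it sidesteps the paper's mechanistic analyses entirely.

```python
# Hypothetical stereotype-preference measurement under a Hugging Face-style
# causal LM interface; a simplification, not the study's evaluation pipeline.
import torch

@torch.no_grad()
def stereotype_rate(model, tokenizer, items, cot: bool = False) -> float:
    """items: list of dicts with 'context', 'stereotypical', 'anti_stereotypical'."""
    prefix = "Let's think step by step. " if cot else ""
    hits = 0
    for item in items:
        scores = []
        for option in (item["stereotypical"], item["anti_stereotypical"]):
            text = prefix + item["context"] + " " + option
            ids = tokenizer(text, return_tensors="pt").input_ids
            out = model(ids, labels=ids)
            scores.append(-out.loss.item())   # higher = more likely under the model
        hits += scores[0] > scores[1]
    return hits / len(items)

# The "bias gap" with vs. without CoT is then simply the difference:
# gap = stereotype_rate(model, tok, items, cot=False) - stereotype_rate(model, tok, items, cot=True)
```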
Objective Misalignment in LLM-based Multi Agent Social Deception Game
Large language model–based multi-agent systems have attracted increasing attention for their strong performance in collaborative tasks and social simulations. However, these interactive settings also introduce vulnerabilities, as a single agent's hidden goals and misaligned behavior can propagate misleading or malicious information throughout the system. In this work, we study these risks in the context of social deception games. We focus on the Werewolf Game, which requires agents to reason, communicate, and collaborate under asymmetric and incomplete information. We modify the individual objectives of some agents to induce benevolent, individualistic, and malevolent strategies that can make agents depart from the objectives of their own team. We evaluate how objective divergence affects game outcomes, collaboration, and goal satisfaction. Misaligned agents often succeed in achieving their own objectives, with effects amplified by role-based power asymmetries. Qualitative analyses further show that agents remain coherent and adaptive, strategically adjusting their reasoning, communication, voting behavior, and influence on group dynamics. These results indicate that risks in LLM-based multi-agent systems extend beyond collaborative task settings and persist even in environments where competition is structurally expected.
Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes
Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines while matching non-SAE-based ones. Our results demonstrate that crosscoders remain a powerful tool for model diffing.
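To make the mechanics named in the abstract concrete (a shared dictionary over paired base/fine-tuned activations, BatchTopK sparsity, and a delta-weighted reconstruction term), here is a simplified, hypothetical PyTorch sketch. Class and argument names are assumptions, and it omits details such as the Dual-K variant, decoder normalization, and the contrastive signal; it is not the paper's code.

```python
# A toy crosscoder with BatchTopK sparsity and a delta-weighted loss, in the
# spirit of the abstract above. Illustrative only.
import torch
import torch.nn as nn

class DeltaCrosscoder(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int, delta_weight: float = 1.0):
        super().__init__()
        self.k = k                        # average number of active latents per example
        self.delta_weight = delta_weight  # extra weight on base->fine-tuned changes
        self.enc_base = nn.Linear(d_model, n_latents, bias=False)
        self.enc_ft = nn.Linear(d_model, n_latents, bias=False)
        self.dec_base = nn.Linear(n_latents, d_model, bias=False)
        self.dec_ft = nn.Linear(n_latents, d_model, bias=False)

    def batch_topk(self, z: torch.Tensor) -> torch.Tensor:
        # BatchTopK: keep the largest B*k activations across the whole batch,
        # zeroing everything else (a global, batch-level sparsity constraint).
        k_total = min(z.shape[0] * self.k, z.numel())
        threshold = z.flatten().topk(k_total).values[-1]
        return z * (z >= threshold).to(z.dtype)

    def forward(self, h_base: torch.Tensor, h_ft: torch.Tensor):
        # Shared latent code computed from paired activations on matched inputs.
        z = torch.relu(self.enc_base(h_base) + self.enc_ft(h_ft))
        z = self.batch_topk(z)
        rec_base, rec_ft = self.dec_base(z), self.dec_ft(z)
        # Standard reconstruction terms plus a delta term that prioritizes
        # directions in which the fine-tuned model differs from the base model.
        loss_rec = (rec_base - h_base).pow(2).mean() + (rec_ft - h_ft).pow(2).mean()
        loss_delta = ((rec_ft - rec_base) - (h_ft - h_base)).pow(2).mean()
        return z, loss_rec + self.delta_weight * loss_delta
```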
Position: Auditing Is Not Evaluating; LLM Audit Requires Dynamic, Contextual, Budget-Aware and Reliable Evidence
Auditing large language models (LLMs) is increasingly urgent as these systems are deployed in high-stakes settings, yet existing evaluation practices are ill-suited to meet auditing requirements. Directly repurposing standard evaluation tools can yield incomplete or misleading conclusions, e.g. overstating robustness when evidence comes from static prompts rather than adaptive, real-world interactions. This position paper argues that effective LLM audits must instead generate dynamic, context-sensitive, budget-aware, and reliable evidence. To support this position, we analyze how each of these principles can be operationalized through a four-component framework: Auditing Scope, Interactor, Evaluator, and Output. We highlight design requirements, assumptions, limitations and research directions, demonstrating how high-level principles can be translated into concrete, actionable, evidence-based procedures.
Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs
As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts present… (voir plus)s unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.
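For readers unfamiliar with unlearning objectives, the sketch below shows one common, generic recipe (a gradient-difference step: push the loss up on a "forget" batch while keeping it down on a "retain" batch). The abstract does not commit to this exact objective; the sketch only illustrates the kind of update whose cross-lingual side effects such a study measures, and the Hugging Face-style model interface is an assumption.

```python
# Hypothetical gradient-difference unlearning step for a causal LM.
# Not necessarily the method used in the paper.
import torch

def unlearning_step(model, forget_batch, retain_batch, optimizer, retain_weight: float = 1.0):
    """One step: maximize loss on data to forget, minimize it on data to retain."""
    model.train()
    optimizer.zero_grad()
    forget_out = model(input_ids=forget_batch["input_ids"],
                       attention_mask=forget_batch["attention_mask"],
                       labels=forget_batch["input_ids"])
    retain_out = model(input_ids=retain_batch["input_ids"],
                       attention_mask=retain_batch["attention_mask"],
                       labels=retain_batch["input_ids"])
    # Negative sign on the forget loss implements the "unlearning" direction.
    loss = -forget_out.loss + retain_weight * retain_out.loss
    loss.backward()
    optimizer.step()
    return forget_out.loss.item(), retain_out.loss.item()
```

Cross-lingual transfer is then assessed by unlearning in one language and re-evaluating the forgotten facts or concepts in the other nine.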
Algorithmic Fairness Across Alignment Procedures and Agentic Systems
Zeyu Tang
Awa Dieng
Miriam Rateike
Jamelle Watson-Daniels
Jessica Schrouff
Sanmi Koyejo
AI has transitioned from predictive models to interactive, autonomous agents capable of reasoning, planning, and executing complex goals. As these systems increasingly influence social, economic, and scientific decisions, they determine whose interests are represented and whose opportunities are constrained. Ensuring fairness, therefore, is no longer an ethical preference but a practical imperative. As the fairness challenges are fundamentally transformed by advanced AI systems, traditional algorithmic fairness frameworks developed primarily for prediction and/or prediction-based decision-making no longer suffice. This workshop, _Algorithmic Fairness Across Alignment Procedures and Agentic Systems_ (AFAA), emerges at this pivotal moment as a timely forum for rethinking fairness in AI alignment processes and agentic system development. By examining fairness across alignment procedures and agentic systems, this workshop creates a crucial platform for bridging the gap between rapid technical advances in model capabilities and the equally important advances needed in frameworks of algorithmic fairness to govern these powerful systems.
Understanding the role of depth in the neural tangent kernel for overparameterized neural networks
Fairness in Federated Learning: Fairness for Whom?
Fairness in federated learning has emerged as a rapidly growing area of research, with numerous works proposing formal definitions and algorithmic interventions. Yet, despite this technical progress, fairness in FL is often defined and evaluated in ways that abstract away from the sociotechnical contexts in which these systems are deployed. In this paper, we argue that existing approaches tend to optimize narrow system-level metrics, such as performance parity or contribution-based rewards, while overlooking how harms arise throughout the FL lifecycle and how they impact diverse stakeholders. We support this claim through a critical analysis of the literature, based on a systematic annotation of papers for their fairness definitions, design decisions, evaluation practices, and motivating use cases. Our analysis reveals five recurring pitfalls: 1) fairness framed solely through the lens of the server-client architecture, 2) a mismatch between simulations and motivating use cases and contexts, 3) definitions that conflate protecting the system with protecting its users, 4) interventions that target isolated stages of the lifecycle while neglecting upstream and downstream effects, and 5) a lack of multi-stakeholder alignment where multiple fairness definitions can be relevant at once. Building on these insights, we propose a harm-centered framework that links fairness definitions to concrete risks and stakeholder vulnerabilities. We conclude with recommendations for more holistic, context-aware, and accountable fairness research in FL.
Mitigating Disparate Impact of Differential Privacy in Federated Learning through Robust Clustering
Federated Learning (FL) is a decentralized machine learning (ML) approach that keeps data localized and often incorporates Differential Privacy (DP) to enhance privacy guarantees. Similar to previous work on DP in ML, we observed that differentially private federated learning (DPFL) introduces performance disparities, particularly affecting minority groups. Recent work has attempted to address performance fairness in vanilla FL through clustering, but this method remains sensitive and prone to errors, which are further exacerbated by the DP noise in DPFL. To fill this gap, in this paper we propose a novel clustered DPFL algorithm designed to effectively identify clients' clusters in highly heterogeneous settings while maintaining high accuracy with DP guarantees. To this end, we propose to cluster clients based on both their model updates and training loss values. Our proposed approach also addresses the server's uncertainty in clustering clients' model updates by employing larger batch sizes along with a Gaussian Mixture Model (GMM) to alleviate the impact of noise and potential clustering errors, especially in privacy-sensitive scenarios. We provide a theoretical analysis of the effectiveness of our proposed approach. We also extensively evaluate our approach across diverse data distributions and privacy budgets and show its effectiveness in mitigating the disparate impact of DP in FL settings with a small computational cost.
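A minimal sketch of the server-side clustering step described above follows: fitting a Gaussian Mixture Model over each client's (DP-noised) model update together with its reported training loss. The feature construction (PCA compression) and hyperparameters are assumptions for illustration, not the paper's algorithm.

```python
# Hypothetical GMM-based clustering of clients from noisy updates and losses.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def cluster_clients(client_updates: np.ndarray,
                    client_losses: np.ndarray,
                    n_clusters: int,
                    n_components: int = 10,
                    seed: int = 0) -> np.ndarray:
    """Assign each client to a cluster using DP-noised updates and loss values.

    client_updates: (n_clients, n_params) flattened, already-noised model updates
    client_losses:  (n_clients,)          reported training losses
    """
    # Compress the high-dimensional updates so the GMM stays well conditioned.
    n_comp = min(n_components, client_updates.shape[0], client_updates.shape[1])
    compressed = PCA(n_components=n_comp, random_state=seed).fit_transform(client_updates)
    features = np.concatenate([compressed, client_losses[:, None]], axis=1)
    # A soft GMM assignment tolerates DP noise better than hard k-means-style
    # clustering, since each cluster can absorb a different spread.
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="full", random_state=seed)
    return gmm.fit_predict(features)
```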
Neither Valid Nor Reliable? Investigating the Use of LLMs as Judges
Mohammed Haddou
Jackie CK Cheung