Golnoosh Farnadi

Doctorat - McGill

Co-superviseur⋅e :

Maîtrise recherche - UdeM

Superviseur⋅e principal⋅e :

Doctorat - McGill

Aly Kassem

Collaborateur·rice de recherche - UWindsor

Doctorat - McGill

Co-superviseur⋅e :

Siva Reddy

Google Scholar

Saber Malekmohammadi

Collaborateur·rice de recherche - McGill

Yash Raj More

Collaborateur·rice alumni - UdeM

Edie Pearman

Visiteur de recherche indépendant - McGill university

Ameline Ramesan

Collaborateur·rice de recherche - McGill

Github

Soumya Sharma

Doctorat - McGill

Co-superviseur⋅e :

Adriana Romero Soriano

Site web

Github

Google Scholar

Zhuan Shi

Postdoctorat - McGill

William St-Arnaud

Doctorat - UdeM

Co-superviseur⋅e :

Margarida Carvalho

Emma Tomiuk Tomiuk

Maîtrise recherche - McGill

Publications

Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

Aly M. Kassem

Thomas Jiralerspong

Negar Rostamzadeh

Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shar… (voir plus)ed dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines, while matching the Non-SAE-based. Our results demonstrate that crosscoders remain a powerful tool for model diffing.

2026-02-15

arXiv (prépublication)

Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs

As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts present… (voir plus)s unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.

2026-01-08

arXiv (prépublication)

Understanding the role of depth in the neural tangent kernel for overparameterized neural networks

William St-Arnaud

Margarida Carvalho

2025-10-31

arXiv (publié)

Fairness in Federated Learning: Fairness for Whom?

Afaf Taïk

Khaoula Chehbouni

Fairness in federated learning has emerged as a rapidly growing area of research, with numerous works proposing formal definitions and algor… (voir plus)ithmic interventions. Yet, despite this technical progress, fairness in FL is often defined and evaluated in ways that abstract away from the sociotechnical contexts in which these systems are deployed. In this paper, we argue that existing approaches tend to optimize narrow system level metrics, such as performance parity or contribution-based rewards, while overlooking how harms arise throughout the FL lifecycle and how they impact diverse stakeholders. We support this claim through a critical analysis of the literature, based on a systematic annotation of papers for their fairness definitions, design decisions, evaluation practices, and motivating use cases. Our analysis reveals five recurring pitfalls: 1) fairness framed solely through the lens of server client architecture, 2) a mismatch between simulations and motivating use-cases and contexts, 3) definitions that conflate protecting the system with protecting its users, 4) interventions that target isolated stages of the lifecycle while neglecting upstream and downstream effects, 5) and a lack of multi-stakeholder alignment where multiple fairness definitions can be relevant at once. Building on these insights, we propose a harm centered framework that links fairness definitions to concrete risks and stakeholder vulnerabilities. We conclude with recommendations for more holistic, context-aware, and accountable fairness research in FL.

2025-10-14

Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (publié)

Mitigating Disparate Impact of Differential Privacy in Federated Learning through Robust Clustering

Saber Malekmohammadi

Afaf Taïk

Federated Learning (FL) is a decentralized machine learning (ML) approach that keeps data localized and often incorporates Differential Priv… (voir plus)acy (DP) to enhance privacy guarantees. Similar to previous work on DP in ML, we observed that differentially private federated learning (DPFL) introduces performance disparities, particularly affecting minority groups. Recent work has attempted to address performance fairness in vanilla FL through clustering, but this method remains sensitive and prone to errors, which are further exacerbated by the DP noise in DPFL. To fill this gap, in this paper, we propose a novel clustered DPFL algorithm designed to effectively identify clients' clusters in highly heterogeneous settings while maintaining high accuracy with DP guarantees. To this end, we propose to cluster clients based on both their model updates and training loss values. Our proposed approach also addresses the server's uncertainties in clustering clients' model updates by employing larger batch sizes along with Gaussian Mixture Model (GMM) to alleviate the impact of noise and potential clustering errors, especially in privacy-sensitive scenarios. We provide theoretical analysis of the effectiveness of our proposed approach. We also extensively evaluate our approach across diverse data distributions and privacy budgets and show its effectiveness in mitigating the disparate impact of DP in FL settings with a small computational cost.

2025-10-06

TMLR (accepté)

Neither Valid Nor Reliable? Investigating the Use of LLMs as Judges

Khaoula Chehbouni

Mohammed Haddou

Jackie CK Cheung

2025-09-24

NeurIPS.cc/2025/Position_Paper_Track (accepté)

Localized-Attention-Guided Concept Erasure for Text-to-Image Diffusion Models

Zhuan Shi

Rik de Vries

2025-09-23

NeurIPS.cc/2025/Workshop/GenProCC (publié)

Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models

Mina Arzaghi

Florian Carichon

Large Language Models (LLMs) are increasingly deployed in sensitive domains such as finance, where intrinsic representational biases can pro… (voir plus)pagate into extrinsic harms in downstream tasks. High-stakes applications such as credit scoring are especially vulnerable, as biased model behavior can reinforce existing inequities and result in harmful disparities across demographic groups \cite{blodgett2020language}. While prior research has questioned whether intrinsic bias truly translates into extrinsic unfairness \cite{goldfarb2020intrinsic}, this connection remains poorly understood. To address this gap, we propose a four-stage evaluation framework that systematically examines the relationship between intrinsic and extrinsic fairness. In Stage 1, we establish a baseline by training models such as logistic regression, LLM embeddings, and fine-tuned classifiers without any mitigation strategy, providing reference points for fairness and accuracy. In Stage 2, we evaluate task-level mitigation through Counterfactual Data Augmentation (CDA) \cite{gallegos2024bias}, which balances gender representation by generating counterfactual training instances, allowing us to assess improvements in extrinsic fairness. In Stage 3, we adapt concept unlearning \cite{dige2024mitigating} as an intrinsic bias mitigation method, encouraging LLMs to forget socioeconomic stereotypes while preserving fluency and predictive utility, and we evaluate how this intervention impacts downstream fairness. Finally, in Stage 4, we combine CDA with unlearning to test whether dual mitigation further enhances fairness. We conduct experiments on three datasets (Adult Census Income, ACS Employment, and German Credit) using instruction-tuned LLMs (LLaMA-3.1, Phi-3, and Gemma-2) in both frozen embedding and fine-tuned classifier settings, evaluating performance with predictive accuracy and group fairness metrics, including Demographic Parity, Accuracy Parity, and Equality of Odds. Our experiments demonstrate that intrinsic bias mitigation through unlearning is highly effective; in Phi-3, for instance, it reduces gender socioeconomic stereotype gaps by 94.9\% while maintaining language fluency. In downstream tasks, unlearning consistently improves group fairness metrics while preserving predictive accuracy, whereas CDA primarily enhances demographic parity but can introduce accuracy trade-offs. For instance, on the ACS Employment dataset, unlearned Gemma-2 improved Accuracy Parity from 0.199 to 0.104 (48\% gain), and combining CDA with unlearning on Llama-3.1 reduced Demographic Parity from 0.080 to 0.014 (82\% gain). On the Adult dataset, all three models maintained accuracy above 0.82 while showing reduced fairness gaps, and on German Credit, unlearning consistently outperformed CDA by improving group fairness metrics without sacrificing predictive performance. Overall, CDA and unlearning exhibit complementary effects, with their combination yielding the strongest fairness improvements across models and datasets. This work contributes to bias mitigation and fairness in LLMs in two ways. First, we adapt concept unlearning to mitigate socioeconomic stereotyping, showing that intrinsic bias reduction improves both representational and downstream fairness. Second, we introduce a unified evaluation framework that links intrinsic and extrinsic fairness, enabling systematic comparison of mitigation strategies. The framework is flexible, applying to both fine-tuned and frozen LLMs, and offers actionable guidance for deploying fairer models in finance and other high-stakes domains.

2025-09-21

NeurIPS.cc/2025/Workshop/WiML (publié)