Portrait of Golnoosh Farnadi

Golnoosh Farnadi

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, McGill University, School of Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Visiting Researcher, Google
Research Topics
Deep Learning
Generative Models

Biography

Golnoosh Farnadi is an Assistant Professor at the School of Computer Science, McGill University, and an Adjunct Professor at Université de Montréal. She is a Core Academic Member at Mila - Quebec Artificial Intelligence Institute and holds a Canada CIFAR AI Chair.

Farnadi founded the EQUAL lab at Mila / McGill University, where she is one of the principal investigators. The EQUAL lab (EQuity & EQuality Using AI and Learning algorithms) is a cutting-edge research laboratory dedicated to advancing the fields of algorithmic fairness and responsible AI.

Current Students

PhD - HEC
Postdoctorate - McGill
PhD - McGill
Co-supervisor:
PhD - McGill
Co-supervisor:
Master's Research - McGill
Co-supervisor:
Master's Research - UdeM
Principal supervisor:
Research Intern - McGill
Research Collaborator - UWindsor
PhD - McGill
Co-supervisor:
Research Collaborator - McGill
Collaborating Alumni - UdeM
Research Collaborator - McGill
Research Intern - McGill
Independent Visiting Researcher - McGill University
Research Intern - McGill
Research Collaborator - McGill
PhD - McGill
Co-supervisor:
Postdoctorate - McGill
PhD - UdeM
Co-supervisor:
Master's Research - McGill

Publications

Differentially Private Clustered Federated Learning
Federated learning (FL), which is a decentralized machine learning (ML) approach, often incorporates differential privacy (DP) to provide rigorous data privacy guarantees. Previous works attempted to address high structured data heterogeneity in vanilla FL settings through clustering clients (a.k.a. clustered FL), but these methods remain sensitive and prone to errors, further exacerbated by the DP noise. This vulnerability makes the previous methods inappropriate for differentially private FL (DPFL) settings with structured data heterogeneity. To address this gap, we propose an algorithm for differentially private clustered FL, which is robust to the DP noise in the system and identifies the underlying clients' clusters correctly. To this end, we propose to cluster clients based on both their model updates and training loss values. Furthermore, for clustering clients' model updates at the end of the first round, our proposed approach addresses the server's uncertainties by employing large batch sizes as well as Gaussian Mixture Models (GMM) to reduce the impact of DP and stochastic noise and avoid potential clustering errors. This idea is especially efficient in privacy-sensitive scenarios with more DP noise. We provide theoretical analysis to justify our approach and evaluate it across diverse data distributions and privacy budgets. Our experimental results show its effectiveness in addressing large structured data heterogeneity in DPFL.
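As an illustration of the first-round clustering step described in this abstract, the following is a minimal sketch (not the paper's implementation) that fits a Gaussian Mixture Model to synthetic, DP-noised client updates; all values, dimensions and the number of clusters are hypothetical.

```python
# Minimal sketch (not the paper's implementation): clustering noisy
# first-round client updates with a Gaussian Mixture Model, as the
# abstract describes. Client updates, DP noise scale and cluster count
# are synthetic and chosen only for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "client updates": two underlying data distributions (clusters),
# each client's flattened model update perturbed by Gaussian DP noise.
n_clients, dim, dp_noise_std = 20, 10, 0.3
true_centers = rng.normal(size=(2, dim))
true_labels = rng.integers(0, 2, size=n_clients)
updates = true_centers[true_labels] + dp_noise_std * rng.normal(size=(n_clients, dim))

# Cluster the clients' updates; the soft GMM assignments absorb some of the
# DP and stochastic noise better than hard nearest-center assignments.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
pred = gmm.fit_predict(updates)

# Agreement with the ground-truth clusters (up to label permutation).
acc = max(np.mean(pred == true_labels), np.mean(pred != true_labels))
print(f"clustering agreement: {acc:.2f}")
```

With moderate noise the GMM typically recovers the underlying client groups; per the abstract, the proposed algorithm additionally relies on large batch sizes and on training loss values to stabilize this step under stronger DP noise.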
Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models
Large Language Models (LLMs) are increasingly deployed in sensitive domains such as finance, where intrinsic representational biases can propagate into extrinsic harms in downstream tasks. High-stakes applications such as credit scoring are especially vulnerable, as biased model behavior can reinforce existing inequities and result in harmful disparities across demographic groups \cite{blodgett2020language}. While prior research has questioned whether intrinsic bias truly translates into extrinsic unfairness \cite{goldfarb2020intrinsic}, this connection remains poorly understood. To address this gap, we propose a four-stage evaluation framework that systematically examines the relationship between intrinsic and extrinsic fairness. In Stage 1, we establish a baseline by training models such as logistic regression, LLM embeddings, and fine-tuned classifiers without any mitigation strategy, providing reference points for fairness and accuracy. In Stage 2, we evaluate task-level mitigation through Counterfactual Data Augmentation (CDA) \cite{gallegos2024bias}, which balances gender representation by generating counterfactual training instances, allowing us to assess improvements in extrinsic fairness. In Stage 3, we adapt concept unlearning \cite{dige2024mitigating} as an intrinsic bias mitigation method, encouraging LLMs to forget socioeconomic stereotypes while preserving fluency and predictive utility, and we evaluate how this intervention impacts downstream fairness. Finally, in Stage 4, we combine CDA with unlearning to test whether dual mitigation further enhances fairness. We conduct experiments on three datasets (Adult Census Income, ACS Employment, and German Credit) using instruction-tuned LLMs (LLaMA-3.1, Phi-3, and Gemma-2) in both frozen embedding and fine-tuned classifier settings, evaluating performance with predictive accuracy and group fairness metrics, including Demographic Parity, Accuracy Parity, and Equality of Odds. Our experiments demonstrate that intrinsic bias mitigation through unlearning is highly effective; in Phi-3, for instance, it reduces gender socioeconomic stereotype gaps by 94.9% while maintaining language fluency. In downstream tasks, unlearning consistently improves group fairness metrics while preserving predictive accuracy, whereas CDA primarily enhances demographic parity but can introduce accuracy trade-offs. For instance, on the ACS Employment dataset, unlearned Gemma-2 improved Accuracy Parity from 0.199 to 0.104 (48% gain), and combining CDA with unlearning on Llama-3.1 reduced Demographic Parity from 0.080 to 0.014 (82% gain). On the Adult dataset, all three models maintained accuracy above 0.82 while showing reduced fairness gaps, and on German Credit, unlearning consistently outperformed CDA by improving group fairness metrics without sacrificing predictive performance. Overall, CDA and unlearning exhibit complementary effects, with their combination yielding the strongest fairness improvements across models and datasets. This work contributes to bias mitigation and fairness in LLMs in two ways. First, we adapt concept unlearning to mitigate socioeconomic stereotyping, showing that intrinsic bias reduction improves both representational and downstream fairness. Second, we introduce a unified evaluation framework that links intrinsic and extrinsic fairness, enabling systematic comparison of mitigation strategies. The framework is flexible, applying to both fine-tuned and frozen LLMs, and offers actionable guidance for deploying fairer models in finance and other high-stakes domains.
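To make the reported group fairness metrics concrete, here is a minimal sketch (not the paper's code) of the Demographic Parity, Accuracy Parity and Equality of Odds gaps for binary predictions and a binary sensitive attribute; the arrays and the helper name fairness_gaps are purely illustrative.

```python
# Minimal sketch (not the paper's code): the group fairness gaps the
# abstract reports, computed from binary predictions y_pred, labels y_true,
# and a binary sensitive attribute s (e.g. gender). Inputs are toy arrays.
import numpy as np

def fairness_gaps(y_true, y_pred, s):
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    a, b = (s == 0), (s == 1)

    # Demographic parity gap: difference in positive prediction rates.
    dp = abs(y_pred[a].mean() - y_pred[b].mean())

    # Accuracy parity gap: difference in per-group accuracy.
    ap = abs((y_pred[a] == y_true[a]).mean() - (y_pred[b] == y_true[b]).mean())

    # Equality of odds gap: largest difference in TPR or FPR across groups.
    def rate(grp, label):
        mask = grp & (y_true == label)
        return y_pred[mask].mean() if mask.any() else 0.0
    eo = max(abs(rate(a, 1) - rate(b, 1)), abs(rate(a, 0) - rate(b, 0)))

    return {"demographic_parity": dp, "accuracy_parity": ap, "equalized_odds": eo}

# Toy usage with six examples split across two groups.
print(fairness_gaps(y_true=[1, 0, 1, 0, 1, 0],
                    y_pred=[1, 0, 1, 1, 0, 0],
                    s=[0, 0, 0, 1, 1, 1]))
```

Smaller gap values indicate more equal treatment of the two groups, which is the sense in which the abstract reports improvements such as Accuracy Parity dropping from 0.199 to 0.104.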
Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework
Cléa Chataigner
Rebecca Ma
Elliot Creager
Towards Democratizing LLMs: Investigating Multilingual Mixture-of-Experts Models
Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets
Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including combinatorial optimization. Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLMs both powerful and accessible tools for a wide range of users, including non-experts. However, applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. To address this gap, we introduce a novel benchmark of 369 instances of the College Admission Problem, a canonical example of a matching problem with preferences, to evaluate LLMs across key dimensions: feasibility, stability, and optimality. We employ this benchmark to assess the performance of several open-weight LLMs. Our results first reveal that while LLMs can satisfy certain constraints, they struggle to meet all evaluation criteria consistently. They also show that reasoning LLMs, like QwQ and GPT-oss, significantly outperform traditional models such as Llama, Qwen or Mistral, defined here as models used without any dedicated reasoning mechanisms. Moreover, we observe that LLMs react differently to the various prompting strategies tested, which include Chain-of-Thought, In-Context Learning and role-based prompting, with no prompt consistently offering the best performance. Finally, we report the performance of iterative prompting with auto-generated feedback and show that it is not monotonic; it can peak early and then decline significantly in later attempts. Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints.
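To illustrate the feasibility and stability criteria used in this benchmark, below is a minimal sketch (not the benchmark's code) that checks a toy College Admission instance for capacity feasibility and blocking pairs; the preference lists, capacities and matching are invented for illustration.

```python
# Minimal sketch (not the benchmark's code): checking feasibility and
# stability of a student-to-college assignment in a toy College Admission
# instance. All preference lists, capacities and the matching are made up.
students_prefs = {"s1": ["c1", "c2"], "s2": ["c1", "c2"], "s3": ["c2", "c1"]}
colleges_prefs = {"c1": ["s1", "s3", "s2"], "c2": ["s2", "s1", "s3"]}
capacities = {"c1": 1, "c2": 2}
matching = {"s1": "c1", "s2": "c2", "s3": "c2"}  # student -> college

def is_feasible(matching):
    # Each college must respect its capacity and only admit acceptable students.
    for c, cap in capacities.items():
        assigned = [s for s, m in matching.items() if m == c]
        if len(assigned) > cap or any(s not in colleges_prefs[c] for s in assigned):
            return False
    return True

def is_stable(matching):
    # No student-college pair may prefer each other over their assignment.
    for s, prefs in students_prefs.items():
        for c in prefs:
            if c == matching.get(s):
                break  # s cannot improve beyond its current college
            assigned = [x for x, m in matching.items() if m == c]
            rank = colleges_prefs[c].index
            under_capacity = len(assigned) < capacities[c]
            prefers_s = any(rank(s) < rank(x) for x in assigned)
            if under_capacity or prefers_s:
                return False  # (s, c) is a blocking pair
    return True

print(is_feasible(matching), is_stable(matching))
```

In the benchmark's terms, an LLM-produced assignment would pass the feasibility check if capacities and acceptability are respected, and the stability check if no blocking pair exists; optimality additionally compares the matching against the best achievable one.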
Adaptation, Comparison and Practical Implementation of Fairness Schemes in Kidney Exchange Programs
In Kidney Exchange Programs (KEPs), each participating patient is registered together with an incompatible donor. Donors without an incompatible patient can also register. KEPs then typically maximize overall patient benefit through donor exchanges. This aggregation of benefits raises questions about potential disparities among individual patients in terms of access to transplantation in KEPs. Considering solely this utilitarian objective may become an issue when multiple exchange plans are optimal or near-optimal. In fact, current KEP policies are all-or-nothing, meaning that only one exchange plan is determined. Each patient is either selected or not as part of that unique solution. In this work, we instead seek a policy that considers the probability of each patient being part of a solution. To guide the determination of our policy, we adapt popular fairness schemes to KEPs to balance the usual approach of maximizing the utilitarian objective. Different combinations of fairness and utilitarian objectives are modelled as conic programs with an exponential number of variables. We propose a column generation approach to solve them effectively in practice. Finally, we extensively compare the different schemes in terms of the balance between utility and fairness scores, and validate the scalability of our methodology on benchmark instances from the literature.
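As a toy illustration of the motivation above (not the paper's fairness schemes or its column generation approach), the following sketch contrasts an all-or-nothing policy with a simple uniform lottery over equally good exchange plans, showing how a randomized policy assigns every patient a probability of being part of a solution; the plans and patients are invented.

```python
# Minimal sketch (not the paper's method): contrasting an all-or-nothing
# policy with a randomized policy over optimal exchange plans, to expose
# per-patient selection probabilities. Plans and patients are illustrative.
from collections import Counter

patients = ["p1", "p2", "p3", "p4"]
# Three exchange plans, each transplanting 3 patients (same utilitarian value).
optimal_plans = [{"p1", "p2", "p3"}, {"p1", "p2", "p4"}, {"p1", "p3", "p4"}]

# All-or-nothing: one plan is picked; left-out patients get probability 0.
chosen = optimal_plans[0]
all_or_nothing = {p: float(p in chosen) for p in patients}

# Randomized policy: a uniform lottery over the optimal plans gives every
# patient a positive chance of being part of the selected solution.
counts = Counter(p for plan in optimal_plans for p in plan)
lottery = {p: counts[p] / len(optimal_plans) for p in patients}

print("all-or-nothing:", all_or_nothing)
print("uniform lottery:", lottery)
```

The uniform lottery here is only a stand-in: the paper balances the utilitarian objective against adapted fairness schemes and solves the resulting conic programs by column generation, rather than averaging over enumerated plans.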
Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, LLMs as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation.
Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training