Florian Carichon

Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models

Large Language Models (LLMs) are increasingly deployed in sensitive domains such as finance, where intrinsic representational biases can pro… (see more)pagate into extrinsic harms in downstream tasks. High-stakes applications such as credit scoring are especially vulnerable, as biased model behavior can reinforce existing inequities and result in harmful disparities across demographic groups \cite{blodgett2020language}. While prior research has questioned whether intrinsic bias truly translates into extrinsic unfairness \cite{goldfarb2020intrinsic}, this connection remains poorly understood. To address this gap, we propose a four-stage evaluation framework that systematically examines the relationship between intrinsic and extrinsic fairness. In Stage 1, we establish a baseline by training models such as logistic regression, LLM embeddings, and fine-tuned classifiers without any mitigation strategy, providing reference points for fairness and accuracy. In Stage 2, we evaluate task-level mitigation through Counterfactual Data Augmentation (CDA) \cite{gallegos2024bias}, which balances gender representation by generating counterfactual training instances, allowing us to assess improvements in extrinsic fairness. In Stage 3, we adapt concept unlearning \cite{dige2024mitigating} as an intrinsic bias mitigation method, encouraging LLMs to forget socioeconomic stereotypes while preserving fluency and predictive utility, and we evaluate how this intervention impacts downstream fairness. Finally, in Stage 4, we combine CDA with unlearning to test whether dual mitigation further enhances fairness. We conduct experiments on three datasets (Adult Census Income, ACS Employment, and German Credit) using instruction-tuned LLMs (LLaMA-3.1, Phi-3, and Gemma-2) in both frozen embedding and fine-tuned classifier settings, evaluating performance with predictive accuracy and group fairness metrics, including Demographic Parity, Accuracy Parity, and Equality of Odds. Our experiments demonstrate that intrinsic bias mitigation through unlearning is highly effective; in Phi-3, for instance, it reduces gender socioeconomic stereotype gaps by 94.9\% while maintaining language fluency. In downstream tasks, unlearning consistently improves group fairness metrics while preserving predictive accuracy, whereas CDA primarily enhances demographic parity but can introduce accuracy trade-offs. For instance, on the ACS Employment dataset, unlearned Gemma-2 improved Accuracy Parity from 0.199 to 0.104 (48\% gain), and combining CDA with unlearning on Llama-3.1 reduced Demographic Parity from 0.080 to 0.014 (82\% gain). On the Adult dataset, all three models maintained accuracy above 0.82 while showing reduced fairness gaps, and on German Credit, unlearning consistently outperformed CDA by improving group fairness metrics without sacrificing predictive performance. Overall, CDA and unlearning exhibit complementary effects, with their combination yielding the strongest fairness improvements across models and datasets. This work contributes to bias mitigation and fairness in LLMs in two ways. First, we adapt concept unlearning to mitigate socioeconomic stereotyping, showing that intrinsic bias reduction improves both representational and downstream fairness. Second, we introduce a unified evaluation framework that links intrinsic and extrinsic fairness, enabling systematic comparison of mitigation strategies. The framework is flexible, applying to both fine-tuned and frozen LLMs, and offers actionable guidance for deploying fairer models in finance and other high-stakes domains.

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (published)

openreview.net

Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (published)

doi.org

openreview.net

Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets

Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including… (see more) combinatorial optimization. Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLMs both powerful and accessible tools for a wide range of users, including non-experts. However, applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. To address this gap, we introduce a novel benchmark of 369 instances of the College Admission Problem, a canonical example of a matching problem with preferences, to evaluate LLMs across key dimensions: feasibility, stability, and optimality. We employ this benchmark to assess the performance of several open-weight LLMs. Our results first reveal that while LLMs can satisfy certain constraints, they struggle to meet all evaluation criteria consistently. They also show that reasoning LLMs, like QwQ and GPT-oss, significantly outperform traditional models such as Llama, Qwen or Mistral, defined here as models used without any dedicated reasoning mechanisms. Moreover, we observed that LLMs reacted differently to the various prompting strategies tested, which include Chain-of-Thought, In-Context Learning and role-based prompting, with no prompt consistently offering the best performance. Finally, we report the performances from iterative prompting with auto-generated feedback and show that they are not monotonic; they can peak early and then significantly decline in later attempts. Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints.

2025-09-16

ArXiv (preprint)

doi.org

arxiv.org

Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets

Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including… (see more) combinatorial optimization. Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLMs both powerful and accessible tools for a wide range of users, including non-experts. However, applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. To address this gap, we introduce a novel benchmark of 369 instances of the College Admission Problem, a canonical example of a matching problem with preferences, to evaluate LLMs across key dimensions: feasibility, stability, and optimality. We employ this benchmark to assess the performance of several open-weight LLMs. Our results first reveal that while LLMs can satisfy certain constraints, they struggle to meet all evaluation criteria consistently. They also show that reasoning LLMs, like QwQ and GPT-oss, significantly outperform traditional models such as Llama, Qwen or Mistral, defined here as models used without any dedicated reasoning mechanisms. Moreover, we observed that LLMs reacted differently to the various prompting strategies tested, which include Chain-of-Thought, In-Context Learning and role-based prompting, with no prompt consistently offering the best performance. Finally, we report the performances from iterative prompting with auto-generated feedback and show that they are not monotonic; they can peak early and then significantly decline in later attempts. Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints.

2025-09-01

arXiv (published)

doi.org

Embedding Cultural Diversity in Prototype-based Recommender Systems

Popularity bias in recommender systems can increase cultural overrepresentation by favoring norms from dominant cultures and marginalizing u… (see more)nderrepresented groups. This issue is critical for platforms offering cultural products, as they influence consumption patterns and human perceptions. In this work, we address popularity bias by identifying demographic biases within prototype-based matrix factorization methods. Using the country of origin as a proxy for cultural identity, we link this demographic attribute to popularity bias by refining the embedding space learning process. First, we propose filtering out irrelevant prototypes to improve representativity. Second, we introduce a regularization technique to enforce a uniform distribution of prototypes within the embedding space. Across four datasets, our results demonstrate a 27\% reduction in the average rank of long-tail items and a 2\% reduction in the average rank of items from underrepresented countries. Additionally, our model achieves a 2\% improvement in HitRatio@10 compared to the state-of-the-art, highlighting that fairness is enhanced without compromising recommendation quality. Moreover, the distribution of prototypes leads to more inclusive explanations by better aligning items with diverse prototypes.

2025-01-01

European Conference on Information Retrieval (published)

doi.org

arxiv.org

Embedding Cultural Diversity in Prototype-based Recommender Systems

Popularity bias in recommender systems can increase cultural overrepresentation by favoring norms from dominant cultures and marginalizing u… (see more)nderrepresented groups. This issue is critical for platforms offering cultural products, as they influence consumption patterns and human perceptions. In this work, we address popularity bias by identifying demographic biases within prototype-based matrix factorization methods. Using the country of origin as a proxy for cultural identity, we link this demographic attribute to popularity bias by refining the embedding space learning process. First, we propose filtering out irrelevant prototypes to improve representativity. Second, we introduce a regularization technique to enforce a uniform distribution of prototypes within the embedding space. Across four datasets, our results demonstrate a 27\% reduction in the average rank of long-tail items and a 2\% reduction in the average rank of items from underrepresented countries. Additionally, our model achieves a 2\% improvement in HitRatio@10 compared to the state-of-the-art, highlighting that fairness is enhanced without compromising recommendation quality. Moreover, the distribution of prototypes leads to more inclusive explanations by better aligning items with diverse prototypes.

2024-12-18

ArXiv (preprint)

doi.org

arxiv.org

Embedding Cultural Diversity in Prototype-based Recommender Systems

Popularity bias in recommender systems can increase cultural overrepresentation by favoring norms from dominant cultures and marginalizing u… (see more)nderrepresented groups. This issue is critical for platforms offering cultural products, as they influence consumption patterns and human perceptions. In this work, we address popularity bias by identifying demographic biases within prototype-based matrix factorization methods. Using the country of origin as a proxy for cultural identity, we link this demographic attribute to popularity bias by refining the embedding space learning process. First, we propose filtering out irrelevant prototypes to improve representativity. Second, we introduce a regularization technique to enforce a uniform distribution of prototypes within the embedding space. Across four datasets, our results demonstrate a 27\% reduction in the average rank of long-tail items and a 2\% reduction in the average rank of items from underrepresented countries. Additionally, our model achieves a 2\% improvement in HitRatio@10 compared to the state-of-the-art, highlighting that fairness is enhanced without compromising recommendation quality. Moreover, the distribution of prototypes leads to more inclusive explanations by better aligning items with diverse prototypes.

2024-12-18

ArXiv (preprint)

arxiv.org

Understanding Intrinsic Socioeconomic Biases in Large Language Models

Mina Arzaghi

Florian Carichon

Golnoosh Farnadi

Large Language Models (LLMs) are increasingly integrated into critical decision-making processes, such as loan approvals and visa applicatio… (see more)ns, where inherent biases can lead to discriminatory outcomes. In this paper, we examine the nuanced relationship between demographic attributes and socioeconomic biases in LLMs, a crucial yet understudied area of fairness in LLMs. We introduce a novel dataset of one million English sentences to systematically quantify socioeconomic biases across various demographic groups. Our findings reveal pervasive socioeconomic biases in both established models such as GPT-2 and state-of-the-art models like Llama 2 and Falcon. We demonstrate that these biases are significantly amplified when considering intersectionality, with LLMs exhibiting a remarkable capacity to extract multiple demographic attributes from names and then correlate them with specific socioeconomic biases. This research highlights the urgent necessity for proactive and robust bias mitigation techniques to safeguard against discriminatory outcomes when deploying these powerful models in critical real-world applications.

2024-01-01

AIES (1) (published)

doi.org

arxiv.org

Opening Conference | Building Safer AI for Youth Mental Health

TRAIL: Responsible AI for Professionals and Leaders

Mila Ventures Founder in Residence

Indigenous Pathfinders in AI

Florian Carichon

Publications

Opening Conference | Building Safer AI for Youth Mental Health

TRAIL: Responsible AI for Professionals and Leaders

Mila Ventures Founder in Residence

Indigenous Pathfinders in AI

Popular keywords:

Florian Carichon

Publications