Negar Rostamzadeh

Website

Github

Publications

Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

Aly M. Kassem

Thomas Jiralerspong

Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shar… (see more)ed dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines, while matching the Non-SAE-based. Our results demonstrate that crosscoders remain a powerful tool for model diffing.

2026-02-15

arXiv (preprint)

Alireza Dehghanpour Farashah

Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs

As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts present… (see more)s unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.

2026-01-08

arXiv (preprint)

Measuring What Matters: Connecting AI Ethics Evaluations to System Attributes, Hazards, and Harms

Renee Shelby

Over the past decade, an ecosystem of measures has emerged to evaluate the social and ethical implications of AI systems, largely shaped by … (see more)high-level ethics principles. These measures are developed and used in fragmented ways, without adequate attention to how they are situated in AI systems. In this paper, we examine how existing measures used in the computing literature map to AI system components, attributes, hazards, and harms. Our analysis draws on a scoping review resulting in nearly 800 measures corresponding to 11 AI ethics principles. We find that most measures focus on four principles – fairness, transparency, privacy, and trust – and primarily assess model or output system components. Few measures account for interactions across system elements, and only a narrow set of hazards is typically considered for each harm type. Many measures are disconnected from where harm is experienced and lack guidance for setting meaningful thresholds. These patterns reveal how current evaluation practices remain fragmented, measuring in pieces rather than capturing how harms emerge across systems. Framing measures with respect to system attributes, hazards, and harms can strengthen regulatory oversight, support actionable practices in industry, and ground future research in systems-level understanding.

2025-10-14

Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (published)

Bias-inducing geometries: an exactly solvable data model with fairness implications

Stefano Sarao Mannelli

Federica Gerace

Luca Saglietti

2025-08-06

Physical Review E (published)

Alireza Dehghanpour Farashah

UNLEARNING GEO-CULTURAL STEREOTYPES IN MULTILINGUAL LLMS

Aditi Khandelwal

As multilingual generative models become more widely used, most safety and fairness evaluation techniques still focus on English-language re… (see more)sources, while overlooking important cross-cultural factors. This limitation raises concerns about fairness and safety, particularly regarding geoculturally situated stereotypes that hinder the models’ global inclusivity. In this work, we present preliminary findings on the impact of stereotype unlearning across languages, specifically in English, French, and Hindi. Using an adapted version of the SeeGULL dataset, we analyze how unlearning stereotypes in one language influences other languages within multilingual large language models. Our study evaluates two model families, Llama-3.1-8B and Aya-Expanse-8B, to assess whether unlearning in one linguistic context transfers across languages, potentially mitigating or exacerbating biases in multilingual settings.

2025-03-04

ICLR.cc/2025/Workshop/BuildingTrust (accepted)

Nteasee: A mixed methods study of expert and general population perspectives on deploying AI for health in African countries

Mercy Nyamewaa Asiedu

Iskandar Haykel

Awa Dieng

K. Kauer

Tousif Ahmed

Florence Ofori

Charisma Chan

Stephen R. Pfohl

Katherine Heller

2024-12-31

FAccT (published)

The Case for Globalizing Fairness: A Mixed Methods Study on Colonialism, AI, and Health in Africa

Mercy Nyamewaa Asiedu

Awa Dieng

Iskandar Haykel

Stephen R. Pfohl

Chirag Nagpal

Maria Nagawa

Abigail Oppong

Sanmi Koyejo

Katherine Heller

With growing application of machine learning (ML) technologies in healthcare, there have been calls for developing techniques to understand … (see more)and mitigate biases these systems may exhibit. Fair-ness considerations in the development of ML-based solutions for health have particular implications for Africa, which already faces inequitable power imbalances between the Global North and South.This paper seeks to explore fairness for global health, with Africa as a case study. We conduct a scoping review to propose axes of disparities for fairness consideration in the African context and delineate where they may come into play in different ML-enabled medical modalities. We then conduct qualitative research studies with 672 general population study participants and 28 experts inML, health, and policy focused on Africa to obtain corroborative evidence on the proposed axes of disparities. Our analysis focuses on colonialism as the attribute of interest and examines the interplay between artificial intelligence (AI), health, and colonialism. Among the pre-identified attributes, we found that colonial history, country of origin, and national income level were specific axes of disparities that participants believed would cause an AI system to be biased.However, there was also divergence of opinion between experts and general population participants. Whereas experts generally expressed a shared view about the relevance of colonial history for the development and implementation of AI technologies in Africa, the majority of the general population participants surveyed did not think there was a direct link between AI and colonialism. Based on these findings, we provide practical recommendations for developing fairness-aware ML solutions for health in Africa.

2024-10-28

Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (published)

What Secrets Do Your Manifolds Hold? Understanding the Local Geometry of Generative Models

Ahmed Imtiaz Humayun

Candice Schumann

2024-08-14

ArXiv (preprint)

Position: Cracking the Code of Cascading Disparity Towards Marginalized Communities

Mohammad Havaei

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

Bias-inducing geometries: exactly solvable data model with fairness implications

Stefano Sarao Mannelli

Federica Gerace

Luca Saglietti

Machine learning (ML) may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group represen… (see more)tation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. In this abstract, we aim to clarify the role played by data geometry in the emergence of ML bias. We introduce an exactly solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical properties of learning models trained in this synthetic framework and obtain exact predictions for the observables that are commonly employed for fairness assessment. Simplifying the nature of the problem to its minimal components, we can retrace and unpack typical unfairness behaviour observed on real-world datasets

2024-06-16

ICML.cc/2024/Workshop/GRaM (published)

On The Local Geometry of Deep Generative Manifolds

Ahmed Imtiaz Humayun

Candice Schumann

In this paper, we study theoretically inspired local geometric descriptors of the data manifolds approximated by pre-trained generative mode… (see more)ls. The descriptors – local scaling (ψ), local rank (ν), and local complexity (δ) — characterize the uncertainty, dimensionality, and smoothness on the learned manifold, using only the network weights and architecture. We investigate and emphasize their critical role in understanding generative models. Our analysis reveals that the local geometry is intricately linked to the quality and diversity of generated outputs. Additionally, we see that the geometric properties are distinct for out-of-distribution (OOD) inputs as well as for prompts memorized by Stable Diffusion, showing the possible application of our proposed descriptors for downstream detection and assessment of pre-trained generative models.

2024-06-16

ICML.cc/2024/Workshop/GRaM (published)

The value of standards for health datasets in artificial intelligence-based applications

Anmol Arora

Joseph E. Alderman

Joanne Palmer

Shaswath Ganapathi

Elinor Laws

Melissa D. McCradden

Lauren Oakden-Rayner

Stephen R. Pfohl

Marzyeh Ghassemi

Francis McKay

Darren Treanor

Bilal Mateen

Jacqui Gath

Adewole O. Adebajo

Stephanie Kuku

Rubeta Matin

Katherine Heller

Elizabeth Sapey

Neil J. Sebire … (see 4 more)

Heather Cole-Lewis

Melanie Calvert

Alastair Denniston

Xiaoxuan Liu

2023-10-25

Nature Medicine (published)