Gintare Karolina Dziugaite

Unlearning in- vs. out-of-distribution data in LLMs under gradient-based methods

Teodora Baluta

Pascal Lamblin

Daniel Tarlow

Fabian Pedregosa

Machine unlearning aims to solve the problem of removing the influence of selected training examples from a learned model. Despite the increasing attention to this problem, it remains an open research question how to evaluate unlearning in large language models (LLMs), and which critical properties of the data to be unlearned affect the quality and efficiency of unlearning. This work formalizes a metric to evaluate unlearning quality in generative models, and uses it to assess the trade-offs between unlearning quality and performance. We demonstrate that unlearning out-of-distribution examples requires more unlearning steps but presents a better trade-off overall. For in-distribution examples, however, we observe a rapid decay in performance as unlearning progresses. We further evaluate how an example's memorization and difficulty affect unlearning under a classical gradient ascent-based approach.

2024-10-12

NeurIPS.cc/2024/Workshop/SafeGenAi (poster)

Evaluating Interventional Reasoning Capabilities of Large Language Models

Tejas Kasetty

Divyat Mahajan

Alexandre Drouin

Dhanya Sridhar

Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from an intervention from their ability to memorize facts or find other shortcuts. Our analysis on four LLMs highlights that while GPT-4 models show promising accuracy at predicting the intervention effects, they remain sensitive to distracting factors in the prompts.

2024-10-10

NeurIPS.cc/2024/Workshop/CALM (poster)

Linear Weight Interpolation Leads to Transient Performance Gains

Gaurav Iyer

David Rolnick

2024-09-28

TMLR (accepted)

Robust Knowledge Unlearning via Mechanistic Localizations

Phillip Huang Guo

Aaquib Syed

Abhay Sheshadri

Aidan Ewart

2024-06-28

ICML.cc/2024/Workshop/NextGenAISafety (poster)

Mixture of Experts in a Mixture of RL settings

Timon Willi

Johan Samir Obando Ceron

Jakob Nicolaus Foerster

Pablo Samuel Castro

Mixtures of Experts (MoEs) have gained prominence in (self-)supervised learning due to their enhanced inference efficiency, adaptability to distributed training, and modularity. Previous research has illustrated that MoEs can significantly boost Deep Reinforcement Learning (DRL) performance by expanding the network's parameter count while reducing dormant neurons, thereby enhancing the model's learning capacity and ability to deal with non-stationarity. In this work, we shed more light on MoEs' ability to deal with non-stationarity and investigate MoEs in DRL settings with "amplified" non-stationarity via multi-task training, providing further evidence that MoEs improve learning capacity. In contrast to previous work, our multi-task results allow us to better understand the underlying causes for the beneficial effect of MoE in DRL training, the impact of the various MoE components, and how best to incorporate them in actor-critic-based DRL networks. Finally, we also confirm results from previous work.

2024-06-26

ArXiv (preprint)

Robust Unlearning via Mechanistic Localizations

Phillip Huang Guo

Aaquib Syed

Abhay Sheshadri

Aidan Ewart

Methods for machine unlearning in large language models seek to remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates the use of mechanistic interpretability to improve the precision and effectiveness of unlearning. We demonstrate that localizing unlearning to components with particular mechanisms in factual recall leads to more robust unlearning across different input/output formats, relearning, and latent knowledge, and reduces unintended side effects compared to nonlocalized unlearning. Additionally, we analyze the strengths and weaknesses of different automated (rather than manual) interpretability methods for guiding unlearning, finding that their corresponding unlearned models require smaller edit sizes to achieve unlearning but are much less robust.

2024-06-24

ICML.cc/2024/Workshop/MI (spotlight)

Linear Weight Interpolation Leads to Transient Performance Gains

Gaurav Iyer

David Rolnick

2024-06-16

ICML.cc/2024/Workshop/HiLD (poster)

Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition

Eleni Triantafillou

Peter Kairouz

Fabian Pedregosa

Jamie Hayes

Meghdad Kurmanji

Kairan Zhao

Vincent Dumoulin

Julio C. S. Jacques Junior

Ioannis Mitliagkas

Jun Wan

Lisheng Sun-Hosoya

Sergio Escalera

Peter Triantafillou

Isabelle Guyon

We present the findings of the first NeurIPS competition on unlearning, which sought to stimulate the development of novel algorithms and initiate discussions on formal and robust evaluation methodologies. The competition was highly successful: nearly 1,200 teams from across the world participated, and a wealth of novel, imaginative solutions with different characteristics were contributed. In this paper, we analyze top solutions and delve into discussions on benchmarking unlearning, which itself is a research problem. The evaluation methodology we developed for the competition measures forgetting quality according to a formal notion of unlearning, while incorporating model utility for a holistic evaluation. We analyze the effectiveness of different instantiations of this evaluation framework vis-a-vis the associated compute cost, and discuss implications for standardizing evaluation. We find that the ranking of leading methods remains stable under several variations of this framework, pointing to avenues for reducing the cost of evaluation. Overall, our findings indicate progress in unlearning, with top-performing competition entries surpassing existing algorithms under our evaluation framework. We analyze trade-offs made by different algorithms and strengths or weaknesses in terms of generalizability to new datasets, paving the way for advancing both benchmarking and algorithm development in this important area.

2024-06-13

ArXiv (preprint)

Data Selection for Transfer Unlearning

Nazanin Mohammadi Sepahvand

Vincent Dumoulin

Eleni Triantafillou

2024-05-16

ArXiv (preprint)