David Scott Krueger

Andrew Paverd

Ahmed Salem

Shruti Tople

Lukas Wutschitz

Menglin Xia

Santiago Zanella-Beguelin

Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy p… (voir plus)roblems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. Assuming each piece of information comes with an additional meta-label (such as low/high integrity labels), one promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the input labels that were \emph{influential} in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a

2025-10-11

TMLR (accepté)

Permissive Information-Flow Analysis for Large Language Models

Shoaib Ahmed Siddiqui

Radhika Gaonkar

Boris Köpf

Andrew Paverd

Ahmed Salem

Shruti Tople

Lukas Wutschitz

Menglin Xia

Santiago Zanella-Beguelin

Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy p… (voir plus)roblems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. Assuming each piece of information comes with an additional meta-label (such as low/high integrity labels), one promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the input labels that were \emph{influential} in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a

2025-10-11

TMLR (accepté)

PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data

Tingchen Fu

Mrinank Sharma

Philip Torr

Shay B. Cohen

Fazl Barez

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To addre… (voir plus)ss this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks and the influence on model resilience varies among different model suites. (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data

Tingchen Fu

Mrinank Sharma

Philip Torr

Shay B Cohen

Fazl Barez

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To addre… (voir plus)ss this concern, we introduce PoisonBench, a benchmark for evaluating large language models’ susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks and the influence on model resilience varies among different model suites. (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

Position: Humanity Faces Existential Risk from Gradual Disempowerment

Jan Kulveit

Raymond Douglas

Nora Ammann

Deger Turan

David Duvenaud

This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of ‘gradual di… (voir plus)sempowerment’, in contrast to the abrupt takeover scenarios commonly discussed in AI safety. We analyze how even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human preferences that often arise from societal systems’ reliance on human participation to function. Furthermore, AI systems may amplify existing misalignments with human preferences by optimizing these systems more powerfully. These distortions across domains may be mutually reinforcing: economic power shapes cultural narratives and political decisions, while cultural shifts alter economic and political behavior. We argue that this dynamic could lead to an effectively irreversible loss of human influence over crucial societal systems, precipitating an existential catastrophe through the permanent disempowerment of humanity. This analysis suggests the need for both technical research and governance approaches that specifically address the risk of incremental erosion of human influence across interconnected societal systems.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

Position: Probabilistic Modelling is Sufficient for Causal Inference

Bruno Mlodozeniec

Richard E. Turner

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

Position: Probabilistic Modelling is Sufficient for Causal Inference

Bruno Mlodozeniec

Richard E Turner

Causal inference is a key research area in machine learning, yet confusion reigns over the tools needed to tackle it. There are prevalent cl… (voir plus)aims in the machine learning literature that you need a bespoke causal framework or notation to answer causal questions. In this paper, we make it clear that you can answer any causal inference question within the realm of probabilistic modelling and inference, without causal-specific tools or notation. Through concrete examples, we demonstrate how causal questions can be tackled by writing down the probability of everything. We argue for the advantages of the generality of the probabilistic modelling lens, when compared to bespoke causal frameworks. Lastly, we reinterpret causal tools as emerging from standard probabilistic modelling and inference, elucidating their necessity and utility.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Jan Wehner

Sahar Abdelnabi

Daniel Tan

Mario Fritz

Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs… (voir plus) or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

2025-09-23

TMLR (accepté)

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Jan Wehner

Sahar Abdelnabi

Daniel Chee Hian Tan

Mario Fritz

Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs… (voir plus) or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

2025-09-23

TMLR (accepté)

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Jan Wehner

Sahar Abdelnabi

Daniel Chee Hian Tan

Mario Fritz

Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs… (voir plus) or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

2025-09-23

TMLR (accepté)

Fresh in memory: Training-order recency is linearly encoded in language model activations

Dmitrii Krasheninnikov

Richard E. Turner

We show that language models' activations linearly encode when information was learned during training. Our setup involves creating a model … (voir plus)with a known training order by sequentially fine-tuning Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. We find that the average activations of test samples corresponding to the six training datasets encode the training order: when projected into a 2D subspace, these centroids are arranged exactly in the order of training and lie on a straight line. Further, we show that linear probes can accurately (~90%) distinguish "early" vs. "late" entities, generalizing to entities unseen during the probes' own training. The model can also be fine-tuned to explicitly report an unseen entity's training stage (~80% accuracy). Interestingly, the training-order encoding does not seem attributable to simple differences in activation magnitudes, losses, or model confidence. Our paper demonstrates that models are capable of differentiating information by its acquisition time, and carries significant implications for how they might manage conflicting data and respond to knowledge modifications.

2025-09-17

ArXiv (prépublication)

doi.org

arxiv.org

Fresh in memory: Training-order recency is linearly encoded in language model activations

Dmitrii Krasheninnikov

Richard E. Turner