Kellin Pelrine

Avijit Ghosh

Andrew Strait

Robert Kirk

Dan Hendrycks

Peter Henderson

J. Zico Kolter

Geoffrey Irving

Yarin Gal … (2 more)

Yoshua Bengio

Dylan Hadfield-Menell

2024-12-31

SSRN Electronic Journal (accepted)

A Simulation System Towards Solving Societal-Scale Manipulation

Maximilian Puelma Touzel

Sneheel Sarangi

Austin Welch

Gayatri K

Dan Zhao

Hao Yu

Ethan Kosak-Hine

Tom Gibbs

Andreea Musulan

Camille Thibault

Busra Tugce Gurbuz

The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet, studying these effects in real-world settings at scale is ethically and logistically impractical, highlighting a need for simulation tools that can model these dynamics in controlled settings to enable experimentation with possible defenses. We present a simulation environment designed to address this. We elaborate upon the Concordia framework that simulates offline, 'real life' activity by adding online interactions to the simulation through social media with the integration of a Mastodon server. We improve simulation efficiency and information flow, and add a set of measurement tools, particularly longitudinal surveys. We demonstrate the simulator with a tailored example in which we track agents' political positions and show how partisan manipulation of agents can affect election results.

2024-10-16

ArXiv (preprint)

Epistemic Integrity in Large Language Models

Bijean Ghafouri

Shahrad Mohammadzadeh

James Zhou

Pratheeksha Nair

Jacob-Junqi Tian

Mayank Goel

Large language models are increasingly relied upon as sources of information, but their propensity for generating false or misleading statements with high confidence poses risks for users and society. In this paper, we confront the critical problem of epistemic miscalibration, where a model's linguistic assertiveness fails to reflect its true internal certainty. We introduce a new human-labeled dataset and a novel method for measuring the linguistic assertiveness of Large Language Models which cuts error rates by over 50% relative to previous benchmarks. Validated across multiple datasets, our method reveals a stark misalignment between how confidently models linguistically present information and their actual accuracy. Further human evaluations confirm the severity of this miscalibration. This evidence underscores the urgent risk of the overstated certainty Large Language Models hold which may mislead users on a massive scale. Our framework provides a crucial step forward in diagnosing and correcting this miscalibration, offering a path to safer and more trustworthy AI across domains.

2024-10-11

NeurIPS.cc/2024/Workshop/SafeGenAi (poster)

Simulation System Towards Solving Societal-Scale Manipulation

Maximilian Puelma Touzel

Sneheel Sarangi

Austin Welch

Gayatri K

Dan Zhao

Hao Yu

Tom Gibbs

Ethan Kosak-Hine

Andreea Musulan

Camille Thibault

Busra Tugce Gurbuz

The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet, studying these effects in real-world settings at scale is ethically and logistically impractical, highlighting a need for simulation tools that can model these dynamics in controlled settings to enable experimentation with possible defenses. We present a simulation environment designed to address this. We elaborate upon the Concordia framework that simulates offline, 'real life' activity by adding online interactions to the simulation through social media with the integration of a Mastodon server. Through a variety of means we then improve simulation efficiency and information flow, and add a set of measurement tools, particularly longitudinal surveys of the agents' political positions. We demonstrate the simulator with a tailored example of how partisan manipulation of agents can affect election results.

2024-10-11

NeurIPS.cc/2024/Workshop/SafeGenAi (poster)

The Structural Safety Generalization Problem

Tom Gibbs

Julius Broomfield

George Ingebretsen

Ethan Kosak-Hine

Tia Nasir

Jason Zhang

Reihaneh Iranmanesh

Sara Pieri

It is widely known that AI is vulnerable to adversarial examples, from pixel perturbations to jailbreaks. We propose that there is a key, easier class of problems that is also still unsolved: failures of safety to generalize over structure, despite semantic equivalence. We demonstrate this vulnerability by showing how recent AI systems are differently vulnerable both to multi-turn and multi-image attacks, compared to their single-turn and single-image counterparts with equivalent meaning. We suggest this is the same class of vulnerability as that found in yet unconnected threads of the literature: vulnerabilities to low-resource languages and indefensibility of strongly superhuman Go AIs to cyclic attacks. When viewed together, these reveal a common picture: models that are not only vulnerable to attacks, but vulnerable to attacks with near identical meaning in their benign and harmful components both, and only different in structure. In contrast to attacks with identical benign input (e.g., pictures that look like cats) but unknown semanticity of the harmful component (e.g., diverse noise that is all unintelligible to humans), these represent a class of attacks where semantic understanding and defense against one version should guarantee defense against others, yet current AI safety measures do not. This vulnerability represents a necessary but not sufficient condition towards defending against attacks whose harmful component has arbitrary semanticity. Consequently, by building on the data and approaches we highlight, we frame an intermediate problem for AI safety to solve, that represents a critical checkpoint towards safe AI while being far more tractable than trying to solve it directly and universally.

2024-10-11

NeurIPS.cc/2024/Workshop/SafeGenAi (poster)

Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries

Julius Broomfield

George Ingebretsen

Reihaneh Iranmanesh

Sara Pieri

Ethan Kosak-Hine

Tom Gibbs

Large Language Models have been extensively studied for their vulnerabilities, particularly in the context of adversarial attacks. However, the emergence of Vision Language Models introduces new modalities of risk that have not yet been thoroughly explored, especially when processing multiple images simultaneously. In this paper, we introduce two black-box jailbreak methods that leverage multi-image inputs to uncover vulnerabilities in these models. We present a new safety evaluation dataset for multimodal LLMs called MultiBench, which is composed of these jailbreak methods. These methods can easily be applied and evaluated using our toolkit. We test these methods against six safety aligned frontier models from Google, OpenAI, and Anthropic, revealing significant safety vulnerabilities. Our findings suggest that even the most powerful language models remain vulnerable against compositional adversarial attacks, specifically those composed of multiple images.

2024-10-08

NeurIPS.cc/2024/Workshop/Red_Teaming_GenAI (poster)

Web Retrieval Agents for Evidence-Based Misinformation Detection

Jacob-Junqi Tian

Hao Yu

Yury Orlovskiy

Tyler Vergho

Mauricio Rivera

Mayank Goel

2024-07-09

colmweb.org/COLM/2024/Conference (accepted)

Regional and Temporal Patterns of Partisan Polarization during the COVID-19 Pandemic in the United States and Canada

Anne Imouza

Maximilian Puelma Touzel

Cécile Amadoro

Gabrielle Desrosiers-Brisebois

Sacha Lévy

Public health measures were among the most polarizing topics debated online during the COVID-19 pandemic. Much of the discussion surrounded specific events, such as when and which particular interventions came into practice. In this work, we develop and apply an approach to measure subnational and event-driven variation of partisan polarization and explore how these dynamics varied both across and within countries. We apply our measure to a dataset of over 50 million tweets posted during late 2020, a salient period of polarizing discourse in the early phase of the pandemic. In particular, we examine regional variations in both the United States and Canada, focusing on three specific health interventions: lockdowns, masks, and vaccines. We find that more politically conservative regions had higher levels of partisan polarization in both countries, especially in the US where a strong negative correlation exists between regional vaccination rates and degree of polarization in vaccine-related discussions. We then analyze the timing, context, and profile of spikes in polarization, linking them to specific events discussed on social media across different regions in both countries. These typically last only a few days in duration, suggesting that online discussions reflect and could even drive changes in public opinion, which in the context of pandemic response impacts public health outcomes across different regions and over time.

2024-07-02

ArXiv (preprint)

Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation

Mauricio Rivera

2024-01-12

ArXiv (preprint)

Comparing GPT-4 and Open-Source Language Models in Misinformation Mitigation

Tyler Vergho

Recent large language models (LLMs) have been shown to be effective for misinformation detection. However, the choice of LLMs for experiments varies widely, leading to uncertain conclusions. In particular, GPT-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions. Meanwhile, alternative LLMs have given mixed results. In this work, we show that Zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches like Llama-2 and GPT-3.5. This provides the research community with a solid open-source option and shows open-source models are gradually catching up on this task. We then highlight how GPT-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection. Finally, we validate new tools including approaches to structured output and the latest version of GPT-4 (Turbo), showing they do not compromise performance, thus unlocking them for future research and potentially enabling more complex pipelines for misinformation mitigation.

2024-01-11

ArXiv (preprint)

Uncertainty Resolution in Misinformation Detection

Yury Orlovskiy

Camille Thibault

Anne Imouza

2024-01-01

ArXiv (preprint)

An Evaluation of Language Models for Hyperpartisan Ideology Detection in Persian Twitter

Sahar Omidi Shayegan

Isar Nejadgholi

Hao Yu

Sacha Lévy