Reihaneh Rabbany

Hao Yu

Tom Gibbs

Ethan Kosak-Hine

Andreea Musulan

Camille Thibault

Busra Tugce Gurbuz

The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet, studying these effects in real-w… (voir plus)orld settings at scale is ethically and logistically impractical, highlighting a need for simulation tools that can model these dynamics in controlled settings to enable experimentation with possible defenses. We present a simulation environment designed to address this. We elaborate upon the Concordia framework that simulates offline, `real life' activity by adding online interactions to the simulation through social media with the integration of a Mastodon server. Through a variety of means we then improve simulation efficiency and information flow, and add a set of measurement tools, particularly longitudinal surveys of the agents' political positions. We demonstrate the simulator with a tailored example of how partisan manipulation of agents can affect election results.

2024-10-12

NeurIPS.cc/2024/Workshop/SafeGenAi (poster)

The Structural Safety Generalization Problem

Tom Gibbs

Julius Broomfield

George Ingebretsen

Ethan Kosak-Hine

Tia Nasir

Jason Zhang

Reihaneh Iranmanesh

Sara Pieri

It is widely known that AI is vulnerable to adversarial examples, from pixel perturbations to jailbreaks. We propose that there is a key, ea… (voir plus)sier class of problems that is also still unsolved: failures of safety to generalize over structure, despite semantic equivalence. We demonstrate this vulnerability by showing how recent AI systems are differently vulnerable both to multi-turn and multi-image attacks, compared to their single-turn and single-image counterparts with equivalent meaning. We suggest this is the same class of vulnerability as that found in yet unconnected threads of the literature: vulnerabilities to low-resource languages and indefensibility of strongly superhuman Go AIs to cyclic attacks. When viewed together, these reveal a common picture: models that are not only vulnerable to attacks, but vulnerable to attacks with near identical meaning in their benign and harmful components both, and only different in structure. In contrast to attacks with identical benign input (e.g., pictures that look like cats) but unknown semanticity of the harmful component (e.g., diverse noise that is all unintelligible to humans), these represent a class of attacks where semantic understanding and defense against one version should guarantee defense against others—yet current AI safety measures do not. This vulnerability represents a necessary but not sufficient condition towards defending against attacks whose harmful component has arbitrary semanticity. Consequently, by building on the data and approaches we highlight, we frame an intermediate problem for AI safety to solve, that represents a critical checkpoint towards safe AI while being far more tractable than trying to solve it directly and universally.

2024-10-12

NeurIPS.cc/2024/Workshop/SafeGenAi (poster)

The Structural Safety Generalization Problem

Tom Gibbs

Julius Broomfield

George Ingebretsen

Ethan Kosak-Hine

Tia Nasir

Jason Zhang

Reihaneh Iranmanesh

Sara Pieri

It is widely known that AI is vulnerable to adversarial examples, from pixel perturbations to jailbreaks. We propose that there is a key, ea… (voir plus)sier class of problems that is also still unsolved: failures of safety to generalize over structure, despite semantic equivalence. We demonstrate this vulnerability by showing how recent AI systems are differently vulnerable both to multi-turn and multi-image attacks, compared to their single-turn and single-image counterparts with equivalent meaning. We suggest this is the same class of vulnerability as that found in yet unconnected threads of the literature: vulnerabilities to low-resource languages and indefensibility of strongly superhuman Go AIs to cyclic attacks. When viewed together, these reveal a common picture: models that are not only vulnerable to attacks, but vulnerable to attacks with near identical meaning in their benign and harmful components both, and only different in structure. In contrast to attacks with identical benign input (e.g., pictures that look like cats) but unknown semanticity of the harmful component (e.g., diverse noise that is all unintelligible to humans), these represent a class of attacks where semantic understanding and defense against one version should guarantee defense against others—yet current AI safety measures do not. This vulnerability represents a necessary but not sufficient condition towards defending against attacks whose harmful component has arbitrary semanticity. Consequently, by building on the data and approaches we highlight, we frame an intermediate problem for AI safety to solve, that represents a critical checkpoint towards safe AI while being far more tractable than trying to solve it directly and universally.

2024-10-12

NeurIPS.cc/2024/Workshop/SafeGenAi (poster)

Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries

Julius Broomfield

George Ingebretsen

Reihaneh Iranmanesh

Sara Pieri

Ethan Kosak-Hine

Tom Gibbs

Large Language Models have been extensively studied for their vulnerabilities, particularly in the context of adversarial attacks. However, … (voir plus)the emergence of Vision Language Models introduces new modalities of risk that have not yet been thoroughly explored, especially when processing multiple images simultaneously. In this paper, we introduce two black-box jailbreak methods that leverage multi-image inputs to uncover vulnerabilities in these models. We present a new safety evaluation dataset for multimodal LLMs called MultiBench, which is composed of these jailbreak methods. These methods can easily be applied and evaluated using our toolkit. We test these methods against six safety aligned frontier models from Google, OpenAI, and Anthropic, revealing significant safety vulnerabilities. Our findings suggest that even the most powerful language models remain vulnerable against compositional adversarial attacks, specifically those composed of multiple images.

2024-10-09

NeurIPS.cc/2024/Workshop/Red_Teaming_GenAI (poster)

TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs

Erfan Loghmani

Emanuele Rossi

Ioannis Koutis

Heiner Stuckenschmidt

Guillaume Rabusseau

2024-09-26

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

ToxiSight: Insights Towards Detected Chat Toxicity

Domenico Tullo

We present a comprehensive explainability dashboard designed for in-game chat toxicity. This dashboard integrates various existing explainab… (voir plus)le AI (XAI) techniques, including token importance analysis, model output visualization, and attribution to the training dataset. It also provides insights through the closest positive and negative examples, facilitating a deeper understanding and potential correction of the training data. Additionally, the dashboard includes word sense analysis—particularly useful for new moderators—and offers free-text explanations for both positive and negative predictions. This multi-faceted approach enhances the interpretability and transparency of toxicity detection models.

2024-09-21

EMNLP/2024/Workshop/BlackBoxNLP (accepté)

UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs

Emanuele Rossi

Many real world graphs are inherently dynamic, constantly evolving with node and edge additions. These graphs can be represented by temporal… (voir plus) graphs, either through a stream of edge events or a sequence of graph snapshots. Until now, the development of machine learning methods for both types has occurred largely in isolation, resulting in limited experimental comparison and theoretical crosspollination between the two. In this paper, we introduce Unified Temporal Graph (UTG), a framework that unifies snapshot-based and event-based machine learning models under a single umbrella, enabling models developed for one representation to be applied effectively to datasets of the other. We also propose a novel UTG training procedure to boost the performance of snapshot-based models in the streaming setting. We comprehensively evaluate both snapshot and event-based models across both types of temporal graphs on the temporal link prediction task. Our main findings are threefold: first, when combined with UTG training, snapshot-based models can perform competitively with event-based models such as TGN and GraphMixer even on event datasets. Second, snapshot-based models are at least an order of magnitude faster than most event-based models during inference. Third, while event-based methods such as NAT and DyGFormer outperforms snapshot-based methods on both types of temporal graphs, this is because they leverage joint neighborhood structural features thus emphasizing the potential to incorporate these features into snapshotbased models as well. These findings highlight the importance of comparing model architectures independent of the data format and suggest the potential of combining the efficiency of snapshot-based models with the performance of event-based models in the future.

2024-07-17

ArXiv (prépublication)

arxiv.org

Web Retrieval Agents for Evidence-Based Misinformation Detection

Jacob-Junqi Tian

Hao Yu

Yury Orlovskiy

Tyler Vergho

Mauricio Rivera

Mayank Goel

2024-07-10

colmweb.org/COLM/2024/Conference (accepté)

Regional and Temporal Patterns of Partisan Polarization during the COVID-19 Pandemic in the United States and Canada

Anne Imouza

Maximilian Puelma Touzel

C'ecile Amadoro

Gabrielle Desrosiers-Brisebois

Sacha Lévy

Public health measures were among the most polarizing topics debated online during the COVID-19 pandemic. Much of the discussion surrounded … (voir plus)specific events, such as when and which particular interventions came into practise. In this work, we develop and apply an approach to measure subnational and event-driven variation of partisan polarization and explore how these dynamics varied both across and within countries. We apply our measure to a dataset of over 50 million tweets posted during late 2020, a salient period of polarizing discourse in the early phase of the pandemic. In particular, we examine regional variations in both the United States and Canada, focusing on three specific health interventions: lockdowns, masks, and vaccines. We find that more politically conservative regions had higher levels of partisan polarization in both countries, especially in the US where a strong negative correlation exists between regional vaccination rates and degree of polarization in vaccine related discussions. We then analyze the timing, context, and profile of spikes in polarization, linking them to specific events discussed on social media across different regions in both countries. These typically last only a few days in duration, suggesting that online discussions reflect and could even drive changes in public opinion, which in the context of pandemic response impacts public health outcomes across different regions and over time.

2024-07-03

ArXiv (prépublication)

arxiv.org

Regional and Temporal Patterns of Partisan Polarization during the COVID-19 Pandemic in the United States and Canada

Anne Imouza

Maximilian Puelma Touzel

C'ecile Amadoro

Gabrielle Desrosiers-Brisebois

Sacha Lévy

Public health measures were among the most polarizing topics debated online during the COVID-19 pandemic. Much of the discussion surrounded … (voir plus)specific events, such as when and which particular interventions came into practise. In this work, we develop and apply an approach to measure subnational and event-driven variation of partisan polarization and explore how these dynamics varied both across and within countries. We apply our measure to a dataset of over 50 million tweets posted during late 2020, a salient period of polarizing discourse in the early phase of the pandemic. In particular, we examine regional variations in both the United States and Canada, focusing on three specific health interventions: lockdowns, masks, and vaccines. We find that more politically conservative regions had higher levels of partisan polarization in both countries, especially in the US where a strong negative correlation exists between regional vaccination rates and degree of polarization in vaccine related discussions. We then analyze the timing, context, and profile of spikes in polarization, linking them to specific events discussed on social media across different regions in both countries. These typically last only a few days in duration, suggesting that online discussions reflect and could even drive changes in public opinion, which in the context of pandemic response impacts public health outcomes across different regions and over time.

2024-07-03

ArXiv (prépublication)

arxiv.org

Game On, Hate Off: A Study of Toxicity in Online Multiplayer Environments

Nicolas Grenon-Godbout

2024-06-28

Games Res. Pract. (publié)