Publications

LLM-Safety Evaluations Lack Robustness

Tim Beyer

Simon Geisler

Stephan Günnemann

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

2025-03-01

arXiv (published)

doi.org

arxiv.org

Negotiative Alignment: Embracing Disagreement to Achieve Fairer Outcomes -- Insights from Urban Studies

Rashid A. Mushkani

Hugo Berard

Shin (Alexandre) Koseki

2025-03-01

arXiv (published)

doi.org

arxiv.org

Normalizing Spinal Cord Compression Measures in Degenerative Cervical Myelopathy

Sandrine Bédard

Jan Valosek

Maryam Seif

Armin Curt

Simon Schading-Sassenhausen

Nikolai Pfender

P. Freund

Markus Hupp

Julien Cohen-Adad

2025-03-01

The spine journal (published)

doi.org

PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion

Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures robust to the unique complexities posed by medical imaging data. The rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.

2025-03-01

arXiv (published)

doi.org

RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models

Parham Saremi

Amar Kumar

Mohammed Mohammed

Zahra Tehrani Nasab

Tal Arbel

2025-03-01

arXiv (published)

doi.org

arxiv.org

Self-adaptive cyber defense for sustainable IoT: A DRL-based IDS optimizing security and energy efficiency

Saeid Jamshidi

Ashkan Amirnia

Amin Nikanjam

Kawser Wazed Nafi

Foutse Khomh

Samira Keivanpour

2025-03-01

Journal of Network and Computer Applications (published)

doi.org

SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection

Shamsuddeen Hassan Muhammad

Nedjma Ousidhoum

Idris Abdulmumin

Seid Muhie Yimam

Jan Philip Wahle

Terry Lima Ruas

Meriem Beloucif

Christine de Kock

Tadesse Belay

Ibrahim Ahmad

Nirmal Surange

Daniela Teodorescu

David Ifeoluwa Adelani

Alham Fikri Aji

Felermino Ali

Vladimir Araujo

Abinew Ayele

Oana Ignat

Alexander Panchenko

Yi Zhou … (see 1 more)

Saif M. Mohammad

2025-03-01

arXiv (published)

doi.org

arxiv.org

A three-state coupled Markov switching model for COVID-19 outbreaks across Quebec based on hospital admissions

Dirk Douwes-Schultz

Alexandra M. Schmidt

Yannan Shen

David Buckeridge

2025-03-01

The Annals of Applied Statistics (published)

doi.org

arxiv.org

Tractable Representations for Convergent Approximation of Distributional HJB Equations

Julie Alhosh

Harley Wiltzer

David Meger

2025-03-01

arXiv (published)

doi.org

arxiv.org

Unveiling Inefficiencies in LLM-Generated Code: Toward a Comprehensive Taxonomy

Altaf Allah Abbassi

Leuson Da Silva

Amin Nikanjam

Foutse Khomh

2025-03-01

arXiv (published)

doi.org

arxiv.org

PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion

Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures robust to the unique complexities posed by medical imaging data. The rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.

2025-02-28

arXiv (preprint)

arxiv.org

Steering Large Language Model Activations in Sparse Spaces

Reza Bayat

Ali Rahimi-Kalahroudi

Mohammad Pezeshki

Sarath Chandar

Pascal Vincent

A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations provide an untapped opportunity for more interpretable behavior modulation. In this work, we introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer LLM behavior in sparse spaces. By isolating behavior-specific features through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable nuanced behavioral modulation and finer-grained control. Furthermore, scaling SAEs improves monosemanticity of SAS vectors, suggesting more reliable and interpretable interventions.

2025-02-28

arXiv (preprint)

doi.org

arxiv.org
