A generative approach to LLM harmfulness detection with special red flag tokens
Sophie Xhonneux
David Dobre
Mehrnaz Mofakhami
Leo Schwinn
Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of … (voir plus)the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make likely an initial token of affirmative response. To avoid that, we propose to expand the model's vocabulary with a special token we call red flag token () and propose to fine-tune the model to generate this token at any time harmful content is generated or about to be generated. This novel safety training method effectively augments LLMs into generative classifiers of harmfulness at all times during the conversation. This method offers several advantages: it enables the model to explicitly learn the concept of harmfulness while marginally affecting the generated distribution, thus maintaining the model's utility. It also evaluates each generated answer rather than just the input prompt and provides a stronger defence against sampling-based attacks. In addition, it simplifies the evaluation of the model's robustness and reduces correlated failures when combined with a classifier. We further show an increased robustness to long contexts, and supervised fine-tuning attacks.
A generative approach to LLM harmfulness detection with special red flag tokens
Sophie Xhonneux
David Dobre
Mehrnaz Mofakhami
Leo Schwinn
Improving the Scaling Laws of Synthetic Data with Deliberate Practice
Reyhane Askari Hemmat
Mohammad Pezeshki
Florian Bordes
Pietro Astolfi
Melissa Hall
Jakob Verbeek
Michal Drozdzal
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a nov… (voir plus)el framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.
PairBench: Are Vision-Language Models Reliable at Comparing What They See?
Aarash Feizi
Sai Rajeswar
Valentina Zantedeschi
Spandana Gella
Joao Monteiro
Understanding how effectively large vision language models (VLMs) compare visual inputs is crucial across numerous applications, yet this fu… (voir plus)ndamental capability remains insufficiently assessed. While VLMs are increasingly deployed for tasks requiring comparative judgment, including automated evaluation, re-ranking, and retrieval-augmented generation, no systematic framework exists to measure their performance in these scenarios. We present PairBench, a simple framework that evaluates VLMs as customizable similarity tools using widely available image datasets. Our approach introduces four key metrics for reliable comparison: alignment with human annotations, consistency across pair ordering, distribution smoothness, and controllability through prompting. Our analysis reveals that no model consistently excels across all metrics, with each demonstrating distinct strengths and weaknesses. Most concerning is the widespread inability of VLMs to maintain symmetric similarity scores. Interestingly, we demonstrate that performance on our benchmark strongly correlates with popular benchmarks used for more complex tasks, while providing additional metrics into controllability, smoothness and ordering. This makes PairBench a unique and comprehensive framework to evaluate the performance of VLMs for automatic evaluation depending on the task.
PairBench: A Systematic Framework for Selecting Reliable Judge VLMs
Aarash Feizi
Sai Rajeswar
Spandana Gella
Valentina Zantedeschi
Joao Monteiro
As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare dat… (voir plus)a pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator's desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.
Harnessing artificial intelligence to fill global shortfalls in biodiversity knowledge
Justin Kitzes
Sara Beery
Kaitlyn M. Gaynor
Marta A. Jarzyna
Oisin Mac Aodha
Bernd Meyer
Graham W. Taylor
Devis Tuia
Tanya Berger-Wolf
How to Get Your LLM to Generate Challenging Problems for Evaluation
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional hum… (voir plus)an annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further necessitates the need to develop scalable and automatically renewable evaluation methodologies. In this work, we introduce **CHASE**, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60\% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drastically drops by as much as 70\% when we scale up the context size to 50k tokens.
OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes
F'elix Therrien
Jamal Abou Haibeh
Divya Sharma
Rhiannon Hendley
Alex Hern'andez-Garc'ia
Sun Sun
Alain Tchagang
Jiang Su
Samuel Huberman
Hongyu Guo
Homin Shin
Solid-state electrolyte batteries are expected to replace liquid electrolyte lithium-ion batteries in the near future thanks to their higher… (voir plus) theoretical energy density and improved safety. However, their adoption is currently hindered by their lower effective ionic conductivity, a quantity that governs charge and discharge rates. Identifying highly ion-conductive materials using conventional theoretical calculations and experimental validation is both time-consuming and resource-intensive. While machine learning holds the promise to expedite this process, relevant ionic conductivity and structural data is scarce. Here, we present OBELiX, a domain-expert-curated database of
OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes
Félix Therrien
Jamal Abou Haibeh
Divya Sharma
Rhiannon Hendley
Alex Hernandez-Garcia
Sun Sun
Alain Tchagang
Jiang Su
Samuel Huberman
Hongyu Guo
Homin Shin
Solid-state electrolyte batteries are expected to replace liquid electrolyte lithium-ion batteries in the near future thanks to their higher… (voir plus) theoretical energy density and improved safety. However, their adoption is currently hindered by their lower effective ionic conductivity, a quantity that governs charge and discharge rates. Identifying highly ion-conductive materials using conventional theoretical calculations and experimental validation is both time-consuming and resource-intensive. While machine learning holds the promise to expedite this process, relevant ionic conductivity and structural data is scarce. Here, we present OBELiX, a domain-expert-curated database of
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Shaona Ghosh
Heather Frase
Adina Williams
Sarah Luger
Paul Rottger
Fazl Barez
Sean McGregor
Kenneth Fricklas
Mala Kumar
Quentin Feuillade--Montixi
Kurt Bollacker
Felix Friedrich
Ryan Tsang
Bertie Vidgen
Alicia Parrish
Chris Knotz
Eleonora Presani
Jonathan Bennion
Marisa Ferrara Boston
Mike Kuniavsky … (voir 81 de plus)
Wiebke Hutiri
James Ezick
Malek Ben Salem
Rajat Sahay
Sujata Goswami
Usman Gohar
Ben Huang
Supheakmungkol Sarin
Elie Alhajjar
Canyu Chen
Roman Eng
K. Manjusha
Virendra Mehta
Eileen Peters Long
Murali Krishna Emani
Natan Vidra
Benjamin Rukundo
Abolfazl Shahbazi
Kongtao Chen
Rajat Ghosh
Vithursan Thangarasa
Pierre Peign'e
Abhinav Singh
Max Bartolo
Satyapriya Krishna
Mubashara Akhtar
Rafael Gold
Cody Coleman
Luis Oala
Vassil Tashev
Joseph Marvin Imperial
Amy Russ
Sasidhar Kunapuli
Nicolas Miailhe
Julien Delaunay
Bhaktipriya Radharapu
Rajat Shinde
Tuesday
Debojyoti Dutta
Declan Grabb
Ananya Gangavarapu
Saurav Sahay
Agasthya Gangavarapu
Patrick Schramowski
Stephen Singam
Tom David
Xudong Han
Priyanka Mary Mammen
Tarunima Prabhakar
Venelin Kovatchev
Ahmed M. Ahmed
Kelvin Manyeki
Sandeep Madireddy
Fedor Zhdanov
Joachim Baumann
N. Vasan
Xianjun Yang
Carlos Mougn
Jibin Rajan Varghese
Hussain Chinoy
Seshakrishna Jitendar
Manil Maskey
Claire V. Hardgrove
Tianhao Li
Aakash Gupta
Emil Joswin
Yifan Mai
Shachi H. Kumar
Çigdem Patlak
Kevin Lu
Vincent Alessi
Sree Bhargavi Balija
Chenhe Gu
Robert Sullivan
James Gealy
Matt Lavrisa
James Goel
Peter Mattson
Percy Liang
Joaquin Vanschoren
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Shaona Ghosh
Heather Frase
Adina Williams
Sarah Luger
Paul Rottger
Fazl Barez
Sean McGregor
Kenneth Fricklas
Mala Kumar
Quentin Feuillade--Montixi
Kurt Bollacker
Felix Friedrich
Ryan Tsang
Bertie Vidgen
Alicia Parrish
Chris Knotz
Eleonora Presani
Jonathan Bennion
Marisa Ferrara Boston
Mike Kuniavsky … (voir 81 de plus)
Wiebke Hutiri
James Ezick
Malek Ben Salem
Rajat Sahay
Sujata Goswami
Usman Gohar
Ben Huang
Supheakmungkol Sarin
Elie Alhajjar
Canyu Chen
Roman Eng
K. Manjusha
Virendra Mehta
Eileen Peters Long
Murali Krishna Emani
Natan Vidra
Benjamin Rukundo
Abolfazl Shahbazi
Kongtao Chen
Rajat Ghosh
Vithursan Thangarasa
Pierre Peign'e
Abhinav Singh
Max Bartolo
Satyapriya Krishna
Mubashara Akhtar
Rafael Gold
Cody Coleman
Luis Oala
Vassil Tashev
Joseph Marvin Imperial
Amy Russ
Sasidhar Kunapuli
Nicolas Miailhe
Julien Delaunay
Bhaktipriya Radharapu
Rajat Shinde
Tuesday
Debojyoti Dutta
D. Grabb
Ananya Gangavarapu
Saurav Sahay
Agasthya Gangavarapu
Patrick Schramowski
Stephen Singam
Tom David
Xudong Han
Priyanka Mary Mammen
Tarunima Prabhakar
Venelin Kovatchev
Ahmed M. Ahmed
Kelvin Manyeki
Sandeep Madireddy
Fedor Zhdanov
Joachim Baumann
N. Vasan
Xianjun Yang
Carlos Mougn
Jibin Rajan Varghese
Hussain Chinoy
Seshakrishna Jitendar
Manil Maskey
Claire V. Hardgrove
Tianhao Li
Aakash Gupta
Emil Joswin
Yifan Mai
Shachi H. Kumar
Çigdem Patlak
Kevin Lu
Vincent Alessi
Sree Bhargavi Balija
Chenhe Gu
Robert Sullivan
James Gealy
Matt Lavrisa
James Goel
Peter Mattson
Percy Liang
Joaquin Vanschoren
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems
Myra Cheng
Su Lin Blodgett
Alicia DeVrio
Lisa Egede
As text generation systems' outputs are increasingly anthropomorphic -- perceived as human-like -- scholars have also raised increasing conc… (voir plus)erns about how such outputs can lead to harmful outcomes, such as users over-relying or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and a crowdsourced study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.