Publications
A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens
Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the model's output distribution when it is faced with a harmful request, shifting it from an unsafe answer to a refusal to respond.
These methods inherently compromise model capabilities and may make auto-regressive models vulnerable to attacks that increase the likelihood of an initial affirmative-response token.
To avoid this, we propose expanding the model's vocabulary with a special token we call a *red flag token*.
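As a rough illustration of the vocabulary-expansion step, the sketch below adds a special token to an existing tokenizer and resizes the model's embedding matrix with Hugging Face transformers; the token string `<rf>` and the choice of checkpoint are placeholders, not the paper's actual setup.

```python
# A minimal sketch of adding a special "red flag" token to an existing LLM
# vocabulary with Hugging Face transformers. The token string "<rf>" is a
# placeholder; the paper's actual symbol is not shown in the truncated abstract.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the new special token and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<rf>"]})
model.resize_token_embeddings(len(tokenizer))

rf_id = tokenizer.convert_tokens_to_ids("<rf>")
print(f"red flag token id: {rf_id}")

# During safety fine-tuning one would then train the model to emit <rf>
# inside responses to harmful requests, and at inference time flag any
# generation whose token ids contain rf_id.
```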
Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods.
In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property
Misinformation is a complex societal issue, and mitigating solutions are difficult to create due to data deficiencies. To address this problem, we have curated the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluated the quality of all 36 datasets that consist of statements or claims, as well as the 9 datasets that consist of data purely in paragraph form. We assess these datasets to identify those with solid foundations for empirical work and those with flaws that could result in misleading and non-generalizable results, such as insufficient label quality and spurious correlations. We further provide state-of-the-art baselines on all these datasets, but show that regardless of label quality, categorical labels may no longer give an accurate evaluation of detection model performance. We discuss alternatives to mitigate this problem. Overall, this guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations, ultimately improving research in misinformation detection. All datasets and other artifacts are available at [anonymized].
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts, the first visualization tool for users to inspect the reasoning paths of chain-of-thought and its derivatives on any multi-choice dataset. Specifically, we represent the states in a reasoning path as feature vectors that quantify their distances to all answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative analysis shows that the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt our tool to a neural model that predicts any property they observe. We showcase this advantage by adapting our tool to a lightweight verifier, which significantly improves reasoning by evaluating the correctness of reasoning paths.
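To make the featurization concrete, here is a minimal sketch under the assumption that reasoning states and answer choices are available as embedding vectors and that Euclidean distance is used; the paper's exact feature construction may differ.

```python
# A minimal sketch of the featurization-plus-t-SNE idea described above,
# assuming each reasoning state and each answer choice can be embedded as a
# vector (the embedding source and distance are assumptions, not the paper's
# exact recipe).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

n_states, n_choices, dim = 200, 4, 64
state_embs = rng.normal(size=(n_states, dim))    # stand-in for embedded reasoning states
choice_embs = rng.normal(size=(n_choices, dim))  # stand-in for embedded answer choices

def distance_features(states, choices):
    """Represent each state by its Euclidean distance to every answer choice."""
    diffs = states[:, None, :] - choices[None, :, :]
    return np.linalg.norm(diffs, axis=-1)         # shape: (n_states, n_choices)

features = distance_features(state_embs, choice_embs)

# Project the per-state distance vectors to 2D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(coords.shape)  # (200, 2) -> scatter-plot these to inspect reasoning paths
```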
Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.
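As a toy illustration of the learning-to-defer idea (not the paper's L2D-CD formulation), the sketch below trains a simple gating classifier on hypothetical meta-features to decide, per causal pair, whether to use the data-driven prediction or the expert's recommendation.

```python
# A toy sketch of the learning-to-defer idea: a gating classifier decides,
# per causal pair, whether to trust the data-driven method or the expert.
# The meta-features and logistic-regression gate are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_pairs = 500

# Hypothetical inputs: per-pair direction predictions (0 = X->Y, 1 = Y->X)
# and ground truth, here random stand-ins.
data_pred = rng.integers(0, 2, n_pairs)
expert_pred = rng.integers(0, 2, n_pairs)
truth = rng.integers(0, 2, n_pairs)

# Meta-features describing each pair (e.g. sample size, expert confidence);
# again random stand-ins for the sketch.
meta = rng.normal(size=(n_pairs, 3))

# Train the gate to predict whether deferring to the expert is the right call.
defer_label = (expert_pred == truth).astype(int)
gate = LogisticRegression().fit(meta, defer_label)

# Combine: defer to the expert only where the gate trusts them.
defer = gate.predict(meta).astype(bool)
combined = np.where(defer, expert_pred, data_pred)
print("combined accuracy:", (combined == truth).mean())
```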
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions to a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We analyze key properties of the CPO objective, showing that it directly optimizes for the concept posterior distribution, and contrast it with Binary Cross-Entropy (BCE), showing that CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets, both with and without added label noise.
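For intuition, the sketch below contrasts a standard BCE concept loss with a DPO-style pairwise objective over concept logits; the pairing scheme and the beta temperature are assumptions for illustration and do not reproduce the paper's exact CPO objective.

```python
# An illustrative DPO-style pairwise loss over concept logits, in the spirit
# of the CPO objective described above. The "preferred" vs "dispreferred"
# pairing and the beta temperature are assumptions, not the paper's exact
# formulation.
import torch
import torch.nn.functional as F

def pairwise_concept_loss(logits_pref, logits_dispref, beta=1.0):
    """Encourage preferred concept configurations to score higher than
    dispreferred ones via a logistic (DPO-style) pairwise objective."""
    margin = beta * (logits_pref - logits_dispref)
    return -F.logsigmoid(margin).mean()

def bce_concept_loss(logits, labels):
    """Standard per-concept binary cross-entropy baseline."""
    return F.binary_cross_entropy_with_logits(logits, labels)

# Tiny usage example with random concept logits for a batch of 8 samples
# and 5 concepts.
torch.manual_seed(0)
logits_pref = torch.randn(8, 5)
logits_dispref = torch.randn(8, 5)
labels = torch.randint(0, 2, (8, 5)).float()

print("pairwise loss:", pairwise_concept_loss(logits_pref, logits_dispref).item())
print("BCE loss:", bce_concept_loss(logits_pref, labels).item())
```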
Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint prompting, make inefficient use of the LLM responses. We present Refined Answer Distributions, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode, that is, the most likely answer. Empirical evaluation on several reasoning benchmarks demonstrates the superiority of the proposed approach.
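A simplified sketch of the underlying mode-finding idea follows: sample several answers, form an empirical answer distribution, and return its mode. The iterative refinement of the distribution that gives the method its name is omitted, and `ask_llm` is a placeholder.

```python
# A simplified sketch of estimating the most likely answer by Monte Carlo
# sampling of LLM responses and taking the mode of the empirical answer
# distribution. `ask_llm` stands in for an actual sampling call; the iterative
# refinement described above is omitted here.
import random
from collections import Counter

def ask_llm(question: str) -> str:
    """Placeholder: sample one answer from the model at nonzero temperature."""
    return random.choice(["42", "42", "41", "43"])  # toy answer distribution

def estimate_answer(question: str, n_samples: int = 32) -> str:
    counts = Counter(ask_llm(question) for _ in range(n_samples))
    # The mode of the empirical distribution is the returned answer.
    answer, _ = counts.most_common(1)[0]
    return answer

print(estimate_answer("What is 6 * 7?"))
```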
This paper takes a position on how anti-misinformation AI work should be developed for the online misinformation context. We observe that the current literature is dominated by works that produce more information for users to process, and that this approach faces various challenges in bringing meaningful effects to reality. We use anti-misinformation insights from other domains to suggest a redirection of the existing line of work and identify an under-explored opportunity that AI can help explore.
Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientific disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences, fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.
Large language models (LLMs) are known to "hallucinate" by generating false or misleading outputs. Existing hallucination benchmarks often overlook prompt sensitivity, because accuracy scores remain stable despite prompt variations. However, such stability can be misleading. In this work, we introduce prompt multiplicity, the multiplicity of individual hallucinations depending on the input prompt, and study its role in LLM hallucination benchmarks. We find severe multiplicity, with more than 50% of responses on certain benchmarks, such as Med-HALT, flipping between correct and incorrect answers based solely on the prompt. Prompt multiplicity also gives us a lens to distinguish between randomness in generation and consistent factual inaccuracies, providing a more nuanced understanding of LLM hallucinations and their real-world harms. By situating our discussion within existing hallucination taxonomies (supporting their quantification) and exploring its relationship with uncertainty in generation, we highlight how prompt multiplicity fills a critical gap in the literature on LLM hallucinations.
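As a toy illustration of how prompt multiplicity could be quantified, the sketch below checks, for each question, whether correctness flips across a few prompt templates; the templates and the `answer` stub are assumptions, not the benchmark's actual protocol.

```python
# A toy sketch of measuring prompt multiplicity: for each question, check how
# often correctness flips across prompt variants. The templates and the
# `answer` function are placeholders, not the benchmark's actual setup.
import random

templates = [
    "Q: {q}\nA:",
    "Answer the following question.\n{q}",
    "You are a careful expert. {q}",
]

def answer(prompt: str) -> str:
    """Placeholder for a deterministic (temperature 0) model call."""
    random.seed(hash(prompt) % (2**32))
    return random.choice(["A", "B"])

def multiplicity(questions, gold):
    flipped = 0
    for q, g in zip(questions, gold):
        correct = {answer(t.format(q=q)) == g for t in templates}
        flipped += int(len(correct) > 1)  # both correct and incorrect seen
    return flipped / len(questions)

qs = [f"question {i}?" for i in range(100)]
gold = ["A"] * 100
print("fraction of items whose correctness depends on the prompt:", multiplicity(qs, gold))
```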