Publications
Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts, the first visualization tool for users to inspect the reasoning paths of chain-of-thought and its derivatives on any multi-choice dataset. Specifically, we represent the states in a reasoning path as feature vectors that quantify their distances to all answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative analysis shows that the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt our tool to a neural model that predicts any property they observe. We showcase this advantage by adapting our tool to a lightweight verifier, which significantly improves reasoning by evaluating the correctness of reasoning paths.
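The idea of representing each reasoning state by its distances to all answer choices can be illustrated with a minimal sketch. The abstract does not say how distances are measured, so the Euclidean distance, the toy two-dimensional vectors, and the function name `distance_features` below are all assumptions made for illustration, not the authors' implementation. The t-SNE projection step is omitted.

```python
import math

def distance_features(state, choices):
    """Map one state vector to its normalized distances to every answer choice.

    Assumption: Euclidean distance over hypothetical embedding vectors.
    """
    dists = [math.dist(state, choice) for choice in choices]
    total = sum(dists)
    if total == 0:
        # Degenerate case: the state coincides with every choice.
        return [1.0 / len(choices)] * len(choices)
    return [d / total for d in dists]

# Hypothetical embeddings of two answer choices and a three-step reasoning path.
choices = [[0.0, 0.0], [3.0, 4.0]]
path = [[3.0, 0.0], [3.0, 3.0], [3.0, 4.0]]

# One feature vector per state; each has one entry per answer choice.
features = [distance_features(state, choices) for state in path]
for step, feature in enumerate(features):
    print(step, [round(x, 3) for x in feature])
```

In this toy path the last state coincides with the second choice, so its feature vector puts zero distance on that choice. Feature vectors of this kind are what a dimensionality-reduction method such as t-SNE would then project to two dimensions.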
Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.
Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective showing it directly optimizes for the concept’s posterior distribution, and contrast it against Binary Cross Entropy (BCE) where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis finding that CPO consistently outperforms BCE in three real-world datasets with and without added label noise.
Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query, or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint-prompting, make inefficient use of the LLM responses. We present Refined Answer Distributions, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode, the most likely answer. Empirical evaluation on several reasoning benchmarks demonstrates the superiority of the proposed approach.
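To make the notion of a Monte Carlo approximation of an answer distribution concrete, here is a minimal sketch of the simplest version of the idea: estimate the distribution from sampled answers and return its mode. This is the plain majority-vote baseline in the style of self-consistency, not the Refined Answer Distributions algorithm, whose iterative refinement steps are not described in the abstract. The sample answers and function names are hypothetical.

```python
from collections import Counter

def empirical_answer_distribution(answers):
    """Monte Carlo estimate of the answer distribution from sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return {answer: count / n for answer, count in counts.items()}

def mode_answer(answers):
    """Return the most frequent sampled answer (the estimated mode)."""
    dist = empirical_answer_distribution(answers)
    return max(dist, key=dist.get)

# Hypothetical final answers extracted from five sampled LLM responses.
samples = ["42", "41", "42", "42", "17"]
dist = empirical_answer_distribution(samples)
best = mode_answer(samples)
print(dist, best)
```

With few samples the estimated mode can be wrong when two answers have similar probability, which is one reason a more sample-efficient combination strategy is attractive.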
This paper takes a position on how anti-misinformation AI works should be developed for the online misinformation context. We observe that the current literature is dominated by works that produce more information for users to process and that this function faces various challenges in bringing meaningful effects to reality. We use anti-misinformation insights from other domains to suggest a redirection of the existing line of work and identify an under-explored opportunity AI can facilitate exploring.
Large language models (LLMs) are known to "hallucinate" by generating false or misleading outputs. Hallucinations pose various harms, from erosion of trust to widespread misinformation. Existing hallucination evaluation, however, focuses only on "correctness" and often overlooks "consistency", necessary to distinguish and address these harms. To bridge this gap, we introduce _prompt multiplicity_, a framework for quantifying consistency through prompt sensitivity. Our analysis reveals significant multiplicity (over 50% inconsistency in benchmarks like Med-HALT), suggesting that hallucination-related harms have been severely underestimated. Furthermore, we study the role of consistency in hallucination detection and mitigation. We find that: (a) detection techniques capture consistency, not correctness, and (b) mitigation techniques like RAG can introduce additional inconsistencies. By integrating prompt multiplicity into hallucination evaluation, we provide an improved framework of potential harms and uncover critical limitations in current detection and mitigation strategies.
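One simple way to quantify consistency through prompt sensitivity is to ask each benchmark question under several prompt variants and count the questions whose answer changes. The sketch below assumes this definition for illustration only; the abstract does not give the exact metric, and the data and the function name `inconsistency_rate` are hypothetical.

```python
def inconsistency_rate(answers_by_item):
    """Fraction of items whose answers differ across prompt variants.

    Assumption: an item counts as inconsistent if any two variants disagree.
    """
    inconsistent = sum(1 for answers in answers_by_item if len(set(answers)) > 1)
    return inconsistent / len(answers_by_item)

# Hypothetical answers: one inner list per question, one entry per prompt variant.
answers_by_item = [
    ["A", "A", "A"],  # consistent
    ["B", "C", "B"],  # inconsistent
    ["D", "D", "D"],  # consistent
    ["A", "B", "A"],  # inconsistent
]
rate = inconsistency_rate(answers_by_item)
print(rate)
```

Note that this measure is independent of correctness: a model can be consistently wrong or inconsistently right, which is the distinction the abstract draws.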
Transition path sampling (TPS) is an important method for studying rare events, such as those that occur in chemical reactions or protein folding. These events occur so infrequently that traditional simulations are often impractical, and even recent machine-learning approaches struggle to address this issue for larger systems. In this paper, we propose using modern deep learning techniques to improve the scalability of TPS methods significantly. We highlight the need for better evaluations in the existing literature and start by formulating TPS as a sampling problem over an unnormalized target density and introduce relevant evaluation metrics to assess the effectiveness of TPS solutions from this perspective. To develop a scalable approach, we explore several design choices, including a problem-informed neural network architecture, simulated annealing, the integration of prior knowledge into the sampling process, and attention mechanisms. Finally, we conduct a comprehensive empirical study and compare these design choices with other recently developed deep-learning methods for rare event sampling.