Mikhail Galkin

Xuan Li

Xiao Feng

Sanmi Koyejo

Bo Han

Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts, the first visualization tool for users to inspect the reasoning paths of chain-of-thought and its derivatives on any multi-choice dataset. Specifically, we represent the states in a reasoning path as feature vectors that quantify their distances to all answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative analysis shows that the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt our tool to a model that predicts any property they observe. We showcase this advantage by adapting our tool to a lightweight verifier, which significantly improves reasoning by evaluating the correctness of reasoning paths. The code is publicly available at https://github.com/tmlr-group/landscape-of-thoughts.

2025-07-09

ICML.cc/2025/Workshop/AI4MATH (poster)

Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models

Zhanke Zhou

Xuan Li

Xiao Feng

Sanmi Koyejo

Bo Han

2025-06-30

ICML.cc/2025/Workshop/R2-FM (poster)

Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models

Zhanke Zhou

Xuan Li

Xiao Feng

Sanmi Koyejo

Bo Han

Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts, the first visualization tool for users to inspect the reasoning paths of chain-of-thought and its derivatives on any multi-choice dataset. Specifically, we represent the states in a reasoning path as feature vectors that quantify their distances to all answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative analysis shows that the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt our tool to a neural model that predicts any property they observe. We showcase this advantage by adapting our tool to a lightweight verifier, which significantly improves reasoning by evaluating the correctness of reasoning paths.

2025-03-05

ICLR.cc/2025/Workshop/LLM_Reason_and_Plan (published)

Fully-inductive Node Classification on Arbitrary Graphs

Hesham Mostafa

Michael M. Bronstein

2025-01-22

ICLR.cc/2025/Conference (poster)

SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models

Daniel Levy

Siba Smarak Panigrahi

Sékou-Oumar Kaba

Qiang Zhu

Kin Long Kelvin Lee

Santiago Miret

Siamak Ravanbakhsh

2025-01-22

ICLR.cc/2025/Conference (poster)

TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs

Erfan Loghmani

Emanuele Rossi

Ioannis Koutis

Heiner Stuckenschmidt

Reihaneh Rabbany

Guillaume Rabusseau

2024-09-26

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

A Foundation Model for Zero-shot Logical Query Reasoning

Jincheng Zhou

Bruno Ribeiro

Complex logical query answering (CLQA) in knowledge graphs (KGs) goes beyond simple KG completion and aims at answering compositional queries composed of multiple projections and logical operations. Existing CLQA methods that learn parameters bound to certain entity or relation vocabularies can only be applied to the graph they are trained on, which requires substantial training time before being deployed on a new graph. Here we present UltraQuery, the first foundation model for inductive reasoning that can zero-shot answer logical queries on any KG. The core idea of UltraQuery is to derive both projections and logical operations as vocabulary-independent functions which generalize to new entities and relations in any KG. With the projection operation initialized from a pre-trained inductive KG completion model, UltraQuery can solve CLQA on any KG after finetuning on a single dataset. Experimenting on 23 datasets, UltraQuery in the zero-shot inference mode shows competitive or better query answering performance than the best available baselines and sets a new state of the art on 15 of them.

2024-09-25

NeurIPS.cc/2024/Conference (poster)

TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs

Erfan Loghmani

Emanuele Rossi

Ioannis Koutis

Heiner Stuckenschmidt

Reihaneh Rabbany

Guillaume Rabusseau

Multi-relational temporal graphs are powerful tools for modeling real-world data, capturing the evolving and interconnected nature of entities over time. Recently, many novel models have been proposed for ML on such graphs, intensifying the need for robust evaluation and standardized benchmark datasets. However, such resources remain scarce, and evaluation faces added complexity due to reproducibility issues in experimental protocols. To address these challenges, we introduce Temporal Graph Benchmark 2.0 (TGB 2.0), a novel benchmarking framework tailored for evaluating methods for predicting future links on Temporal Knowledge Graphs and Temporal Heterogeneous Graphs with a focus on large-scale datasets, extending the Temporal Graph Benchmark. TGB 2.0 facilitates comprehensive evaluations by presenting eight novel datasets spanning five domains with up to 53 million edges. TGB 2.0 datasets are significantly larger than existing datasets in terms of number of nodes, edges, or timestamps. In addition, TGB 2.0 provides a reproducible and realistic evaluation pipeline for multi-relational temporal graphs. Through extensive experimentation, we observe that 1) leveraging edge-type information is crucial to obtain high performance, 2) simple heuristic baselines are often competitive with more complex methods, and 3) most methods fail to run on our largest datasets, highlighting the need for research on more scalable methods.

2024-06-14

ArXiv (preprint)

GraphAny: A Foundation Model for Node Classification on Any Graph

Hesham Mostafa

Michael M. Bronstein

2024-05-30

ArXiv (preprint)

GraphAny: A Foundation Model for Node Classification on Any Graph

Hesham Mostafa

Michael M. Bronstein

Foundation models that can perform inference on any new task without requiring specific training have revolutionized machine learning in vision and language applications. However, applications involving graph-structured data remain a tough nut for foundation models, due to challenges in the unique feature and label spaces associated with each graph. Traditional graph ML models such as graph neural networks (GNNs) trained on graphs cannot perform inference on a new graph with feature and label spaces different from the training ones. Furthermore, existing models learn functions specific to the training graph and cannot generalize to new graphs. In this work, we tackle these two challenges with a new foundational architecture for inductive node classification named GraphAny. GraphAny models inference on a new graph as an analytical solution to a LinearGNN, thereby solving the first challenge. To solve the second challenge, we learn attention scores for each node to fuse the predictions of multiple LinearGNNs. Specifically, the attention module is carefully parameterized as a function of the entropy-normalized distance features between the predictions of multiple LinearGNNs to ensure generalization to new graphs. Empirically, GraphAny trained on the Wisconsin dataset with only 120 labeled nodes can effectively generalize to 30 new graphs with an average accuracy of 67.26% in an inductive manner, surpassing GCN and GAT trained in the supervised regime, as well as other inductive baselines.

2024-05-30

ArXiv (preprint)

Towards Foundation Models for Knowledge Graph Reasoning

Hesham Mostafa

Foundation models in language and vision have the ability to run inference on any textual and visual inputs thanks to transferable representations such as a vocabulary of tokens in language. Knowledge graphs (KGs) have different entity and relation vocabularies that generally do not overlap. The key challenge of designing foundation models on KGs is to learn such transferable representations that enable inference on any graph with arbitrary entity and relation vocabularies. In this work, we make a step towards such foundation models and present ULTRA, an approach for learning universal and transferable graph representations. ULTRA builds relational representations as a function conditioned on their interactions. Such a conditioning strategy allows a pre-trained ULTRA model to inductively generalize to any unseen KG with any relation vocabulary and to be fine-tuned on any graph. Conducting link prediction experiments on 57 different KGs, we find that the zero-shot inductive inference performance of a single pre-trained ULTRA model on unseen graphs of various sizes is often on par with or better than strong baselines trained on specific graphs. Fine-tuning further boosts the performance.

2024-01-16

ICLR.cc/2024/Conference (poster)

Zero-shot Logical Query Reasoning on any Knowledge Graph

Jincheng Zhou

Bruno Ribeiro

Complex logical query answering (CLQA) in knowledge graphs (KGs) goes beyond simple KG completion and aims at answering compositional queries composed of multiple projections and logical operations. Existing CLQA methods that learn parameters bound to certain entity or relation vocabularies can only be applied to the graph they are trained on, which requires substantial training time before being deployed on a new graph. Here we present UltraQuery, an inductive reasoning model that can zero-shot answer logical queries on any KG. The core idea of UltraQuery is to derive both projections and logical operations as vocabulary-independent functions which generalize to new entities and relations in any KG. With the projection operation initialized from a pre-trained inductive KG reasoning model, UltraQuery can solve CLQA on any KG even if it is only finetuned on a single dataset. Experimenting on 23 datasets, UltraQuery in the zero-shot inference mode shows competitive or better query answering performance than the best available baselines and sets a new state of the art on 14 of them.

2024-01-01

NeurIPS (published)