Publications

Measure Before You Look: Grounding Embeddings Through Manifold Metrics
Object-Centric Agentic Robot Policies
Executing open-ended natural language queries in previously unseen environments is a core problem in robotics. While recent advances in imit… (see more)ation learning and vision-language modeling have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. Their short input context also limits their ability to solve tasks over larger spatial horizons. In this work, we introduce OCARP, a modular agentic robot policy that executes user queries by using a library of tools on a dynamic inventory of objects. The agent builds the inventory by grounding query-relevant objects using a rich 3D map representation that includes open-vocabulary descriptors and 3D affordances. By combining the flexible reasoning abilities of an agent with a general spatial representation, OCARP can execute complex open-vocabulary queries in a zero-shot manner. We showcase how OCARP can be deployed in both tabletop and mobile settings due to the underlying scalable map representation.
PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation
Ehsan Kamalloo
Rafael Pardinas
Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effecti… (see more)vely scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by the novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately
Self-Supervised Learning from Structural Invariance
Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
Jan Wehner
Sahar Abdelnabi
Daniel Chee Hian Tan
David M. Krueger
Mario Fritz
Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs… (see more) or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.
When Data Falls Short: Grokking Below the Critical Threshold
Why all roads don't lead to Rome: Representation geometry varies across the human visual cortical hierarchy
Zahraa Chorghay
Blake Aaron Richards
AInstein: Can AI Rediscover Scientific Concepts from First Principles?
Shambhavi Mishra
Jose Dolz
Christopher Pal
Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models g… (see more)enuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.
Are Large Language Models Good Temporal Graph Learners?
Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. Wh… (see more)ile a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.
Conditional Adversarial Random Forest for Synthetic Electronic Health Record Generation
CrediBench: Building Web-Scale Network Datasets for Information Integrity
Online misinformation poses an escalating threat, amplified by the Internet's open nature and increasingly capable LLMs that generate persua… (see more)sive yet deceptive content. Existing misinformation detection methods typically focus on either textual content or network structure in isolation, failing to leverage the rich, dynamic interplay between website content and hyperlink relationships that characterizes real-world misinformation ecosystems. We introduce CrediBench: a large-scale data processing pipeline for constructing temporal web graphs that jointly model textual content and hyperlink structure for misinformation detection. Unlike prior work, our approach captures the dynamic evolution of general misinformation domains, including changes in both content and inter-site references over time. Our processed one-month snapshot extracted from the Common Crawl archive in December 2024 contains 45 million nodes and 1 billion edges, representing the largest web graph dataset made publicly available for misinformation research to date. From our experiments on this graph snapshot, we demonstrate the strength of both structural and webpage content signals for learning credibility scores, which measure source reliability. The pipeline and experimentation code are all available here, and the dataset is in this folder.
DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching