Publications

How to Get Your LLM to Generate Challenging Problems for Evaluation
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further underscores the need to develop scalable and automatically renewable evaluation methodologies. In this work, we introduce **CHASE**, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drastically drops by as much as 70% when we scale up the context size to 50k tokens.
Inferring dynamical features from neural data through joint learning of latent factors and weights
Anirudh Gururaj Jamkhandi
Matthew G Perich
Behavior arises from coordinated synaptic changes in recurrent neural populations. Inferring the underlying dynamics from limited, noisy, and high-dimensional neural recordings is a major challenge, as experimental data often provide only partial access to brain states. While data-driven recurrent neural networks (dRNNs) have been effective for modeling such dynamics, they are typically limited to single-task domains and struggle to generalize across behavioral conditions. Here, we propose a hierarchical model that captures neural dynamics across multiple behavioral contexts by learning a shared embedding space over RNN weights. We demonstrate that our model captures diverse neural dynamics with a single, unified model using both simulated datasets of many tasks and neural recordings during monkey reaching. Using the learned task embeddings, we demonstrate accurate classification of dynamical regimes and generalization to unseen samples. Crucially, spectral analysis of the learned weights provides valuable insights into network computations, highlighting the potential of joint embedding–weight learning for scalable inference of brain dynamics.
Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs
Sangwoo Cho
Supriyo Chakraborty
Shi-Xiong Zhang
Sambit Sahu
Genta Indra Winata
Measure Before You Look: Grounding Embeddings Through Manifold Metrics
Object-Centric Agentic Robot Policies
Executing open-ended natural language queries in previously unseen environments is a core problem in robotics. While recent advances in imitation learning and vision-language modeling have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. Their short input context also limits their ability to solve tasks over larger spatial horizons. In this work, we introduce OCARP, a modular agentic robot policy that executes user queries by using a library of tools on a dynamic inventory of objects. The agent builds the inventory by grounding query-relevant objects using a rich 3D map representation that includes open-vocabulary descriptors and 3D affordances. By combining the flexible reasoning abilities of an agent with a general spatial representation, OCARP can execute complex open-vocabulary queries in a zero-shot manner. We showcase how OCARP can be deployed in both tabletop and mobile settings due to the underlying scalable map representation.
PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation
Ehsan Kamalloo
Rafael Pardinas
Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by the novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately
Self-Supervised Learning from Structural Invariance
Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
Jan Wehner
Sahar Abdelnabi
Daniel Chee Hian Tan
David M. Krueger
Mario Fritz
Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.
When Data Falls Short: Grokking Below the Critical Threshold
Why all roads don't lead to Rome: Representation geometry varies across the human visual cortical hierarchy
Zahraa Chorghay
Blake Aaron Richards
AInstein: Can AI Rediscover Scientific Concepts from First Principles?
Shambhavi Mishra
Jose Dolz
Christopher Pal
Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models genuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.
Are Large Language Models Good Temporal Graph Learners?
Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. While a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real-world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.
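The idea of using recency to pick structural context and then rendering it as natural language can be illustrated with a small sketch. This is not the TGTalker implementation (see the repository linked above for that); the edge format, the function names `recent_neighbors` and `to_prompt`, the cutoff `k`, and the prompt wording are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical edge format: (source node, destination node, timestamp).
Edge = Tuple[int, int, int]


def recent_neighbors(edges: List[Edge], node: int, query_time: int, k: int = 3) -> List[Edge]:
    """Return up to k edges touching `node` that occurred strictly before
    `query_time`, most recent first (a simple recency-based selection)."""
    past = [e for e in edges if e[2] < query_time and node in (e[0], e[1])]
    past.sort(key=lambda e: e[2], reverse=True)
    return past[:k]


def to_prompt(edges: List[Edge], node: int, query_time: int, k: int = 3) -> str:
    """Render the selected edges as plain sentences followed by a
    link-prediction question (illustrative wording only)."""
    lines = [
        f"At time {t}, node {u} interacted with node {v}."
        for u, v, t in recent_neighbors(edges, node, query_time, k)
    ]
    question = f"Which node will node {node} interact with at time {query_time}?"
    return "\n".join(lines + [question])


if __name__ == "__main__":
    toy_edges = [(1, 2, 10), (1, 3, 20), (4, 1, 30), (2, 3, 35), (1, 5, 40)]
    print(to_prompt(toy_edges, node=1, query_time=40, k=2))
```

The resulting string could be passed to any LLM as context. Restricting the history to the `k` most recent interactions keeps the prompt short, at the cost of discarding older structure.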