Publications

Human-AI Alignment of Learning Trajectories in Video Games: a continual RL benchmark proposal

Yann Harel

Lune P Bellec

We propose a design for a continual reinforcement learning (CRL) benchmark called GHAIA, centered on human-AI alignment of learning trajecto… (see more)ries in structured video game environments. Using \textit{Super Mario Bros.} as a case study, gameplay is decomposed into short, annotated scenes organized into diverse task sequences based on gameplay patterns and difficulty. Evaluation protocols measure both plasticity and stability, with flexible revisit and pacing schedules. A key innovation is the inclusion of high-resolution human gameplay data collected under controlled conditions, enabling direct comparison of human and agent learning. In addition to adapting classical CRL metrics like forgetting and backward transfer, we introduce semantic transfer metrics capturing learning over groups of scenes sharing similar game patterns. We demonstrate the feasibility of our approach on human and agent data, and discuss key aspects of the first release for community input.

2025-06-19

rl-conference.cc/RLC/2025/Workshop/RLVG (accepted)

openreview.net

An Empirical Study of Sensitive Information in Logs

Roozbeh Aghili

Heng Li

Foutse Khomh

2025-06-18

Proceedings of the ACM on Software Engineering (published)

doi.org

arxiv.org

Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective

The rapid adaptation ability of auto-regressive foundation models is often attributed to the diversity of their pre-training data. This is b… (see more)ecause, from a Bayesian standpoint, minimizing prediction error in such settings requires integrating over all plausible latent hypotheses consistent with observations. While this behavior is desirable in principle, it often proves too ambitious in practice: under high ambiguity, the number of plausible latent alternatives makes Bayes-optimal prediction computationally intractable. Cognitive science has long recognized this limitation, suggesting that under such conditions, heuristics or information-seeking strategies are preferable to exhaustive inference. Translating this insight to next-token prediction, we hypothesize that low- and high-ambiguity predictions pose different computational demands, making ambiguity-agnostic next-token prediction a detrimental inductive bias. To test this, we introduce MetaHMM, a synthetic sequence meta-learning benchmark with rich compositional structure and a tractable Bayesian oracle. We show that Transformers indeed struggle with high-ambiguity predictions across model sizes. Motivated by cognitive theories, we propose a method to convert pre-trained models into Monte Carlo predictors that decouple task inference from token prediction. Preliminary results show substantial gains in ambiguous contexts through improved capacity allocation and test-time scalable inference, though challenges remain.

2025-06-18

ArXiv (preprint)

doi.org

arxiv.org

SlepNet: Spectral Subgraph Representation Learning for Neural Dynamics

Siddharth Viswanath

Rahul Singh

Yanlei Zhang

J. Adam Noah

Joy Hirsch

Smita Krishnaswamy

Graph neural networks have been useful in machine learning on graph-structured data, particularly for node classification and some types of … (see more)graph classification tasks. However, they have had limited use in representing patterning of signals over graphs. Patterning of signals over graphs and in subgraphs carries important information in many domains including neuroscience. Neural signals are spatiotemporally patterned, high dimensional and difficult to decode. Graph signal processing and associated GCN models utilize the graph Fourier transform and are unable to efficiently represent spatially or spectrally localized signal patterning on graphs. Wavelet transforms have shown promise here, but offer non-canonical representations and cannot be tightly confined to subgraphs. Here we propose SlepNet, a novel GCN architecture that uses Slepian bases rather than graph Fourier harmonics. In SlepNet, the Slepian harmonics optimally concentrate signal energy on specifically relevant subgraphs that are automatically learned with a mask. Thus, they can produce canonical and highly resolved representations of neural activity, focusing energy of harmonics on areas of the brain which are activated. We evaluated SlepNet across three fMRI datasets, spanning cognitive and visual tasks, and two traffic dynamics datasets, comparing its performance against conventional GNNs and graph signal processing constructs. SlepNet outperforms the baselines in all datasets. Moreover, the extracted representations of signal patterns from SlepNet offers more resolution in distinguishing between similar patterns, and thus represent brain signaling transients as informative trajectories. Here we have shown that these extracted trajectory representations can be used for other downstream untrained tasks. Thus we establish that SlepNet is useful both for prediction and representation learning in spatiotemporal data.

2025-06-18

ArXiv (preprint)

doi.org

arxiv.org

Learning From the Past with Cascading Eligibility Traces

Tokiniaina Raharison Ralambomihanta

Ivan Anokhin

Roman Pogodin

S Ebrahimi Kahou

Jonathan Cornford

Blake Aaron Richards

2025-06-16

ArXiv (preprint)

arxiv.org

Less is More: Undertraining Experts Improves Model Upcycling

Stefan Horoi

Eugene Belilovsky

Guy Wolf

Gintare Karolina Dziugaite

Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized dataset… (see more)s. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. To leverage these resources, numerous model upcycling methods have emerged, enabling the reuse of fine-tuned models in multi-task systems. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then upcycled into more general-purpose systems. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model upcycling. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance, both for fully fine-tuned and LoRA-adapted models, and to worse downstream results when LoRA adapters are upcycled into MoE layers. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps and are subsequently forgotten during merging. Finally, we demonstrate that a task-dependent aggressive early stopping strategy can significantly improve upcycling performance.

2025-06-16

arXiv (Cornell University) (preprint)

doi.org

arxiv.org

Scalable Tree Search over Graphs with Learned Action Pruning for Power Grid Control

Florence Cloutier

Cyrus Neary

Adriana Hugessen

Viktor Todosijevic

Zina Kamel

Glen Berseth

As real-world infrastructure systems become increasingly complex and large-scale, there is a growing need for learning-based control strateg… (see more)ies that can make informed decisions in complex and dynamic environments. However, large-scale problems — such as power grid control — introduce high-dimensional action spaces and necessitate transferability across varying grid topologies. We introduce **H**ierarchical **E**xpert-Guided **R**econfiguration **O**ptimization for **G**raph **T**opologies, **HERO-GT**, a model-based planning approach that combines a pretrained graph neural network (GNN) for topology-aware action pruning with a Monte Carlo Tree Search (MCTS) planner for targeted, structured exploration. More specifically, the high-level GNN predicts a promising subset of actions, which the low-level MCTS agent uses to focus its search and reduce computational overhead while remaining adaptable to unseen graph structures. Furthermore, the MCTS planner leverages a given *default policy*---which may be defined, for example, by heuristics, problem relaxations, or rule-based methods---to bias the search and prioritize actions that are expected to improve performance over the default. We deploy HERO-GT in power grid environments, demonstrating that it not only improves over a strong default policy, but also scales to a realistic operational setting where exhaustive search becomes computationally infeasible.

2025-06-16

rl-conference.cc/RLC/2025/Workshop/RL4RS (published)

openreview.net

Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning

Martin Klissarov

Akhil Bagaria

Ziyan Luo

George Konidaris

Doina Precup

Marlos C. Machado

Developing agents capable of exploring, planning and learning in complex open-ended environments is a grand challenge in artificial intellig… (see more)ence (AI). Hierarchical reinforcement learning (HRL) offers a promising solution to this challenge by discovering and exploiting the temporal structure within a stream of experience. The strong appeal of the HRL framework has led to a rich and diverse body of literature attempting to discover a useful structure. However, it is still not clear how one might define what constitutes good structure in the first place, or the kind of problems in which identifying it may be helpful. This work aims to identify the benefits of HRL from the perspective of the fundamental challenges in decision-making, as well as highlight its impact on the performance trade-offs of AI agents. Through these benefits, we then cover the families of methods that discover temporal structure in HRL, ranging from learning directly from online experience to offline datasets, to leveraging large language models (LLMs). Finally, we highlight the challenges of temporal structure discovery and the domains that are particularly well-suited for such endeavours.

2025-06-15

ArXiv (preprint)

doi.org

arxiv.org

Can GPT4 Generate Effective Feedback on Code Readability?

Xiaotian Su

Yajie Song

Marcus Messer

Jaromir Savelka

Maria Cutumisu

April Wang

2025-06-12

Annual Conference on Innovation and Technology in Computer Science Education (published)

doi.org

MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning

Wenhao Gu

Li Gu

Ching Yee Suen

Yang Wang

Recent advancements in handwritten text recognition (HTR) have enabled the effective conversion of handwritten text to digital formats. Howe… (see more)ver, achieving robust recognition across diverse writing styles remains challenging. Traditional HTR methods lack writer-specific personalization at test time due to limitations in model architecture and training strategies. Existing attempts to bridge this gap, through gradient-based meta-learning, still require labeled examples and suffer from parameter-inefficient fine-tuning, leading to substantial computational and memory overhead. To overcome these challenges, we propose an efficient framework that formulates personalization as prompt tuning, incorporating an auxiliary image reconstruction task with a self-supervised loss to guide prompt adaptation with unlabeled test-time examples. To ensure self-supervised loss effectively minimizes text recognition error, we leverage meta-learning to learn the optimal initialization of the prompts. As a result, our method allows the model to efficiently capture unique writing styles by updating less than 1% of its parameters and eliminating the need for time-intensive annotation processes. We validate our approach on the RIMES and IAM Handwriting Database benchmarks, where it consistently outperforms previous state-of-the-art methods while using 20x fewer parameters. We believe this represents a significant advancement in personalized handwritten text recognition, paving the way for more reliable and practical deployment in resource-constrained scenarios.

2025-06-12

IEEE/CVF Conference on Computer Vision and Pattern Recognition (Accept)

doi.org

arxiv.org

Exploration by Exploitation: Curriculum Learning for Reinforcement Learning Agents through Competence-Based Curriculum Policy Search

Tabitha Edith Lee

Nan Rosemary Ke

Sarvesh Patil

Annya Dahmani

Eunice Yiu

Esra'a Saleh

Alison Gopnik

Oliver Kroemer

Glen Berseth

2025-06-11

ICML.cc/2025/Workshop/EXAIT (poster)

openreview.net

Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving

Luke Rowe

Rodrigue De Schaetzen

Roger Girgis

Christopher Pal

Liam Paull

We present Poutine, a 3B-parameter vision-language model (VLM) tailored for end-to-end autonomous driving in long-tail driving scenarios. Po… (see more)utine is trained in two stages. To obtain strong base driving capabilities, we train Poutine-Base in a self-supervised vision-language-trajectory (VLT) next-token prediction fashion on 83 hours of CoVLA nominal driving and 11 hours of Waymo long-tail driving. Accompanying language annotations are auto-generated with a 72B-parameter VLM. Poutine is obtained by fine-tuning Poutine-Base with Group Relative Policy Optimization (GRPO) using less than 500 preference-labeled frames from the Waymo validation set. We show that both VLT pretraining and RL fine-tuning are critical to attain strong driving performance in the long-tail. Poutine-Base achieves a rater-feedback score (RFS) of 8.12 on the validation set, nearly matching Waymo's expert ground-truth RFS. The final Poutine model achieves an RFS of 7.99 on the official Waymo test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin. These results highlight the promise of scalable VLT pre-training and lightweight RL fine-tuning to enable robust and generalizable autonomy.

2025-06-11

ArXiv (preprint)

doi.org

arxiv.org

TRAIL: Responsible AI for Professionals and Leaders

Mila Ventures Founder in Residence

AI Advantage: Productivity in Public Service

Publications

TRAIL: Responsible AI for Professionals and Leaders

Mila Ventures Founder in Residence

AI Advantage: Productivity in Public Service

Popular keywords:

Publications