Publications

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
Sachin Goyal
Badr Youbi Idrissi
David Lopez-Paz
Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag-of-words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B- and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
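To make the handcrafted variant concrete, here is a minimal sketch of a bag-of-words target over a future window, as one might feed to an auxiliary prediction head. The function name, window size, and the multi-hot (rather than count-valued) encoding are illustrative assumptions, not the paper's exact recipe.

```python
from collections import Counter

def bow_future_summary(tokens, position, vocab_size, window=64):
    """Hypothetical sketch: multi-hot bag-of-words vector over the next
    `window` tokens after `position`, usable as an auxiliary target."""
    future = tokens[position + 1 : position + 1 + window]
    summary = [0.0] * vocab_size
    for tok in Counter(future):
        summary[tok] = 1.0  # multi-hot; raw counts could also be used
    return summary

# Summary of everything after position 2 in a toy sequence:
target = bow_future_summary([3, 1, 4, 1, 5, 9, 2, 6], position=2, vocab_size=10)
```

A count-valued or TF-IDF-weighted variant would be a one-line change in the loop above.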
A Capacitated Collection-and-Delivery-Point Location Problem with Random Utility Maximizing Customers
David Pinzon Ulloa
Ammar Metnani
Comparing Virtual Reality Trauma Training Across Diverse Clinical Backgrounds: A Mixed-Methods Study in Canada and India
Boaz Laor
Samia Benabess
S. Kundu
Ayla Gerk
F. Botelho
Jean-Robert Kwizera
Arjunaditya Kundu
Tom Dolby
Elena Guadagno
Dhruva Ghosh
Vishal Micheal
Rohit Theodore
Thejus Varghese
$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($\mu$P).
Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations
Charlotte Morissette
Anas El Houssaini
Diffusion policies have emerged as powerful generative models for offline policy learning, whose sampling process can be rigorously characterized by a score function guiding a Stochastic Differential Equation (SDE). However, the same score-based SDE modeling that grants diffusion policies the flexibility to learn diverse behavior also incurs solver and score-matching errors, large data requirements, and inconsistencies in action generation. While less critical in image generation, these inaccuracies compound and lead to failure in continuous control settings. We introduce Contractive Diffusion Policies (CDPs) to induce contractive behavior in the diffusion sampling dynamics. Contraction pulls nearby flows closer to enhance robustness against solver and score-matching errors while reducing unwanted action variance. We develop an in-depth theoretical analysis along with a practical implementation recipe to incorporate CDPs into existing diffusion policy architectures with minimal modification and computational cost. We evaluate CDPs for offline learning by conducting extensive experiments in simulation and real-world settings. Across benchmarks, CDPs often outperform baseline policies, with pronounced benefits under data scarcity.
Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware SSL
Sangyoon Bae
Mehdi Azabou
Blake A Richards
Jiook Cha
Self-supervised learning (SSL) holds a great deal of promise for applications in neuroscience, due to the lack of large-scale, consistently labeled neural datasets. However, most neural datasets contain heterogeneous populations that mix stable, predictable cells with highly stochastic, stimulus-contingent ones, which has made it hard to identify consistent activity patterns during SSL. As a result, self-supervised pretraining has yet to show clear signs of benefits from scale on neural data. Here, we present a novel approach to self-supervised pretraining, POYO-SSL, that exploits the heterogeneity of neural data to improve pretraining and achieve benefits of scale. Specifically, in POYO-SSL we pretrain only on predictable neurons, identified on the pretraining split via simple higher-order statistics (skewness and kurtosis), then we fine-tune on the unpredictable population for downstream tasks. On the Allen Brain Observatory dataset, this strategy yields approximately 12-13% relative gains over from-scratch training and exhibits smooth, monotonic scaling with model size. In contrast, existing state-of-the-art baselines plateau or destabilize as model size increases. By making predictability an explicit metric for crafting the data diet, POYO-SSL turns heterogeneity from a liability into an asset, providing a robust, biologically grounded recipe for scalable neural decoding and a path toward foundation models of neural dynamics.
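The screening step the abstract describes (higher-order statistics over each neuron's trace) can be sketched as follows. The thresholds and the exact rule combining skewness and kurtosis are illustrative assumptions; the paper defines its own criterion on the pretraining split.

```python
import numpy as np

def split_by_predictability(activity, skew_thresh=2.0, kurt_thresh=10.0):
    """Hypothetical selection rule (thresholds illustrative): treat a neuron
    as 'predictable' when its trace has modest skewness and excess kurtosis.
    `activity` has shape (neurons, time); returns a boolean mask."""
    centered = activity - activity.mean(axis=1, keepdims=True)
    var = (centered ** 2).mean(axis=1)
    skew = (centered ** 3).mean(axis=1) / var ** 1.5
    kurt = (centered ** 4).mean(axis=1) / var ** 2 - 3.0  # excess kurtosis
    return (np.abs(skew) < skew_thresh) & (kurt < kurt_thresh)

rng = np.random.default_rng(0)
smooth = rng.normal(size=(1, 5000))                   # roughly Gaussian trace
spiky = np.zeros((1, 5000)); spiky[0, ::100] = 50.0   # rare large events
mask = split_by_predictability(np.vstack([smooth, spiky]))
```

The Gaussian-like trace passes the filter while the sparse, spiky one does not, mirroring the stable-vs-stimulus-contingent split described above.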
A Deep Learning and Inertia-Aware Load Shedding Framework for Mitigating Load-Altering Attacks
Anoosh Dini
Keyhan Sheshyekani
The widespread integration of information and communication technologies into modern power systems has increased their vulnerability to cyber-physical threats, such as load-altering attacks (LAA). These attacks can cause rapid load changes, potentially triggering protective mechanisms like under-frequency load shedding (UFLS). Existing approaches for mitigating these attacks are limited, and they mostly rely on preventive measures or neglect system dynamics. In this paper, we propose a novel online framework for the detection and mitigation of LAAs that addresses these limitations. The detection component employs a convolutional neural network–long short-term memory autoencoder (CNN-LSTM AE) architecture to capture anomalies in load consumption data. For mitigation, we propose an inertia-aware load shedding scheme that dynamically adjusts the shedding amount based on the real-time frequency and the magnitude of the attack. This approach prevents overshedding caused by predefined UFLS relay settings and mitigates undershedding by considering the system’s real-time inertia. To this end, a variable forgetting factor recursive least squares (VFF-RLS) algorithm is proposed, which can track inertia variations within a few seconds. The proposed framework is compatible with both synchronous generator-based and converter-interfaced generator-dominated grids. Simulations indicate the effectiveness of the proposed framework in maintaining frequency stability under a wide range of attack scenarios.
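The estimator family behind VFF-RLS can be sketched with a plain scalar forgetting-factor RLS. The swing-equation-style regression (load imbalance proportional to 2H times the frequency derivative) and the fixed forgetting factor below are simplifying assumptions; the paper's contribution is the rule for varying the factor online, which is not reproduced here.

```python
import numpy as np

def ff_rls(xs, ys, lam=0.98, p0=1e3):
    """Scalar recursive least squares with forgetting factor `lam`:
    fits y ~ theta * x while discounting old samples."""
    theta, P = 0.0, p0
    for x, y in zip(xs, ys):
        k = P * x / (lam + x * P * x)     # gain
        theta += k * (y - x * theta)      # innovation update
        P = (P - k * x * P) / lam         # covariance update with forgetting
    return theta

# Illustrative swing-equation-style data: y = 2H * df/dt + noise
rng = np.random.default_rng(1)
dfdt = rng.normal(size=500)
H_true = 4.0
y = 2 * H_true * dfdt + 0.01 * rng.normal(size=500)
H_est = ff_rls(dfdt, y) / 2.0
```

With forgetting (`lam < 1`) the estimate keeps adapting, which is what lets an inertia tracker follow changes within seconds rather than averaging over the whole history.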
Diffusion tractography outside the brain: the road less travelled
Kurt G. Schilling
Irvin Teh
Richard Dortch
Ibrahim Ibrahim
Nian Wang
Bruce Damon
Rory L. Cochran
Alexander Leemans
Diffusion tractography is a powerful MRI technique for mapping fibrous tissue architecture, traditionally applied to the white matter of the brain. This report surveys the growing application of tractography to anatomical structures outside the brain, a domain that presents both unique challenges and unique opportunities. We examine its use in the heart, spinal cord, peripheral nerves, brachial plexus, kidney, skeletal muscle, and prostate. For each region, we detail the necessary methodological adaptations for acquisition, modeling, and processing, and highlight the unique anatomical information that can be derived for research and clinical applications. While significant challenges remain - from technical hurdles like physiological motion and susceptibility artifacts to biological complexities like lower anisotropy and the interpretation of streamline validity - tractography beyond the brain provides invaluable, non-invasive insights into tissue micro-organization, opening a new frontier for biomedical imaging.
Discovering Diverse Behaviors via Temporal Contrastive Learning
Catherine Ji
Benjamin Eysenbach
Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.
DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging
Kotaro Yoshida
Yuji Naraki
Takafumi Horie
Ryotaro Shimizu
Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years. However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored. In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them. Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address this issue, we propose **DisTaC** (**Dis**tillation for **Ta**sk vector **C**onditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector's norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models that exhibit these harmful traits, where they would otherwise fail, and achieve significant performance gains.
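For readers new to task vectors, a minimal sketch of the quantities involved: a task vector is the fine-tuned weights minus the pretrained weights, and a task-arithmetic merge adds scaled task vectors back onto the pretrained model. The direct norm rescaling below only illustrates the quantity DisTaC controls; the method itself conditions task vectors via distillation, not by rescaling.

```python
import numpy as np

def task_vector(theta_pre, theta_ft):
    """Task vector: fine-tuned weights minus pretrained weights."""
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def rescale_norm(tau, target_norm):
    """Illustrative norm equalization across task vectors (the quantity
    identified as harmful when source models disagree on it)."""
    norm = np.sqrt(sum(float(np.sum(v * v)) for v in tau.values()))
    return {k: v * (target_norm / norm) for k, v in tau.items()}

def merge(theta_pre, taus, alpha=0.3):
    """Task-arithmetic merge: pretrained weights plus scaled task vectors."""
    merged = {k: v.copy() for k, v in theta_pre.items()}
    for tau in taus:
        for k, v in tau.items():
            merged[k] = merged[k] + alpha * v
    return merged
```

Equalizing norms before the merge keeps any one source model from dominating the sum, which is the failure mode the abstract points to.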
Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise
Constrained optimization is a powerful framework for enforcing requirements on neural networks. These constrained deep learning problems are typically solved using first-order methods on their min-max Lagrangian formulation, but such approaches often suffer from oscillations and can fail to find all local solutions. While the Augmented Lagrangian method (ALM) addresses these issues, practitioners often favor dual optimistic ascent schemes (PI control) on the standard Lagrangian, which perform well empirically but lack formal guarantees. In this paper, we establish a previously unknown equivalence between these approaches: dual optimistic ascent on the Lagrangian is equivalent to gradient descent-ascent on the Augmented Lagrangian. This finding allows us to transfer the robust theoretical guarantees of the ALM to the dual optimistic setting, proving it converges linearly to all local solutions. Furthermore, the equivalence provides principled guidance for tuning the optimism hyper-parameter. Our work closes a critical gap between the empirical success of dual optimistic methods and their theoretical foundation.
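For orientation, a brief sketch of the objects being related (notation mine; equality constraints and the exact form of the optimism term are simplifications, and the precise correspondence is the paper's contribution):

```latex
% Lagrangian and Augmented Lagrangian for min f(x) subject to g(x) = 0:
L(x, \lambda) = f(x) + \lambda^\top g(x)
L_\rho(x, \lambda) = f(x) + \lambda^\top g(x) + \tfrac{\rho}{2} \|g(x)\|^2

% Dual optimistic ascent (PI control) on L: a proportional term g(x_t)
% plus an optimistic ("derivative") correction weighted by \kappa:
\lambda_{t+1} = \lambda_t + \eta \left[ g(x_t) + \kappa \bigl( g(x_t) - g(x_{t-1}) \bigr) \right]

% Gradient descent-ascent on L_\rho uses
% \nabla_x L_\rho = \nabla f(x) + J_g(x)^\top \bigl( \lambda + \rho\, g(x) \bigr),
% so the penalty \rho shifts the multiplier seen by the primal step; the paper
% makes the correspondence between the optimism weight \kappa and \rho precise.
```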
An Efficient Model Maintenance Approach for MLOps
Heng Li
Amin Nikanjam
In recent years, many industries have utilized machine learning (ML) models in their systems. Ideally, machine learning models should be trained on and applied to data from the same distribution. However, in many application areas the data evolves over time, leading to data and concept drift, which in turn causes the performance of ML models to degrade. Maintaining up-to-date ML models therefore plays a critical role in the MLOps pipeline. Existing ML model maintenance approaches are often computationally resource-intensive, costly, time-consuming, and model-dependent. Thus, we propose an improved MLOps pipeline, a new model maintenance approach, and a Similarity-Based Model Reuse (SimReuse) tool to address the challenges of ML model maintenance. In a preliminary study, we identify seasonal and recurrent distribution patterns in time series datasets. Recurrent distribution patterns enable us to reuse previously trained models for similar distributions in the future, thus avoiding frequent retraining. We then integrate the model reuse approach into the MLOps pipeline to obtain our improved pipeline. Furthermore, we develop SimReuse, a tool that implements the new components of our pipeline: it stores models and reuses them for inference on data segments with similar data distributions. Our evaluation on four time series datasets demonstrates that our model reuse approach can maintain model performance while significantly reducing maintenance time and costs, achieving ML performance comparable to the best baseline while being 15 times more efficient in terms of computation time and cost. Industries and practitioners can therefore benefit from our approach and use our tool to maintain the performance of their deployed ML models at lower maintenance cost.
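The reuse decision described above hinges on a distribution-similarity test between a new data segment and the segments past models were trained on. The paper's actual metric and threshold are not specified here, so the two-sample Kolmogorov-Smirnov statistic and the cutoff below are illustrative assumptions.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the
    empirical CDFs of samples `a` and `b`."""
    pooled = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), pooled, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), pooled, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def pick_model(segment, model_store, threshold=0.1):
    """Hypothetical reuse rule: return a stored model whose training
    segment matches the new one; None means retraining is needed."""
    for ref_segment, model in model_store:
        if ks_statistic(segment, ref_segment) < threshold:
            return model
    return None
```

When a segment matches a stored distribution, inference reuses the old model at essentially zero cost; only unmatched segments trigger the expensive retraining path.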