Publications

Gistify: Codebase-Level Understanding via Runtime Execution

Hyunji Lee

Minseon Kim

Chinmay Singh

Matheus Pereira

Atharv Sonwane

Isadora White

Elias Stengel-Eskin

Mohit Bansal

Zhengxiang Shi

Alessandro Sordoni

Marc-Alexandre Côté

Eric Yuan

Lucas Caccia

As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a Python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

openreview.net

GraphOmni: A Comprehensive and Extensible Benchmark Framework for Large Language Models on Graph-theoretic Tasks

Hao Xu

Xiangru Jian

Xinjian Zhao

Wei Pang

Chao Zhang

Suyuchen Wang

Qixin Zhang

Zhengyuan Dong

Joao Monteiro

Bang Liu

Qiuzhuang Sun

Tianshu Yu

This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni spans diverse graph types, serialization formats, and prompting schemes, substantially extending upon prior efforts in both scope and depth. Through systematic evaluation, we uncover critical interactions among these dimensions, revealing their decisive impact on model performance. Our experiments show that state-of-the-art closed-source models such as Claude-3.5 and o4-mini consistently lead overall, yet still leave considerable headroom, while open-source models display pronounced sensitivity to various design choices. Beyond the standard scope, larger graphs, real-world graphs, and additional NP-hard tasks are further discussed. We further analyze efficiency via output token usage, highlighting cost–accuracy trade-offs, and introduce a reinforcement learning-based optimizer that adaptively selects factor combinations, reducing evaluation cost by 75% while retaining strong accuracy. This flexible and extensible benchmark not only deepens understanding of LLM performance on structured graph reasoning but also establishes a robust foundation for advancing model design and evaluation. The code and datasets are available at https://anonymous.4open.science/r/ID-14092.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

Grounding Computer Use Agents on Human Demonstrations

Aarash Feizi

Shravan Nayak

Xiangru Jian

Kevin Qinghong Lin

Kaixin Li

Rabiul Awal

Xing Han Lu

Johan Obando-Ceron

Juan A. Rodriguez

Nicolas Chapados

David Vázquez

Adriana Romero-Soriano

Reihaneh Rabbany

Perouz Taslakian

Christopher Pal

Spandana Gella

Sai Rajeswar

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

Heterogeneous Low-Bandwidth Pre-Training of LLMs

Yazan Obeidi

Amir Sarfi

Joel Lidin

Paul Janson

Eugene Belilovsky

Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters, especially when model parallelism forces frequent, large inter-device communications. We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study a number of adaptations. Across large-scale language modeling experiments (178M-1B parameters) on standard pretraining corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff relative to compressing all replicas, especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.

2025-12-31

arXiv (preprint)

doi.org

openreview.net

h-MINT: Modeling Pocket-Ligand Binding with Hierarchical Molecular Interaction Network

Yanru Qu

Yijie Zhang

Wenjuan Tan

Xiangzhe Kong

Xiangxin Zhou

Chaoran Cheng

Mathieu Blanchette

Jiaxuan You

Ge Liu

Accurate molecular representations are critical for drug discovery, and a central challenge lies in capturing the chemical environment of molecular fragments, as key interactions, such as H-bond and π stacking, occur only under specific local conditions. Most existing approaches represent molecules as atom-level graphs; however, individual atoms cannot express stereochemistry, lone pairs, conjugation, and other complex features. Fragment-based methods (e.g., principal subgraph or functional group libraries) fail to preserve essential information such as chirality, aromatic bond integrity, and ionic states. This work addresses these limitations from two aspects. (i) **OverlapBPE tokenization**. We propose a novel data-driven molecule tokenization method. Unlike existing approaches, our method allows overlapping fragments, reflecting the inherently fuzzy boundaries of small-molecule substructures and, together with enriched chemical information at the token level, thereby preserving a more complete chemical context. (ii) **h-MINT model**. We develop a hierarchical molecular interaction network capable of jointly modeling drug–target interactions at both atom and fragment levels. By supporting fragment overlaps, the model naturally accommodates the many-to-many atom–fragment mappings introduced by the OverlapBPE scheme. Extensive evaluation against state-of-the-art methods shows our method improves binding affinity prediction by 2-4% Pearson/Spearman correlation on PDBBind and LBA, enhances virtual screening by 1-3% in key metrics on DUD-E and LIT-PCBA, and achieves the best overall HTS performance on PubChem assays. Further analysis demonstrates that our method effectively captures interactive information while maintaining good generalization.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

openreview.net

How AI Is Reshaping Pricing Litigation

Maxime C. Cohen

2025-12-31

SSRN Electronic Journal (published)

doi.org

Impact of an LLM-based Review Assistant in Practice: A Mixed Open-/Closed-source Case Study

Doriane Olewicki

Leuson Da Silva

Oussama Ben Sghaier

Suhaib Mujahid

Arezou Amini

Benjamin Mah

Marco Castelluccio

Sarra Habchi

Foutse Khomh

Bram Adams

2025-12-31

IEEE Transactions on Software Engineering (published)

doi.org

In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior

Anaïs Berkes

In-context reinforcement learning (ICRL) promises fast adaptation to unseen environments without parameter updates, but current methods either cannot improve beyond the training distribution or require near-optimal data, limiting practical adoption. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via deep ensemble and updates this prior at test-time using in-context information through Bayesian updates. To recover from poor priors resulting from training on sub-optimal data, our online inference follows an Upper-Confidence Bound rule that favours exploration and adaptation. We prove that SPICE achieves regret-optimal behaviour in both stochastic bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories. We validate these findings empirically across bandit and control benchmarks. SPICE achieves near-optimal decisions on unseen tasks, substantially reduces regret compared to prior ICRL and meta-RL approaches while rapidly adapting to unseen tasks and remaining robust under distribution shift.

2025-12-31

arXiv (preprint)

doi.org

arxiv.org

Inference-time Physics Alignment of Video Generative Models with Latent World Models

Jianhao Yuan

Xiaofeng Zhang

Felix Friedrich

Nicolas Beltran-Velez

Melissa Hall

Reyhane Askari-Hemmat

Xiaochuang Han

Nicolas Ballas

Michal Drozdzal

Adriana Romero-Soriano

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from a human preference study. Notably, on the challenging PhysicsIQ benchmark we achieve a 62.00% final score, outperforming the previous state of the art by 6.78%. Our work demonstrates the viability of using latent world models to improve physical plausibility of video generation, beyond this specific instantiation or parameterization.

2025-12-31

IEEE/CVF Conference on Computer Vision and Pattern Recognition (Accept (Highlight))

doi.org

arxiv.org

Integrating Generative and Experimental Platforms for Biomolecular Design

Chenghao Liu

Jarrid Rector-Brooks

Soojung Yang

Sidney Lisanza

Jacob Gershon

Lauren Hong

Pranam Chatterjee

Yoshua Bengio

Biomolecular design, through artificial engineering of proteins, ligands, nucleic acids, and cells, holds immense promise in addressing pressing medical, industrial, and environmental challenges. While generative machine learning has shown significant potential in this area, a disconnect exists with experimental biology: many ML research efforts prioritize static benchmark performance, potentially sidelining impactful biological applications. This workshop seeks to bridge this gap by bringing computationalists and experimentalists together, catalyzing a deeper interdisciplinary discourse. Together, we will explore the strengths and challenges of generative ML in biology, experimental integration of generative ML, and biological problems ready for ML. To attract high-quality and diverse research, we partnered with Nature Biotechnology for a special collection, and we created dedicated tracks for in-silico ML research and hybrid ML-experimental biology research. Our lineup features emerging leaders as speakers and renowned scientists as panelists, encapsulating a spectrum from high-throughput experimentation and computational biology to generative ML. To catalyze new collaborations, we will host a seed-grant competition for pairs of experimentalists and computationalists proposing fresh joint projects. To connect dry and wet lab practice, a wet-lab challenge sponsored by Adaptyv Bio will empirically evaluate protein design models. With a diverse organizing team and backed by industry sponsors, we dedicate the workshop to pushing the boundaries of ML's role in biology. This will be the third edition of this workshop following the previous versions of it we organized at ICLR 2024 and 2025.

2025-12-31

Workshop Proposals @ International Conference on Learning Representations (published)

openreview.net

Interpreting Physics in Video World Models

Sonia Joseph

Quentin Garrido

Randall Balestriero

Matthew Kowal

Thomas Fel

Shahab Bakhtiari

Blake Richards

Mike Rabbat

A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition, which we call the Physics Emergence Zone, at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.

2025-12-31

International Conference on Machine Learning (Accept (regular))

doi.org

openreview.net

KILO-EKF: Koopman-Inspired Learned Observations Extended Kalman Filter

Zi Cong Guo

James Richard Forbes

Timothy D. Barfoot

We present the Koopman-Inspired Learned Observations Extended Kalman Filter (KILO-EKF), which combines a standard EKF prediction step with a correction step based on a Koopman-inspired measurement model learned from data. By lifting measurements into a feature space where they are linear in the state, KILO-EKF enables flexible modeling of complex or poorly calibrated sensors while retaining the structure and efficiency of recursive filtering. The resulting linear-Gaussian measurement model is learned in closed form from ground-truth training data, without iterative optimization or reliance on an explicit parametric sensor model. At inference, KILO-EKF performs a standard EKF update using Jacobians obtained via the learned lifting. We validate the approach on a real-world quadrotor localization task using an IMU, ultra-wideband (UWB) sensors, and a downward-facing laser. We compare against multiple EKF baselines with varying levels of sensor calibration. KILO-EKF achieves better accuracy and consistency compared to data-calibrated baselines, and significantly outperforms EKFs that rely on imperfect geometric models, while maintaining real-time inference and fast training. These results demonstrate the effectiveness of Koopman-inspired measurement learning as a scalable alternative to traditional model-based calibration.

2025-12-31

arXiv (preprint)

doi.org

arxiv.org
