Publications

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Johannes Von Oswald

Nino Scherrer

Seijin Kobayashi

Luca Versari

Songlin Yang

Maximilian Schlegel

Kaitlin Maile

Yanick Schimpf

Oliver Sieberling

Alexander Meulemans

Rif A. Saurous

Guillaume Lajoie

Charlotte Frenkel

Razvan Pascanu

Blaise Agüera y Arcas

João Sacramento

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.

2025-05-31

arXiv (published)

doi.org

arxiv.org

Self-Refining Training for Amortized Density Functional Theory

Majdi Hassan

Cristian Gabellini

Hatem Helal

Dominique Beaini

Kirill Neklyudov

Density Functional Theory (DFT) allows for predicting all the chemical and physical properties of molecular systems from first principles by finding an approximate solution to the many-body Schrödinger equation. However, the cost of these predictions becomes infeasible when increasing the scale of the energy evaluations, e.g., when calculating the ground-state energy for simulating molecular dynamics. Recent works have demonstrated that, for substantially large datasets of molecular conformations, Deep Learning-based models can predict the outputs of the classical DFT solvers by amortizing the corresponding optimization problems. In this paper, we propose a novel method that reduces the dependency of amortized DFT solvers on large pre-collected datasets by introducing a self-refining training strategy. Namely, we propose an efficient method that simultaneously trains a deep-learning model to predict the DFT outputs and samples molecular conformations that are used as training data for the model. We derive our method as a minimization of the variational upper bound on the KL-divergence measuring the discrepancy between the generated samples and the target Boltzmann distribution defined by the ground state energy. To demonstrate the utility of the proposed scheme, we perform an extensive empirical study comparing it with the models trained on the pre-collected datasets. Finally, we open-source our implementation of the proposed algorithm, optimized with asynchronous training and sampling stages, which enables simultaneous sampling and training. Code is available at https://github.com/majhas/self-refining-dft.

2025-05-31

arXiv (published)

doi.org

arxiv.org

Sparse-Reg: Improving Sample Complexity in Offline Reinforcement Learning using Sparsity

Samin Yeasar Arnob

Scott Fujimoto

Doina Precup

In this paper, we investigate the use of small datasets in the context of offline reinforcement learning (RL). While many common offline RL benchmarks employ datasets with over a million data points, many offline RL applications rely on considerably smaller datasets. We show that offline RL algorithms can overfit on small datasets, resulting in poor performance. To address this challenge, we introduce "Sparse-Reg": a regularization technique based on sparsity to mitigate overfitting in offline reinforcement learning, enabling effective learning in limited data settings and outperforming state-of-the-art baselines in continuous control.

2025-05-31

arXiv (published)

doi.org

arxiv.org

SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

Senyu Li

Jiayi Wang

Felermino Dario Mario Ali

Colin Cherry

Daniel Deutsch

Eleftheria Briakou

Rui Sousa-Silva

Henrique Lopes Cardoso

Pontus Stenetorp

David Ifeoluwa Adelani

Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.

2025-05-31

arXiv (published)

doi.org

arxiv.org

STAMP: Differentiable Task and Motion Planning via Stein Variational Gradient Descent

Yewon Lee

Philip Huang

Yizhou Huang

Krishna Murthy

Andrew Zou Li

Fabian Damken

Eric Heiden

Kevin A. Smith

D. Nowrouzezahrai

Fabio Ramos

Florian Shkurti

Carnegie Mellon University

M. I. O. Technology

Technische Universität Darmstadt

Nvidia

M. University

University of Sydney

Planning for many manipulation tasks, such as using tools or assembling parts, often requires both symbolic and geometric reasoning. Task and Motion Planning (TAMP) algorithms typically solve these problems by conducting a tree search over high-level task sequences while checking for kinematic and dynamic feasibility. While performant, most existing algorithms are highly inefficient as their time complexity grows exponentially with the number of possible actions and objects. Additionally, they only find a single solution to problems in which many feasible plans may exist. To address these limitations, we propose a novel algorithm called Stein Task and Motion Planning (STAMP) that leverages parallelization and differentiable simulation to efficiently search for multiple diverse plans. STAMP relaxes discrete-and-continuous TAMP problems into continuous optimization problems that can be solved using variational inference. Our algorithm builds upon Stein Variational Gradient Descent, a gradient-based variational inference algorithm, and parallelized differentiable physics simulators on the GPU to efficiently obtain gradients for inference. Further, we employ imitation learning to introduce action abstractions that reduce the inference problem to lower dimensions. We demonstrate our method on two TAMP problems and empirically show that STAMP is able to: 1) produce multiple diverse plans in parallel; and 2) search for plans more efficiently compared to existing TAMP baselines.

2025-05-31

IEEE Robotics and Automation Letters (published)

doi.org

openreview.net

A systematic review of hyperscanning in clinical encounters

Lena Adel

Lisane Moses

Elisabeth Irvine

Kyle T Greenway

Guillaume Dumas

Michael Lifshitz

2025-05-31

Neuroscience and Biobehavioral Reviews (published)

doi.org

ToothForge: Automatic Dental Shape Generation using Synchronized Spectral Embeddings

Tibor Kubík

François Guibault

Michal Španěl

Hervé Lombaert

We introduce ToothForge, a spectral approach for automatically generating novel 3D teeth, effectively addressing the sparsity of dental shape datasets. By operating in the spectral domain, our method enables compact machine learning modeling, allowing the generation of high-resolution tooth meshes in milliseconds. However, generating shape spectra comes with the instability of the decomposed harmonics. To address this, we propose modeling the latent manifold on synchronized frequential embeddings. Spectra of all data samples are aligned to a common basis prior to the training procedure, effectively eliminating biases introduced by the decomposition instability. Furthermore, synchronized modeling removes the limiting factor imposed by previous methods, which require all shapes to share a common fixed connectivity. Using a private dataset of real dental crowns, we observe a greater reconstruction quality of the synthesized shapes, exceeding those of models trained on unaligned embeddings. We also explore additional applications of spectral analysis in digital dentistry, such as shape compression and interpolation. ToothForge facilitates a range of approaches at the intersection of spectral analysis and machine learning, with fewer restrictions on mesh structure. This makes it applicable for shape analysis not only in dentistry, but also in broader medical applications, where guaranteeing consistent connectivity across shapes from various clinics is unrealistic. The code is available at https://github.com/tiborkubik/toothForge.

2025-05-31

arXiv (published)

doi.org

arxiv.org

Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization

Wojciech Masarczyk

Mateusz Ostaszewski

Tin Sum Cheng

Tomasz Trzciński

Aurélien Lucchi

Razvan Pascanu

The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias - a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function's logits norm, which is implicitly influenced by hyperparameters or directly modified by softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance their performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.

2025-05-31

arXiv (published)

doi.org

arxiv.org

Visual symbolic mechanisms: Emergent symbol processing in vision language models

Rim Assouel

Declan Campbell

Taylor Webb

To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this 'binding problem' via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by vision language models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a set of emergent symbolic mechanisms that support binding in VLMs via a content-independent, spatial indexing scheme. Moreover, we find that binding errors can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for addressing the persistent binding failures exhibited by these models.

2025-05-31

arXiv (published)

doi.org

arxiv.org

Weak Supervision for Real World Graphs

Pratheeksha Nair

Reihaneh Rabbany

2025-05-31

arXiv (published)

doi.org

arxiv.org

Graph Representation Learning for the Prediction of Medication Usage in the UK Biobank Based on Pharmacogenetic Variants

Bill Qi

Yannis Trakadis

2025-05-30

Bioengineering (published)

doi.org

Manifold Learning for Olfactory Habituation to Strongly Fluctuating Backgrounds

François X. P. Bourassa

Paul François

Gautam Reddy

Massimo Vergassola

Animals rely on their sense of smell to survive, but important olfactory cues are mixed with confounding background odors that fluctuate due to atmospheric turbulence. It is unclear how the olfactory system habituates to such stochastic backgrounds to detect behaviorally important odors. Here, we explicitly consider the high-dimensional nature of odor coding, the natural statistics of odor fluctuations, and the architecture of the early olfactory pathway. We show that their combination favors a manifold learning mechanism for olfactory habituation over alternatives based on predictive filtering. Manifold learning is implemented in our model by a biologically plausible network of inhibitory interneurons in the early olfactory pathway. We demonstrate that plasticity rules based on the Intrator, Bienenstock, Cooper, and Munro (IBCM) model or an online principal components analysis algorithm are effective at implementing this mechanism in turbulent conditions and outperform previous models relying on mean background subtraction. Interneurons with an IBCM plasticity rule acquire selectivity to independently varying odors. This manifold learning mechanism offers a path toward distinguishing plasticity rules in experiments and could be leveraged by other biological circuits facing fluctuating environments.

2025-05-29

bioRxiv (published)

doi.org
