Publications

Training Diffusion Language Models for Black-Box Optimization

Jiayao Gu

Xue Liu

We study offline black-box optimization (BBO), aiming to discover improved designs from an offline dataset of designs and labels, a problem common in robotics, DNA, and materials science with limited labeled samples. While recent work applies autoregressive LLMs to BBO by formatting tasks as natural-language prompts, their left-to-right design generation struggles to capture the strong bidirectional dependencies inherent in design problems. To address this, we propose adapting diffusion LLMs to offline BBO to leverage their bidirectional modeling capabilities. However, a domain gap exists between the natural text pre-training of diffusion LLMs and the heterogeneous signals in BBO (prompts, designs, and labels). To bridge this gap, we construct a unified prompt–response corpus and introduce delimiter tokens to explicitly mark field boundaries for *domain adaptation*. We further propose a two-stage *post-training* framework to align the diffusion LLM generation with high-label designs. The first stage performs supervised fine-tuning on the unified dataset via masked-response prediction, and the second stage adopts reinforcement learning with rewards defined by label improvements. Our method achieves state-of-the-art results on Design-Bench small-data settings. Code for our work is available here: https://anonymous.4open.science/r/Anonymous-dllm4bbo-D78A/README.md.

2025-12-31

International Conference on Machine Learning (Accept (spotlight))

doi.org

openreview.net

TRecViT: A Recurrent Video Transformer

Viorica Patraucean

Xu Owen He

Joseph Heyward

Chuhan Zhang

Mehdi S. M. Sajjadi

George-Cristian Muraru

Artem Zholus

Mahdi Karami

Ross Goroshin

Yutian Chen

Simon Kayode Osindero

João Carreira

Razvan Pascanu

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having

2025-12-31

Trans. Mach. Learn. Res. (published)

doi.org

arxiv.org

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Erfan Miahi

Eugene Belilovsky

Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. By exploiting this structure, PULSE enables decentralized RL training to approach centralized throughput, reducing the bandwidth required for weight synchronization from 20 Gbit/s to 0.2 Gbit/s to maintain high GPU utilization.

2025-12-31

arXiv (preprint)

doi.org

arxiv.org
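
The PULSE abstract above describes sending only the indices and values of parameters that changed, and overwriting them on the receiver rather than adding deltas. The following is a minimal sketch of that idea, not the authors' implementation: the function names `encode_patch` and `apply_patch`, the flat list-of-floats weight layout, and the toy values are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical patch format: one (index, new_value) pair per changed weight.
Patch = List[Tuple[int, float]]


def encode_patch(old: List[float], new: List[float]) -> Patch:
    """Collect (index, new_value) for every parameter whose value changed."""
    if len(old) != len(new):
        raise ValueError("weight vectors must have the same length")
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]


def apply_patch(weights: List[float], patch: Patch) -> List[float]:
    """Overwrite patched positions; values are copied, never added."""
    updated = list(weights)
    for index, value in patch:
        updated[index] = value
    return updated


# Toy example (made-up numbers): only two of eight weights change in a step.
old_weights = [0.1, -0.25, 0.5, 0.75, 1.0, -1.5, 2.0, 0.0]
new_weights = list(old_weights)
new_weights[2] = 0.5000001
new_weights[6] = 1.9999999

patch = encode_patch(old_weights, new_weights)
restored = apply_patch(old_weights, patch)
sparsity = 1 - len(patch) / len(old_weights)

print(f"changed entries: {len(patch)}, update sparsity: {sparsity:.2%}")
print("receiver matches trainer exactly:", restored == new_weights)
```

Because the receiver copies the transmitted values instead of computing `old + delta`, its weights equal the trainer's exactly in this sketch, which is the property the abstract contrasts with additive delta schemes.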

Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters

Takashi Morita

Timothy J. O’Donnell

Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure (double-object datives) is predominantly associated with Germanic verbs rather than Latinate verbs. From the perspective of language acquisition, however, such etymology-based generalizations raise learnability concerns, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed an unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our model also uncovered previously unrecognized features of the quasi-etymological clusters. Taken together with prior results from Japanese, our findings indicate that the proposed method provides a general, cross-linguistic approach to discovering etymological structure from phonotactic cues in the lexicon.

2025-12-31

Open Mind: Discoveries in Cognitive Science (published)

doi.org

arxiv.org

Using an eye tracker to capture reading skills as measured by a digital adaptation of TOWRE-2

Maria Cutumisu

Krystle-Lee Turgeon

2025-12-31

Educ. Inf. Technol. (published)

doi.org

Zero-Shot NAS for TinyML Semantic Segmentation via Weight Sharing

Zhuoran Xiong

Warren J. Gross

Brett Meyer

2025-12-31

IEEE Embedded Systems Letters (published)

doi.org

Family caregivers' acceptance of Artificial Intelligence-enabled technologies for providing care to older adults

Amanda Yee

Mark J. Yaffe

Tibor Schuster

Sylvie Lambert

Samira Abbasgholizadeh-Rahimi

Artificial intelligence (AI)-enabled technologies hold promise for assisting in the care of an aging population. Few studies have focused on exploring family caregivers’ (FCGs) behavioural intention of using such innovation, and even fewer have employed a technology acceptance framework. This study examined FCGs of older adults’ behavioural intention of using AI-enabled technologies for caregiving. We conducted a theory-based cross-sectional quantitative survey. Eligible FCGs for this study were: (1) aged 45–64; (2) residing in Quebec, Canada; (3) providing care for at least one older adult (65+); (4) having access to a computer or smartphone with internet connectivity; and (5) having proficiency in reading and comprehending English or French. We adapted and expanded the Unified Theory of Acceptance and Use of Technology (UTAUT) framework to measure their behavioural intention of using AI-enabled technologies for caregiving. We used descriptive statistics and a random forest model to assess the most important predictive factors across nine variables and their direction of association with behavioural intention. The Consensus-Based Checklist for Reporting of Survey Studies (CROSS) guidelines were used for reporting the study’s results. Among the polling firm’s 100,000 panelists, 2740 eligible individuals were randomly chosen to receive an email invitation to the study. Of 465 panelists who opened the survey (i.e., unique visitors), 199 were eligible and completed the online survey. The random forest model explained between 56% and 86% of the behavioural intention variance of using AI, with social influence demonstrating the highest predictive relevance as indicated by a 35% increase in mean-squared error once removed from the model. Among the nine variables considered, six demonstrated a positive association with behavioural intention. These variables included social influence, effort expectancy, performance expectancy, perceived trust, confidence in healthcare professionals’ advice for the use of AI-enabled technologies, and facilitating conditions. The variables perceived cost and technology anxiety indicated a negative association with behavioural intention. Our extended UTAUT model identified factors associated with FCGs' intention to use AI. While all nine variables contributed, attitudes toward AI within caregivers’ social circles were the strongest predictor. Stakeholders from industry, government, and healthcare can enhance the adoption of AI-enabled technologies in older adult care by leveraging facilitators and addressing barriers experienced by caregivers.

2025-12-30

BMC Geriatrics (published)

doi.org

On the geometry and topology of representations: the manifolds of modular addition

Gabriela Moisescu-Pareja

Colin Daniels

Jonathan Love

The Clock and Pizza interpretations, associated with architectures differing in either uniform or learnable attention, were introduced to argue that different architectural designs can yield distinct circuits for modular addition. In this work, we show that this is not the case, and that both uniform attention and trainable attention architectures implement the same algorithm via topologically and geometrically equivalent representations. Our methodology goes beyond the interpretation of individual neurons and weights. Instead, we identify all of the neurons corresponding to each learned representation and then study the collective group of neurons as one entity. This method reveals that each learned representation is a manifold that we can study utilizing tools from topology. Based on this insight, we can statistically analyze the learned representations across hundreds of circuits to demonstrate the similarity between learned modular addition circuits that arise naturally from common deep learning paradigms.

2025-12-30

arXiv (preprint)

doi.org

openreview.net

Combining Constraint Programming and Machine Learning: From Current Progress to Future Opportunities

Quentin Cappart

Tias Guns

Michele Lombardi

Gilles Pesant

Dimos Tsouros

The integration of constraint programming (CP) together with machine learning (ML) has emerged as a promising direction for tackling complex decision-making and combinatorial optimization problems. While CP offers expressive modeling capabilities and formal guarantees, ML provides adaptive methods for learning from data and generalizing across instances. This survey presents a comprehensive overview of recent advances in combining CP and ML. We first show how ML has been used to improve the CP toolbox, both in modeling and in the efficiency of solving. Then, we examine how CP can support ML, particularly in providing structure, guarantees, and symbolic reasoning capabilities. Finally, we identify key open challenges inherent to such hybrid approaches and outline promising directions for future research. This survey provides a first conceptual and structured review of recent advancements in this emerging field, aiming to serve as a resource for practitioners and researchers in both the CP and ML communities. To keep the progress up to date, a curated list of references is hosted on an accompanying repository (https://github.com/corail-research/CPML-paper-list) and is open to community contributions.

2025-12-29

Journal of Artificial Intelligence Research (published)

doi.org

MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling

Mahdi Karami

Ali Behrouz

Peilin Zhong

Razvan Pascanu

Seyed Vahab Mirrokni

State-space models (SSMs) have recently gained attention as an efficient alternative to computationally expensive attention-based models for sequence modeling. They rely on linear recurrences to integrate information over time, enabling fast inference, parallelizable training, and control over recurrence stability. However, traditional SSMs often suffer from limited effective memory, requiring larger state sizes for improved recall. Moreover, existing SSMs struggle to capture multi-scale dependencies, which are essential for modeling complex structures in time series, images, and natural language. This paper introduces a multi-scale SSM framework that addresses these limitations by representing sequence dynamics across multiple resolutions and processing each resolution with specialized state-space dynamics. By capturing both fine-grained, high-frequency patterns and coarse, global trends, MS-SSM enhances memory efficiency and long-range modeling. We further introduce an input-dependent scale-mixer, enabling dynamic information fusion across resolutions. The proposed approach significantly improves sequence modeling, particularly in long-range and hierarchical tasks, while maintaining computational efficiency. Extensive experiments on benchmarks, including Long Range Arena, hierarchical reasoning, time series classification, and image recognition, demonstrate that MS-SSM consistently outperforms prior SSM-based models, highlighting the benefits of multi-resolution processing in state-space architectures.

2025-12-28

arXiv (preprint)

doi.org

arxiv.org

Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

Armstrong Foundjem

Lionel Nganyewou Tidjon

Leuson Da Silva

Foutse Khomh

2025-12-28

arXiv (preprint)

doi.org

arxiv.org

Probabilistic Modelling is Sufficient for Causal Inference

Bruno Mlodozeniec

David S. Krueger

Richard E. Turner

Causal inference is a key research area in machine learning, yet confusion reigns over the tools needed to tackle it. There are prevalent claims in the machine learning literature that you need a bespoke causal framework or notation to answer causal questions. In this paper, we want to make it clear that you *can* answer any causal inference question within the realm of probabilistic modelling and inference, without causal-specific tools or notation. Through concrete examples, we demonstrate how causal questions can be tackled by writing down the probability of everything. Lastly, we reinterpret causal tools as emerging from standard probabilistic modelling and inference, elucidating their necessity and utility.

2025-12-28

arXiv (preprint)

doi.org

arxiv.org
