Publications

A Comedy of Estimators: On KL Regularization in RL Training of LLMs
The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite their wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators into the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimator configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning Qwen2.5-7B, Llama-3.1-8B-Instruct, and Qwen3-4B-Instruct-2507 with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) estimator configurations with unbiased gradients lead to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance of different KL configurations in off-policy settings and observe that KL regularization can help stabilize the off-policy RL training that arises in asynchronous setups.
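For context, the on-policy KL estimators discussed in this line of work are typically the single-sample estimators commonly denoted k1, k2, and k3 in open-source RL libraries. The sketch below illustrates their bias/variance behavior on a toy Gaussian pair where the true KL is known analytically; the Gaussian setup is purely illustrative and does not reproduce the paper's experiments.

```python
import numpy as np

def kl_estimators(logq, logp):
    """Single-sample estimators of KL(q || p), evaluated at samples x ~ q.
    logq/logp are the log-densities of q and p at those samples."""
    logr = logp - logq            # log importance ratio r = p(x)/q(x)
    k1 = -logr                    # unbiased, high variance, can go negative
    k2 = 0.5 * logr ** 2          # biased, but low variance
    k3 = np.expm1(logr) - logr    # (r - 1) - log r: unbiased and nonnegative
    return k1.mean(), k2.mean(), k3.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)          # samples from q = N(0, 1)
mu = 0.5                                          # p = N(0.5, 1)
logq = -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
logp = -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)
true_kl = 0.5 * mu ** 2                           # analytic KL(q || p) = 0.125
```

With a million samples, k1 and k3 concentrate around the true value 0.125, while k2 settles slightly above it, making the bias of that configuration visible directly.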
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Seijin Kobayashi
Yanick Schimpf
Maximilian Schlegel
Angelika Steger
Maciej Wolczyk
Johannes Von Oswald
Kaitlin Maile
Blake Aaron Richards
Rif A. Saurous
James Manyika
Blaise Agüera y Arcas
Alexander Meulemans
João Sacramento
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual-stream activations of a base autoregressive model. On grid-world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long chunks of activation sequences onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied by a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
Energy-Efficient Multi-LLM Reasoning for Binary-Free Zero-Day Detection in IoT Firmware
Saeid Jamshidi
Omar Abdul-Wahab
Martine Bellaiche
Securing Internet of Things (IoT) firmware remains difficult due to proprietary binaries, stripped symbols, heterogeneous architectures, and limited access to executable code. Existing analysis methods, such as static analysis, symbolic execution, and fuzzing, depend on binary visibility and functional emulation, making them unreliable when firmware is encrypted or inaccessible. To address this limitation, we propose a binary-free, architecture-agnostic solution that estimates the likelihood of conceptual zero-day vulnerabilities using only high-level descriptors. The approach integrates a tri-LLM reasoning architecture combining a LLaMA-based configuration interpreter, a DeepSeek-based structural abstraction analyzer, and a GPT-4o semantic fusion model. The solution also incorporates LLM computational signatures, including latency patterns, uncertainty markers, and reasoning depth indicators, as well as an energy-aware symbolic load model, to enhance interpretability and operational feasibility. In addition, we formally derive the mathematical foundations of the reasoning pipeline, establishing monotonicity, divergence, and energy-risk coupling properties that theoretically justify the model's behavior. Simulation-based evaluation reveals that high-exposure conditions increase the predicted zero-day likelihood by 20 to 35 percent across models, with GPT-4o demonstrating the strongest cross-layer correlations and the highest sensitivity. Energy and divergence metrics significantly predict elevated risk (p < 0.01), reinforcing the effectiveness of the proposed reasoning framework.
Hidden sampling biases inflate performance in gene regulatory network inference
Florin Ratajczak
Eva Hoermanseder
Jason Hartford
Pascal Falter-Braun
Matthias Heinig
Antonio Scialdone
Accurate reconstruction of gene regulatory networks (GRNs) from single-cell transcriptomic data remains a major methodological challenge. Recent machine learning approaches, particularly graph neural networks and graph autoencoders, have reported improved performance, yet these gains do not consistently translate to realistic biological settings. Here, we show that a key reason for this is the way negative regulatory interactions are sampled for supervised training and evaluation. We find that widely used sampling strategies introduce node-degree biases that allow models to exploit trivial graph-structural cues rather than biological signals. Across multiple benchmarks, simple degree-based heuristics match or exceed state-of-the-art graph neural network models under these biased evaluation protocols. We further introduce a degree-aware sampling approach that eliminates these artifacts and provides more reliable assessments of GRN inference methods. Our results call for standardized, bias-aware benchmarking practices to ensure meaningful progress in supervised GRN inference from single-cell RNA-seq data.
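One way to picture the degree-aware idea: draw each negative edge from the same source node as a positive edge, so that regulator degree distributions match between the two sets and a model cannot separate them on degree alone. The toy sketch below is a hypothetical illustration of that principle; the function, node names, and rejection scheme are assumptions, not the paper's exact protocol.

```python
import random

def degree_aware_negatives(pos_edges, nodes, seed=0):
    """For each positive edge (src, dst), sample a negative edge with the
    SAME source node and a random non-interacting target, so that each
    regulator appears as a source equally often in positives and negatives."""
    rng = random.Random(seed)
    pos = set(pos_edges)
    negatives = []
    for src, _ in pos_edges:
        while True:  # rejection-sample a target not already regulated by src
            dst = rng.choice(nodes)
            if dst != src and (src, dst) not in pos:
                negatives.append((src, dst))
                break
    return negatives

# Toy regulator -> target edges (hypothetical names)
pos = [("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3")]
nodes = ["TF1", "TF2", "g1", "g2", "g3", "g4"]
negs = degree_aware_negatives(pos, nodes)
```

Under naive uniform sampling, high-degree regulators dominate the positives but not the negatives, which is exactly the cue a degree heuristic exploits; matching sources per edge removes it.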
Fine-Tuned In-Context Learners for Efficient Adaptation
Clare Lyle
Yazhe Li
Amal Rannen-Triki
When adapting large language models (LLMs) to a specific downstream task, two primary approaches are commonly employed: (1) prompt engineering, often with in-context few-shot learning, leveraging the model's inherent generalization abilities, and (2) fine-tuning on task-specific data, directly optimizing the model's parameters. While prompt-based methods excel in few-shot scenarios, their effectiveness often plateaus as more data becomes available. Conversely, fine-tuning scales well with data but may underperform when training examples are scarce. We investigate a unified approach that bridges these two paradigms by incorporating in-context learning directly into the fine-tuning process. Specifically, we fine-tune the model on task-specific data augmented with in-context examples, mimicking the structure of k-shot prompts. This approach, while requiring per-task fine-tuning, combines the sample efficiency of in-context learning with the performance gains of fine-tuning, leading to a method that consistently matches, and often significantly exceeds, both baselines. To perform hyperparameter selection in the low-data regime, we propose to use prequential evaluation, which eliminates the need for expensive cross-validation and leverages all available data for training while simultaneously providing a robust validation signal. We conduct an extensive empirical study to determine which adaptation paradigm - fine-tuning, in-context learning, or our proposed unified approach - offers the best predictive performance on concrete downstream tasks.
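Prequential ("predict then update") evaluation scores each chunk of data with a model fit only on the chunks before it, then sums the losses, so every example eventually serves both as validation signal and as training data. The sketch below is a generic illustration of the idea; the chunking scheme and the toy mean-predictor are assumptions for demonstration, not the paper's recipe.

```python
def prequential_score(train_fn, eval_fn, data, n_chunks=5):
    """Sum of losses where chunk k is scored by a model trained on
    chunks 0..k-1 only.  train_fn(prefix) -> model;
    eval_fn(model, chunk) -> loss.  Lower is better."""
    size = max(1, len(data) // n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    total = 0.0
    for k in range(1, len(chunks)):
        model = train_fn([x for c in chunks[:k] for x in c])
        total += eval_fn(model, chunks[k])
    return total

# Toy usage: the "model" is just the running mean, scored by squared error.
data = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 0.95, 1.05, 1.0]
train = lambda prefix: sum(prefix) / len(prefix)
loss = lambda m, chunk: sum((x - m) ** 2 for x in chunk)
score = prequential_score(train, loss, data)
```

To select hyperparameters, one would compute this score once per candidate setting and keep the setting with the lowest total, then retrain on all the data, which is how prequential evaluation avoids holding out a fixed validation split.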
Latent brain subtypes of chronotype reveal unique behavioral and health profiles across population cohorts
Julie Carrier
Kai-Florian Storch
Robin I. M. Dunbar
Chronotype is shaped by the complex interplay of endogenous and exogenous factors. This time-enduring trait ties into societal behaviors and is linked to psychiatric and metabolic conditions. Despite its multifaceted nature, prior research has treated chronotype as a monolithic trait across the population, risking overlooking substantial heterogeneity in neural and behavioral fingerprints. To uncover hidden subgroups, we develop a supervised pattern-learning framework integrating three complementary brain-imaging modalities with deep behavioral and health profiling from 27,030 UK Biobank participants. We identify five distinct, biologically valid chronotype subtypes. Each demonstrates unique patterns across brain, behavioral and health profiles. External validation in 10,550 US children from the ABCD Study cohort reveals reversed age distributions and replicates sex-associated brain-behavioral patterns, suggesting that potential divergences between chronotype traits observed throughout adulthood may begin to emerge early in life. These findings highlight underappreciated sources of population variation that echo the rhythm of people’s inner clock.
Perspective on patient and non-academic partner engagement for the responsible integration of large language models in health chatbots
Nikhil Jaiswal
Yuanchao Ma
Bertrand Lebouché
Marie-Pascale Pomey
Sofiane Achiche
David Lessard
Kim Engler
Zully Montiel
Hector Acevedo
Rodrigo Rosa Gameiro
Leo Anthony Celi
Esli Osmanlliu
Uses of large language models (LLMs) in health chatbots are expanding into high-stakes clinical contexts, heightening the need for tools that are evidence-based, accountable, accurate, and patient-centred. This conceptual, practice-informed Perspective reflects on engaging patients and non-academic partners for the responsible integration of LLMs, grounded in the co-construction of MARVIN (for people living with HIV) and in an emerging collaboration with MIT Critical Data. Organised by the Software Development Life Cycle, we describe: conception/needs assessment with patient partners to identify use cases, acceptable trade-offs, and privacy expectations; development that prioritises grounding via vetted sources, structured human feedback, and data-validation committees including patient partners; testing and evaluation using patient-reported outcome measures (PROMs) and patient-reported experience measures (PREMs) chosen in collaboration with patients to capture usability, acceptability, trust, and perceived safety, alongside task performance and harmful-output monitoring; and implementation via diverse governance boards, knowledge-mobilisation materials to set expectations, and risk-management pathways for potentially unsafe outputs. Based on our experience with MARVIN, we recommend early and continuous engagement of patients and non-academic partners, fair compensation, shared decision-making power, transparent decision logging, and inclusive, adaptable governance that can evolve with changing models and standards. These lessons highlight how patient partnership can directly shape chatbot design and oversight, helping teams align LLM-enabled tools with patient-centred goals while building accountable, safe, and equitable systems.

Health chatbots powered by large language models (LLMs) can make medical information more accessible, but most are developed without meaningful input from the people who will use them.
This risks unsafe answers, hidden bias, and tools that mainly work for privileged groups. Our team built a chatbot called MARVIN to support people living with HIV, and we are now adapting it for cancer care and children’s health. Patients, caregivers, and community partners shaped what MARVIN should do, chose which sources it should trust, and tested early versions. Their feedback led to concrete improvements including clearer language, more relevant features, and safeguards against misinformation. We are also partnering with MIT Critical Data, which brings patients, members of the public, clinicians, engineers, and policymakers together at events to find and fix bias in medical AI. We have learned that technical fixes alone are not enough: trust, fairness, and accountability require active involvement of diverse users at every stage. Based on these lessons, we recommend: (1) including patients and non-academic partners from the start so their insights can shape core design decisions; (2) compensating them fairly so participation is sustainable; (3) giving them real decision-making power so their input is not tokenistic; and (4) being transparent about the limits of AI so expectations are realistic. In our experience, responsible health AI depends on the lived expertise of the people it serves.
The Historical Literature of Nicolae Filimon and the Reconciliation of Realism with the Tradition of the Popular Novel
A.R. Olteanu
Coord2Region: A Python Package for Mapping 3D Brain Coordinates to Atlas Labels, Literature, and AI Summaries
Yorguin-Jose Mantilla-Ramos
Sina Esmaeili
Annalisa Pascarella
Vanessa Hadid
Karim Jerbi (CoCo Lab)
We present Coord2Region, an open-source Python package that streamlines coordinate-based neuroimaging workflows by automatically mapping 3D brain coordinates (e.g., MNI or Talairach) to anatomical regions across multiple atlases. The package links mapped coordinates to meta-analytic resources via the Neuroimaging Meta-Analysis Research Environment (NiMARE), providing direct integration with Neurosynth and NeuroQuery. This directly connects coordinates and regions to the broader neuroimaging literature. In addition to atlas-based labeling and literature retrieval, Coord2Region offers an optional large language model (LLM) functionality that generates text summaries of linked studies and illustrative images of queried regions. These AI-assisted features are intended to support interpretation and exploration, while remaining clearly complementary to peer-reviewed literature and established neuroimaging tools. Coord2Region provides a unified pipeline with a robust command-line interface, flexible dataset management, and provider-agnostic LLM utilities, and it supports both single-coordinate and high-throughput batch queries with nearest-region fallback for volume and surface atlases. Furthermore, Coord2Region includes a web interface for interactive configuration (via JSON Schema forms) and cloud execution (via Hugging Face), enabling users to build YAML configurations and run analyses in-browser without local installation. Together, these capabilities lower friction, reduce manual errors, and improve reproducibility in coordinate-centric neuroimaging workflows, promoting more robust and transparent research practices.
E-RGB-D: Real-Time Event-Based Perception with Structured Light
Seyed Ehsan Marjani Bajestani
Event-based cameras (ECs) have emerged as bio-inspired sensors that report pixel brightness changes asynchronously, offering unmatched speed and efficiency in vision sensing. Despite their high dynamic range, temporal resolution, low power consumption, and computational simplicity, traditional monochrome ECs face limitations in detecting static or slowly moving objects and lack color information essential for certain applications. To address these challenges, we present a novel approach that integrates a Digital Light Processing (DLP) projector with an EC, forming Active Structured Light (ASL) for RGB-D sensing. By combining the benefits of ECs and projection-based techniques, our method enables the detection of the color and depth of each pixel separately. Dynamic projection adjustments optimize bandwidth, ensuring selective color data acquisition and yielding colorful point clouds without sacrificing spatial resolution. This integration, facilitated by a commercial TI LightCrafter 4500 projector and a monocular monochrome EC, not only enables frameless RGB-D sensing applications but also achieves remarkable performance milestones. With our approach, we achieved a color-detection speed equivalent to 1400 fps and pixel-depth detection at 4 kHz, significantly advancing computer vision across diverse fields, from robotics to 3D reconstruction. Our code is publicly available: https://github.com/MISTLab/event_based_rgbd_ros
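Projector-camera structured-light systems like the one described recover depth by triangulation: a projected pattern feature observed with disparity d in the camera, given baseline b and focal length f, lies at depth Z = b * f / d. The sketch below shows only this generic textbook relation; the baseline, focal length, and disparity values are illustrative assumptions, not the paper's calibration.

```python
def depth_from_disparity(baseline_m, focal_px, disparity_px):
    """Pinhole triangulation for a rectified projector-camera pair:
    depth Z (meters) = baseline (meters) * focal length (pixels)
                       / disparity (pixels)."""
    return baseline_m * focal_px / disparity_px

# e.g., a 10 cm projector-camera baseline, a 1000 px focal length,
# and a 50 px disparity place the point 2 m away
z = depth_from_disparity(0.10, 1000.0, 50.0)
```

The relation also shows why depth resolution degrades with range: at large Z the disparity shrinks, so a one-pixel disparity error corresponds to a larger depth error.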
Responsible AI measures dataset for ethics evaluation of AI systems
Meaningful governance of any system requires the system to be assessed and monitored effectively. In the domain of Artificial Intelligence (AI), global efforts have established a set of ethical principles, including fairness, transparency, and privacy, upon which AI governance expectations are being built. The computing research community has proposed numerous means of measuring an AI system's normative qualities along these principles. Current reporting of these measures is principle-specific, limited in scope, or otherwise dispersed across publication platforms, hindering the domain's ability to critique its practices. To address this, we introduce the Responsible AI Measures Dataset, consolidating 12,067 data points across 791 evaluation measures covering 11 ethical principles. It is extracted from a corpus of computing literature (n = 257) published between 2011 and 2023. The dataset includes detailed descriptions of each measure, AI system characteristics, and publication metadata. An accompanying interactive visualization tool supports usability and interpretation of the dataset. The Responsible AI Measures Dataset enables practitioners to explore existing assessment approaches and critically analyze how the computing domain measures normative concepts.