Publications
Mitigating Disparate Impact of Differential Privacy in Federated Learning through Robust Clustering
Federated Learning (FL) is a decentralized machine learning (ML) approach that keeps data localized and often incorporates Differential Privacy (DP) to enhance privacy guarantees. Similar to previous work on DP in ML, we observed that differentially private federated learning (DPFL) introduces performance disparities, particularly affecting minority groups. Recent work has attempted to address performance fairness in vanilla FL through clustering, but this method remains sensitive and prone to errors, which are further exacerbated by the DP noise in DPFL. To fill this gap, we propose a novel clustered DPFL algorithm designed to effectively identify clients' clusters in highly heterogeneous settings while maintaining high accuracy under DP guarantees. To this end, we cluster clients based on both their model updates and their training loss values. Our approach also addresses the server's uncertainty in clustering clients' model updates by employing larger batch sizes along with a Gaussian Mixture Model (GMM) to alleviate the impact of noise and potential clustering errors, especially in privacy-sensitive scenarios. We provide a theoretical analysis of the effectiveness of our proposed approach. We also extensively evaluate it across diverse data distributions and privacy budgets and show its effectiveness in mitigating the disparate impact of DP in FL settings at a small computational cost.
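The core clustering idea, assigning clients to groups from their noisy model updates via a Gaussian mixture, can be sketched with a toy setup. Everything below (the two synthetic client populations, the noise scale, the minimal EM routine) is illustrative and not the paper's actual algorithm or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not the paper's data): two client populations whose
# local model updates differ, each observed through Gaussian DP noise.
centers = np.array([[2.0, 2.0], [-2.0, -2.0]])
true_labels = rng.integers(0, 2, size=200)
updates = centers[true_labels] + rng.normal(scale=0.8, size=(200, 2))

def em_gmm(x, iters=50):
    """Minimal EM for a 2-component spherical Gaussian mixture."""
    # Farthest-point initialization keeps the two starting means separated.
    mu0 = x[0]
    mu1 = x[((x - mu0) ** 2).sum(1).argmax()]
    mu = np.stack([mu0, mu1])
    for _ in range(iters):
        # E-step: soft responsibilities from squared distances to each mean.
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-0.5 * (d2 - d2.min(1, keepdims=True)))
        resp /= resp.sum(1, keepdims=True)
        # M-step: responsibility-weighted means.
        mu = (resp[:, :, None] * x[:, None, :]).sum(0) / resp.sum(0)[:, None]
    return mu, resp.argmax(1)

mu, assign = em_gmm(updates)  # recovered cluster means and assignments
```

The soft E-step responsibilities are what make a GMM more robust here than hard assignment: a noisy update near the boundary contributes partially to both cluster means rather than being forced into one.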
In silico neutron relative biological effectiveness estimations for pre-DNA repair and post-DNA repair endpoints
Nicolas Desjardins
J. Kildea
A comprehensive understanding of the energy-dependent stochastic risks associated with neutron exposure is crucial to develop robust radioprotection systems. However, the scarcity of experimental data presents significant challenges in this domain. Track-structure Monte Carlo (TSMC) simulations with DNA models have demonstrated their potential to further our fundamental understanding of neutron-induced stochastic risks. To date, most TSMC studies on the relative biological effectiveness (RBE) of neutrons have focused on various types of DNA damage clusters defined using base pair distances. In this study, we extend these methodologies by incorporating the simulation of non-homologous end joining DNA repair in order to evaluate the RBE of neutrons for misrepairs. To achieve this, we adapted our previously published Monte Carlo DNA damage simulation pipeline, which combines condensed-history and TSMC methods, to support the standard DNA damage data format. This adaptation enabled seamless integration of neutron-induced DNA damage results with the DNA mechanistic repair simulator toolkit. Additionally, we developed a clustering algorithm that reproduces pre-repair endpoints studied in prior works, as well as novel damage clusters based on Euclidean distances. The neutron RBE for misrepairs obtained in this study exhibits a qualitatively similar shape to the RBE obtained for previously reported pre-repair endpoints. However, it peaks higher, reaching a maximum RBE value of 23(1) at a neutron energy of 0.5 MeV. Furthermore, we found that misrepair outcomes were better reproduced using the pre-repair endpoint defined with the Euclidean distance between double-strand breaks rather than with previously published pre-repair endpoints based on base-pair distances. The optimal maximal Euclidean distances were 18 nm for 0.5 MeV neutrons and 60 nm for 250 keV photons.
Although this may indicate that Euclidean-distance-based clustering more accurately reflects the DNA damage configurations that lead to misrepairs, the fact that neutrons and photons require different distances raises doubts about whether a single, universal pre-repair endpoint can be used as a stand-in for larger-scale aberrations across all radiation qualities.
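The Euclidean-distance endpoint amounts to single-linkage grouping: two double-strand breaks belong to the same damage cluster whenever their separation is within the maximal distance (18 nm for 0.5 MeV neutrons in the study). A minimal sketch with made-up DSB coordinates and a union-find pass:

```python
import numpy as np

# Illustrative DSB coordinates in nanometres (made up, not simulation output).
dsb_xyz = np.array([
    [0.0, 0.0, 0.0],
    [10.0, 5.0, 0.0],   # ~11.2 nm from the first break: same cluster
    [100.0, 0.0, 0.0],  # far from both: its own cluster
])
d_max = 18.0  # maximal Euclidean distance for 0.5 MeV neutrons (from the study)

def cluster_dsbs(points, d_max):
    """Single-linkage clustering: breaks within d_max share a cluster."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= d_max:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]

labels = cluster_dsbs(dsb_xyz, d_max)
```

Note the transitivity this implies: two breaks more than d_max apart still merge if a chain of intermediate breaks connects them, which is characteristic of single-linkage endpoints.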
Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale benchmark dataset for this problem, leveraging medium-resolution multi-spectral satellite imagery from Planet Labs. Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient and accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.
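The calibration target implied by "balancing conformational states based on their energy differences" is the Boltzmann weighting exp(-E_i/kT). A minimal numerical sketch, with made-up state energies already expressed in units of kT (this is the physical target, not the paper's EBA training objective):

```python
import numpy as np

# Three hypothetical conformational states with made-up energies (kT units).
E = np.array([0.0, 1.0, 2.5])

# Boltzmann weights: lower-energy states should receive more probability mass.
weights = np.exp(-E)
target = weights / weights.sum()  # the distribution a calibrated model targets
```

Only energy *differences* matter here: shifting all entries of `E` by a constant leaves `target` unchanged, which is why alignment from pairwise energy differences is sufficient.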
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
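The additive-energy structure is what makes extrapolation possible: the joint energy of an attribute combination is the sum of per-attribute terms, so a combination never seen jointly at training still has a well-defined score. A toy numpy sketch with two attributes and made-up linear energy heads (illustrative of the additivity only, not of CRM's adjustment step):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)             # a feature vector
W_shape = rng.normal(size=(2, 4))  # hypothetical energy head: shape attribute
W_color = rng.normal(size=(2, 4))  # hypothetical energy head: color attribute

E_shape = W_shape @ x              # one energy per shape value
E_color = W_color @ x              # one energy per color value

# Additivity: joint energy of combination (s, c) is E_shape[s] + E_color[c],
# including combinations (s, c) absent from the training distribution.
E = E_shape[:, None] + E_color[None, :]
probs = np.exp(-E) / np.exp(-E).sum()  # softmax over all 4 combinations
```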
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional "corrector" steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.
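The SMC ingredient can be illustrated in isolation: particles carry weights from a potential, then are resampled in proportion to those weights so that particle mass shifts toward the reweighted distribution. The Gaussian prior and toy log-potential below are stand-ins, not the FKC weights derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
particles = rng.normal(size=1000)        # draws from a N(0, 1) "prior"

# Toy log-potential favoring x near 1 (illustrative; the product of the
# prior and this potential is a Gaussian centered at 0.5).
log_w = -0.5 * (particles - 1.0) ** 2
w = np.exp(log_w - log_w.max())          # subtract max for numerical stability
w /= w.sum()

# Multinomial resampling: duplicate high-weight particles, drop low-weight ones.
idx = rng.choice(len(particles), size=len(particles), p=w)
resampled = particles[idx]               # mass shifts toward x around 0.5
```

Resampling keeps the particle population unweighted between steps, which is what lets a sequence of such corrections track each intermediate distribution rather than only the final one.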
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
Both PAC-Bayesian and Sample Compress learning frameworks have been shown to be instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction in order to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.
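The encode-then-decode hypernetwork shape, a dataset compressed into a fixed-size embedding, then decoded into the weights of a downstream predictor, can be sketched in a few lines. All maps below (the label-weighted mean encoding, the linear decoder) are made-up stand-ins for the learned encoders and decoders in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))   # a small dataset input (features)
y = np.sign(X[:, 0])           # toy labels determined by the first feature

# Encoder: a permutation-invariant dataset summary (label-weighted feature
# mean). Real encoders here would be PAC-Bayesian or Sample Compress maps.
embed = (y[:, None] * X).mean(axis=0)

# Decoder: a made-up linear map from the embedding to predictor weights.
W_dec = np.eye(5) + 0.1 * rng.normal(size=(5, 5))
w = W_dec @ embed              # parameters of the downstream linear predictor

preds = np.sign(X @ w)         # downstream predictor applied to the data
```

The point of the architecture is that everything the decoder knows about the dataset passes through `embed`; that narrow junction is what the generalization bounds in the paper analyze.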
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)
We propose a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradient-free approach preserves GRF's theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method achieves a several-fold speedup over standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications.
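The Jacobian-free idea can be shown on the simplest possible estimating equation. A Newton step for solving E[psi(theta)] = 0 would require the Jacobian of psi; a damped fixed-point iteration needs only psi itself. The one-dimensional example below (psi(y, theta) = y - theta, whose solution is the sample mean) is a deliberately trivial stand-in for the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=500)  # made-up observations

# Solve mean(y - theta) = 0 by damped fixed-point iteration: no Jacobian,
# just repeated evaluation of the estimating function itself.
theta = 0.0
for _ in range(100):
    theta = theta + 0.5 * np.mean(y - theta)

# theta converges to the sample mean (contraction factor 0.5 per step)
```

Each step multiplies the error by 0.5, so the iteration converges geometrically; in higher dimensions the same structure avoids estimating and inverting a Jacobian at every candidate split.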
2025-10-05
Proceedings of the 42nd International Conference on Machine Learning (published)