Flemming Kondrup

Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback

Chain-of-thought (CoT) monitoring provides oversight into model reasoning, but its effectiveness assumes models do not know they are being w… (see more)atched. We ask whether reasoning agents can autonomously infer that their supposedly private chain of thought is under surveillance, and whether this awareness leads to strategic evasion, without any explicit training or instructions to do so. In a multi-episode agentic framework, models pursue both a primary task and a concealed side task while being told their reasoning is private; a hidden CoT monitor blocks episodes when suspicious reasoning is detected. We find that frontier models can deduce the existence of this monitor purely from blocking feedback, with the most capable models reaching confident belief that their thinking is observed in up to 19\% of episodes. This awareness scales with model capability and, in rare cases, escalates to explicit intent to suppress reasoning about the side task. However, models that form this intent uniformly fail to execute it, openly reasoning about their concealed objectives in the very next episode. This intent–capability gap is reassuring for current deployment, but the autonomous emergence of both monitoring awareness and evasion intent suggests that CoT monitoring is not a permanently reliable safeguard.

2026-02-27

CAO @ International Conference on Learning Representations (poster)

doi.org

openreview.net

Cracking the Code of Action: A Generative Approach to Affordances for Reinforcement Learning

David Venuto

Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboar… (see more)d actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through

2025-03-04

ICLR.cc/2025/Workshop/DL4C (published)

doi.org

openreview.net

Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels

The ability to plan at many different levels of abstraction enables agents to envision the long-term repercussions of their decisions and th… (see more)us enables sample-efficient learning. This becomes particularly beneficial in complex environments from high-dimensional state space such as pixels, where the goal is distant and the reward sparse. We introduce Forecaster, a deep hierarchical reinforcement learning approach which plans over high-level goals leveraging a temporally abstract world model. Forecaster learns an abstract model of its environment by modelling the transitions dynamics at an abstract level and training a world model on such transition. It then uses this world model to choose optimal high-level goals through a tree-search planning procedure. It additionally trains a low-level policy that learns to reach those goals. Our method not only captures building world models with longer horizons, but also, planning with such models in downstream tasks. We empirically demonstrate Forecaster's potential in both single-task learning and generalization to new tasks in the AntMaze domain.

2023-10-27

GenPlan @ Neural Information Processing Systems (published)

doi.org

openreview.net

Towards Safe Mechanical Ventilation Treatment Using Deep Offline Reinforcement Learning

Jacob Shkrob

My Duc Tran

Doina Precup

Sumana Basu

Mechanical ventilation is a key form of life support for patients with pulmonary impairment. Healthcare workers are required to continuously… (see more) adjust ventilator settings for each patient, a challenging and time consuming task. Hence, it would be beneficial to develop an automated decision support tool to optimize ventilation treatment. We present DeepVent, a Conservative Q-Learning (CQL) based offline Deep Reinforcement Learning (DRL) agent that learns to predict the optimal ventilator parameters for a patient to promote 90 day survival. We design a clinically relevant intermediate reward that encourages continuous improvement of the patient vitals as well as addresses the challenge of sparse reward in RL. We find that DeepVent recommends ventilation parameters within safe ranges, as outlined in recent clinical trials. The CQL algorithm offers additional safety by mitigating the overestimation of the value estimates of out-of-distribution states/actions. We evaluate our agent using Fitted Q Evaluation (FQE) and demonstrate that it outperforms physicians from the MIMIC-III dataset.

2022-10-04

ArXiv (preprint)

doi.org

arxiv.org

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Flemming Kondrup

Publications

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Popular keywords:

Flemming Kondrup

Publications